Article

Development of a Novel Lightweight CNN Model for Classification of Human Actions in UAV-Captured Videos

by Nashwan Adnan Othman 1,2,* and Ilhan Aydin 2
1 Department of Computer Engineering, College of Engineering, Knowledge University, Erbil 44001, Iraq
2 Department of Computer Engineering, Firat University, Elazig 23200, Turkey
* Author to whom correspondence should be addressed.
Drones 2023, 7(3), 148; https://doi.org/10.3390/drones7030148
Submission received: 30 December 2022 / Revised: 9 February 2023 / Accepted: 19 February 2023 / Published: 21 February 2023

Abstract

There has been increased attention paid to autonomous unmanned aerial vehicles (UAVs) recently because of their usage in several fields. Human action recognition (HAR) in UAV videos plays an important role in various real-life applications. Although HAR using UAV frames has not received much attention from researchers to date, it remains a significant area of study because of its relevance to the development of efficient algorithms for autonomous drone surveillance. Current deep-learning models for HAR have limitations, such as large numbers of weight parameters and slow inference speeds, which make them unsuitable for practical applications that require fast and accurate detection of unusual human actions. In response to this problem, this paper presents a new lightweight deep-learning model, called HarNet, based on depthwise separable convolutions; its other components comprise convolutional, rectified linear unit, dropout, pooling, padding, and dense blocks. The effectiveness of the model was tested on the publicly available UCF-ARG dataset, with each frame pre-processed by several computer vision methods before being passed to HarNet. The proposed model, which has a compact architecture with just 2.2 million parameters, obtained a 96.15% classification success rate, outperforming the MobileNet, Xception, DenseNet201, Inception-ResNetV2, VGG-16, and VGG-19 models on the same dataset. The proposed model offers several key advantages, including low complexity, a small number of parameters, and high classification performance. The results show that its performance is superior to that of other models evaluated on the UCF-ARG dataset.

1. Introduction

Autonomous unmanned aerial vehicles (UAVs) have become more popular recently because they can be used for many different tasks and are capable of monitoring remote areas. Vision-equipped UAVs are becoming ever more common and are now used in numerous fields. Their small size, ease of use, and ability to be programmed for autonomous flight make them highly efficient for a variety of tasks and applications, and they are now actively used for security, agriculture monitoring, traffic management, search and rescue, and pollution monitoring. At the same time, unusual human actions can cause serious safety problems [1,2]; terrorist organizations can direct such illegal actions against key security forces in private and sensitive areas. The use of UAVs equipped with deep learning and computer vision capabilities, however, has made it possible to remotely monitor and enforce safety measures in urban environments. These capabilities are particularly useful for improving the security of smart cities. Using an onboard camera, UAVs can detect and track the movements of individuals, thereby assisting police officers in maintaining security in smart cities. By utilizing UAVs in combination with other technologies, such as forensic mapping software, video streaming, secure wireless communications, and abnormal human action recognition, it is now possible to create a safer urban environment for residents [3,4].
Human action recognition (HAR) is a rapidly evolving field within machine learning and deep learning, with key applications in security, robotics, sports, and healthcare. This technology has the potential to detect falls in older individuals and identify unusual actions [5,6,7]. The ability to recognize the actions of others is a central focus of research in machine learning and computer vision. There are several applications for this technology, including the classification of human behavior for use in robotics, video surveillance systems, and human–computer interaction [5]. The identification of human activities, particularly in video taken by UAVs, has garnered significant attention from researchers. HAR is a vital aspect of human-to-human communication and personal relations, and it is now an active research area with potentially useful applications in a variety of fields [8,9].
Despite its potential, identifying human actions from video frames taken by UAVs remains challenging due to perspective changes, dynamic and complex backgrounds, camera height, human parallax, and other platform-related issues [10]. Current HAR systems therefore exhibit certain limitations, with some actions having low recognition rates. HAR systems perform complex tasks that involve detecting and analyzing physical actions, as well as determining a person's identity, personality, and psychological state. Further study is essential to improve accuracy and expand the range of detectable actions. We expect UAV-based HAR to become a feasible solution as high-performance computing technology becomes capable of analyzing large quantities of visual data, enabling UAVs to recognize human actions quickly and accurately in real time. After reviewing the empirical research in the literature, we have determined that HAR methods can be executed efficiently on board UAVs; nonetheless, several challenges must be addressed in order to fully realize the potential of these technologies [10].
Today, the implementation of modern smart city infrastructure has become a global trend, with the aim of improving the quality of life and safety of citizens through the use of advanced technologies. One technology that has proven effective in achieving the goals of smart cities is the UAV. In conjunction with other innovative technologies, UAVs can provide cost-effective solutions for delivering efficient infrastructure and services in smart cities. The ultimate goal of intelligent city design is to reduce costs while simultaneously improving residents' standard of living [3]. Integrating technologies such as HAR into UAVs can enhance the safety and security of smart cities. By using HAR to monitor and analyze human activities in UAV video frames, it is possible to identify and respond to unusual or potentially dangerous situations in real time. Additionally, HAR can be utilized to orient and guide drones in intelligent city environments. Overall, the integration of HAR into UAVs offers a powerful solution for creating safer and more secure smart cities.
Deep learning models have gained popularity in recent years for their use in intelligent applications on UAVs [4]. This is because deep learning models provide faster and more precise outcomes than traditional machine learning approaches for tasks such as HAR. Deep learning is a type of machine learning that uses artificial neural networks to analyze and classify patterns in data. These neural networks are composed of input, hidden, and output layers that work together to extract meaningful information and make predictions or classifications. There are several types of deep learning algorithms, including convolutional neural networks (CNNs), deep neural networks (DNNs), restricted Boltzmann machines (RBMs), recurrent neural networks (RNNs), deep belief networks (DBNs), deep Boltzmann machines (DBMs), and auto-encoders. CNNs are particularly useful for processing images, as they are designed to extract visual patterns directly from image pixels with minimal pre-processing [11]. CNNs were developed by Yann LeCun in 1990 and are a type of neural network that learns from training data using a backpropagation algorithm [12]. Additionally, CNNs have feature extraction and summarization layers, which contribute to their superior performance compared with other classification algorithms [12,13,14,15]. In this study, deep learning techniques including CNNs are used in the first and second stages of the research for human detection and action recognition.
This approach uses a pre-trained deep learning model for human detection and a low-dimensional CNN structure for HAR. The CNN includes a depthwise separable convolutional layer to maintain a high level of accuracy and reduce the number of trainable parameters. The main contributions of this paper are as follows:
  • Efficient detection of human actions in UAV frames using the proposed lightweight CNN model built on depthwise separable convolutions.
  • An approach suitable for implementation on multiple embedded systems deployed in specific and sensitive areas.
  • A new method that can be executed in real time for human action monitoring, making the city smarter at low cost.
  • Recognition of human actions for military purposes or for smart city and search and rescue processes, with results that the relevant central authorities can easily monitor on their smartphones.
  • A method that improves the performance of UAV-based HAR for search and rescue processes, allowing more effective and precise calculations than existing methods in the literature.
The remainder of the text is organized as follows: In Section 2 of this study, we review research related to using UAV frames for human action recognition. In Section 3, we comprehensively describe the steps of our proposed approach. Section 4 presents and compares the experimental results with other leading-edge methods. Finally, in Section 5, we provide concluding remarks and directions for future research.

2. Review of Related Works

The following section presents various methods that have been developed for HAR using UAVs. A comparison of these methods is provided in Table 1.
Hazar Mliki et al. [16] recently proposed a framework for recognizing human activity using UAVs. The approach involves two phases: an offline phase, in which a pre-trained CNN is used to create models for human identification and human action, and an inference phase, in which these models are used to detect humans and their actions in a scene. To improve the accuracy of the model, the authors used scene-stabilizing pre-processing to identify potential activity areas and extracted spatial features to create a human action model. They used the GoogLeNet architecture, which includes multiple convolutions of various sizes and a pooling layer in place of a fully connected layer, to strike a balance between calculation time and classification error rate. To adapt the GoogLeNet architecture for their HAR system, they added an additional softmax layer with the number of neurons corresponding to the number of human actions. They tested the performance of the HAR approach on the UCF-ARG [17] dataset and found the precision rate for the test set to be 68%.
Waqas Sultani et al. [18] introduced two new action datasets in their research: a game-action dataset consisting of seven human actions with 100 aerial video pairs, and a real aerial dataset consisting of eight actions from UCF-ARG. They applied disjoint multitask learning (DML) across the game dataset, aerial video footage generated with generative adversarial networks (GANs), and real aerial video footage. They extracted deep features from a small number of real aerial videos and gameplay videos using a 3D CNN and used a GAN to generate aerial features for the game dataset [22]. The network for their approach included two fully connected layers shared among all tasks and one fully connected layer for each task. The authors found that combining video game and GAN-generated action samples within a DML structure can improve activity classification accuracy. They trained each part of the network for classification using a softmax activation function and cross-entropy loss.
In ref. [19], a drone was utilized to capture 13 dynamic human activities, resulting in 240 high-definition video clips totaling 44.6 min and 66,919 UAV frames. The dataset was collected at a low altitude and low speed in order to capture detailed information about human positions. To analyze the dataset, the authors used two feature types: Pose-based CNN (P-CNN) [23] and High-Level Pose Features (HLPF) [24]. P-CNN was the primary method for action recognition and utilized CNN features of body parts extracted from predicted postures. HLPF, on the other hand, recognizes activity classes based on the temporal relationships and ranges of body joints, which were calculated using 15 key points on the body. The overall baseline HAR accuracy using P-CNN was 75.92%. In addition to evaluating the performance of their activity recognition approach, the authors also compared their results and experimental details to those of other available datasets for HAR.
In the study by Kotecha et al. [20], an architecture for motion feature modeling is proposed that is designed to be resistant to background effects, making it useful for disaster and search and rescue applications. The architecture comprises two modules: faster motion feature modeling (FMFM), which models temporal features and creates background-invariant clips as inputs for the second module, accurate action recognition (AAR). Additionally, a new dataset captured from top-view aerial surveillance with a wide range of actors, environments, and times of day is proposed. The study reports a model for temporal action recognition with 90% accuracy.
Liu et al. [21] developed a human detection and gesture recognition system for rescue in UAVs. The system uses a UAV equipped with a 640 × 480 resolution camera to locate a human at a distance and sends a notification to enter the recognition stage when that person is detected. The dataset consists of 10 actions, including standing, kicking, sitting, punching, and squatting, all recorded by the drone’s cameras. The two most important dynamic gestures are “new dynamic attention” and “cancel,” which are used to establish and reset a connection with the UAV, respectively. If “cancel” is recognized, the system will shut off; if “new dynamic attention” is recognized, the user can establish a connection with the UAV. When a person comes into view and makes a dynamic gesture, the system enters the last phase of hand gesture identification to help the user. If the user’s rescue motion is recognized as a warning, the drone will approach the user to identify the hand gestures.
The literature review presents several previous studies on human activity recognition using UAVs, including approaches that use pre-trained CNNs, scene-stabilizing pre-processing, and various architectures such as GoogLeNet, 3D CNN, P-CNN, and SSD. These studies achieved varying levels of accuracy, ranging from 68% using GoogLeNet on the UCF-ARG dataset to 99.80% for low-altitude gesture recognition. The method proposed in this paper is a new lightweight deep-learning model based on depthwise separable convolutions, called HarNet, which was tested on the publicly available UCF-ARG dataset. It achieved a 96.15% classification success rate, outperforming other models such as MobileNet, Xception, DenseNet201, Inception-ResNet-V2, VGG-16, and VGG-19 on the same dataset, while offering low complexity, a small number of parameters, and high classification performance.

3. Methodology

In this study, we present a computer vision system that is capable of detecting humans and recognizing their actions from frames captured by UAVs. The proposed approach aims to enhance the accuracy of human activity classification in the context of HAR from UAV-captured video frames. The developed method can be used in smart city and military environments to detect different human actions, including unusual actions, with a low-computational-cost drone using an embedded platform such as the Jetson. In addition, the proposed architecture and dataset for human detection and action recognition can be utilized to identify situations where humans are seeking assistance. The dataset first needs to be pre-processed using scene stabilization; human detection is then performed with deep learning methods, and a CNN architecture is used to train the models that classify human actions. The proposed approach thus integrates three steps: pre-processing of the dataset, human cropping, and human action model generation. The dataset was pre-processed in the first step; each human in the frames was detected and cropped in the second step; and in the third step, human action classification was performed using a deep CNN architecture. The general block diagram of the proposed HAR approach is shown in Figure 1.

3.1. Proposed Dataset

In this paper, we utilized the UCF-ARG dataset, a collection of multi-view videos of human actions performed by 12 individuals. Each action was performed by each actor in four different directions. The dataset was captured using three cameras: an aerial camera, a rooftop camera at a height of 100 feet, and a ground camera. All videos were recorded in high definition (1920 × 1080 resolution) at a rate of 60 frames per second. A sample of the aerial UCF-ARG dataset is shown in Figure 2.
In this study, we selected the six most important actions from 10 actions in the aerial camera views of the UCF-ARG dataset, which were boxing, walking, digging, throwing, running, and waving. Before training the model, we pre-processed the dataset using various computer vision techniques. One of these techniques involved applying a scene stabilization method to decrease the ego-motion in the UAV frames. We also used human detection to crop and resize the human parts of the frames to 120 × 60 pixels. The cropped human frames were then divided into a training set and a test set. Table 2 provides a description of the proposed dataset and the number of human frames after the human detection and cropping process.

3.2. Pre-Processing of the Dataset

Since the dataset for the UAV-based HAR was gathered in the form of videos, we needed to pre-process the raw data so we could fully set up the proposed dataset. The pre-processing steps for the dataset were as follows:

3.2.1. Video Stabilization

Video stabilization is a process that eliminates or reduces camera shake and vibration in a video. It is widely utilized in a variety of applications, including action cameras, smartphone cameras, and professional video cameras, and is especially useful for handheld footage because it produces a smoother, more professional-looking result. To pre-process the dataset, we used computer vision techniques to stabilize the videos by reducing the motion of the acquisition platform, known as ego-motion. This step is crucial in the video surveillance field when UAVs are used as sensors. We applied a scene stabilization process to the UCF-ARG dataset, which contains various human actions. There are two types of motion in a video: global and local. Global motion is caused by the movement of the camera and affects both static and dynamic objects in the scene; it is used in the motion compensation step of scene stabilization to detect background objects. Local motion refers to the movement of dynamic objects within a scene. A variety of techniques can stabilize video, including hardware solutions such as gimbals and software solutions such as digital video stabilization algorithms; digital stabilization does not require special sensors for estimating camera motion. In this study, we used point feature matching for video stabilization. This technique identifies specific points in two consecutive frames and follows their movement; by tracking these features, we can determine the motion between the frames and correct for it. This process can be used to stabilize both global and local motion.
The basic steps of point feature matching are: reading a frame and converting it to grayscale, finding the motion between frames, calculating the smoothed motion between frames, and applying the smoothed camera motion to the frames. Video stabilization thus involves taking two frames of a video, estimating the movement between them, and then correcting that movement to produce a stable video. Step 2 of the algorithm finds the motion between frames. The Lucas-Kanade optical flow algorithm identifies the motion occurring between consecutive frames by iterating through all frames and comparing the current frame to the previous one. Good features to track are chosen using the goodFeaturesToTrack function in OpenCV. The motion is then estimated using the estimateRigidTransform function, which decomposes that motion into an x and y translation and a rotation; the resulting values are stored in an array. In the next step, to calculate the smooth motion between the frames, we cumulatively add the differential motion values estimated in the prior stage to obtain the trajectory of the motion. The goal is to smooth this trajectory, and to achieve this we use a moving average filter. This filter replaces the value of a function at a specific point with the average of the values of the function at points within a defined window around it. For example, if we have a curve stored in an array "c" and want to apply a moving average filter of width 5 to smooth that curve, we can calculate the kth element of the smoothed curve "f" using the following equation:
f[k] = (c[k − 2] + c[k − 1] + c[k] + c[k + 1] + c[k + 2])/5
In Equation (1), c[0] … c[n − 1] are points on the curve. The resulting smooth curve is obtained by averaging the noisy curve over a small window, as shown in Figure 3: the noisy curve is on the left, while the smoothed curve on the right is produced using a box filter of size 5.
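The smoothing step can be written compactly in code. The following is a minimal NumPy sketch of Equation (1) applied to a whole trajectory of accumulated (dx, dy, dθ) values; the edge-padding choice is an assumption made for illustration rather than the paper's exact implementation.

```python
import numpy as np

def smooth_curve(c, radius=2):
    # f[k] = mean of c[k - radius .. k + radius]; with radius = 2 this is the
    # width-5 box filter of Equation (1). Ends are padded by repeating values.
    window = 2 * radius + 1
    padded = np.pad(c, (radius, radius), mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def smooth_trajectory(trajectory, radius=2):
    # trajectory has shape (n_frames, 3): cumulative (dx, dy, d_theta) per frame.
    return np.stack([smooth_curve(trajectory[:, i], radius)
                     for i in range(trajectory.shape[1])], axis=1)
```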
After that, to obtain the smoothed transforms, we calculate the difference between the smoothed trajectory and the original trajectory and add this difference to the original transforms. The next step is to apply the smoothed transforms to the frames by looping over them and using the transformation matrix (T). The transformation matrix for a motion specified as (x, y, θ) is given by:
T = [ cos θ   −sin θ   x ]
    [ sin θ    cos θ   y ]
One potential issue with video stabilization is the appearance of black boundary artifacts. This is a natural byproduct of the process, as stabilizing a video effectively shrinks the usable area of the individual frames. One way to mitigate this issue is to apply a slight scaling transformation centered on the frame. To fix border artifacts in a stabilized video, we scale the video by a small amount (e.g., 4%) around its center using the getRotationMatrix2D function in OpenCV with a rotation of 0 and a scale of 1.04 (a 4% upscale), which scales the image without moving its center.
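Putting the pieces together, the following is a hedged OpenCV sketch of the per-frame stabilization pipeline described above: motion is estimated with goodFeaturesToTrack and Lucas-Kanade optical flow, the smoothed transform is applied with warpAffine, and the border fix uses getRotationMatrix2D. Newer OpenCV releases replace estimateRigidTransform with estimateAffinePartial2D, which is what this sketch uses; all parameter values are illustrative rather than the paper's exact settings.

```python
import cv2
import numpy as np

def estimate_motion(prev_gray, curr_gray):
    # Track corner features from the previous frame into the current one.
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                       qualityLevel=0.01, minDistance=30)
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None)
    good_prev = prev_pts[status.flatten() == 1]
    good_curr = curr_pts[status.flatten() == 1]
    # Rigid (translation + rotation) motion between the two frames.
    m, _ = cv2.estimateAffinePartial2D(good_prev, good_curr)
    dx, dy = m[0, 2], m[1, 2]
    da = np.arctan2(m[1, 0], m[0, 0])
    return dx, dy, da

def apply_smoothed_motion(frame, dx, dy, da, upscale=1.04):
    h, w = frame.shape[:2]
    # Transformation matrix T described above, built from the smoothed motion.
    t = np.array([[np.cos(da), -np.sin(da), dx],
                  [np.sin(da),  np.cos(da), dy]], dtype=np.float32)
    stabilized = cv2.warpAffine(frame, t, (w, h))
    # Scale ~4% around the centre to hide black border artifacts.
    fix = cv2.getRotationMatrix2D((w / 2, h / 2), 0, upscale)
    return cv2.warpAffine(stabilized, fix, (w, h))
```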

3.2.2. Human Detection and Cropping

The proposed system includes a step for detecting and cropping out each human in the dataset using an object-detection method. There are several options for object detection, including the Histogram of Oriented Gradients with a linear Support Vector Machine; Haar cascades; and deep-learning-based methods such as You Only Look Once (YOLO) [25], Single Shot Detectors (SSDs) [26,27], and region-based convolutional neural networks (R-CNNs) [28]. R-CNNs are known for their high accuracy but have a slow processing speed. YOLO and SSDs, on the other hand, use a one-stage detection approach and are faster but less accurate than two-stage detectors. YOLO and SSDs treat object detection as a regression problem, using an input image to predict bounding-box coordinates and class-label probabilities.
The proposed method is designed for use on an embedded platform, where speed is a critical factor. To ensure that the implementation can run in real time on a standard central processing unit, we will use a pre-trained YOLO object-detection model to detect all people in the video frames. YOLO is a highly efficient single-stage detector that has been trained on the Common Objects in Context (COCO) dataset [29]. It divides the image into an S × S grid and generates estimates for B bounding boxes, confidence levels for those boxes, and C class probabilities for each grid cell. These estimates are represented as an S × S × (B × 5 + C) tensor. Figure 4 shows the YOLO object-detection pipeline.
In this stage, we will use YOLOv3 [30]; specifically, a version of YOLO that has been trained on the COCO dataset, which includes 80 object labels. Although the YOLO model is able to detect a wide range of objects, we will use it only to identify people. YOLOv3 is the most advanced member of the YOLO family of object detectors, and although it uses a larger network than previous models, it has demonstrated superior performance. Leveraging the pre-trained YOLO object-detection model enables us to obtain bounding boxes and associated probabilities for the detected people in the video frames. After applying the object-detection model, it is common to obtain multiple bounding boxes surrounding the same target object. To reduce their number, we apply Non-Maximum Suppression (NMS) to remove bounding boxes that overlap too strongly. We set a minimum object-detection confidence level and an NMS threshold to filter out weak, overlapping bounding boxes and ensure that at least one reliable detection is made. Figure 5 shows the people-detection stage after applying the YOLO object-detection model and NMS.
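As an illustration of this step, the sketch below runs a pre-trained YOLOv3 model through the OpenCV DNN module and keeps only "person" detections before applying NMS. The configuration/weight file names and the thresholds are assumptions for the example, not the exact settings used in the paper.

```python
import cv2
import numpy as np

# Standard Darknet YOLOv3 files trained on COCO (assumed to be available locally).
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
output_layers = net.getUnconnectedOutLayersNames()

def detect_people(frame, conf_thresh=0.5, nms_thresh=0.3):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(output_layers)

    boxes, confidences = [], []
    for output in outputs:
        for det in output:
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            # Keep only the "person" class (index 0 in COCO) above the threshold.
            if class_id == 0 and confidence > conf_thresh:
                cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)

    # Non-Maximum Suppression removes weak, overlapping boxes.
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    return [boxes[i] for i in np.array(idxs).flatten()]
```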
After human detection, each cropped human will be resized as shown in Figure 6. The resized frames are then normalized by dividing each pixel value by 255 so that it lies between 0 and 1. The normalized frames are appended to a list, and their features are extracted and placed into a NumPy array using deep learning techniques. Finally, the dataset is divided into training and test sets before the training stage.
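A minimal sketch of this preparation step is shown below, assuming OpenCV, NumPy, and scikit-learn. The helper name prepare_samples and the 80/20 split are illustrative (the split ratio follows the experiments reported later), and the 120 × 60 target size follows Section 3.1.

```python
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_samples(frames_with_boxes, labels, size=(60, 120)):
    # frames_with_boxes: list of (frame, (x, y, w, h)) pairs from the detector.
    samples = []
    for frame, (x, y, w, h) in frames_with_boxes:
        crop = frame[y:y + h, x:x + w]
        crop = cv2.resize(crop, size)                     # width 60, height 120
        samples.append(crop.astype("float32") / 255.0)    # scale pixels to [0, 1]
    X = np.array(samples)
    y = np.array(labels)
    # 80% training / 20% test split, as used in the experiments.
    return train_test_split(X, y, test_size=0.2, random_state=42)
```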

3.3. Proposed HarNet Model for UAV-Based HAR

After obtaining images in the pre-processing stage, the next step is to develop a CNN model for HAR on UAV frames using deep learning techniques to make the predictions. The cropped images are resized to fit the dimensions required for the input layer of a CNN model, which is commonly used for image classification tasks. The CNN model uses filters to extract significant features from the image's pixels, and it consists of numerous modules, such as convolution, ReLU, batch normalization, pooling, softmax, and a fully connected layer [31].
A CNN is an artificial neural network designed to process and recognize images. It is composed of various layers, including convolutional, activation, pooling, and dense layers. In a convolutional layer, filters are applied to the input image to create a feature map. Mathematical operations are then performed to extract significant features and patterns from the input image. It is common to apply the ReLU activation function after each convolutional layer to prevent the vanishing gradient problem [32]. Pooling layers are used to reduce the size of the output from the previous layer in a neural network and to summarize the presence of certain features in patches of the feature map. There are three types of pooling: max pooling, mean pooling, and sum pooling. Max pooling is the most commonly utilized method in CNNs and involves extracting sub-regions of the feature map, taking the maximum value, and discarding all other values. A dense layer, also known as a fully connected layer, is employed to classify the features extracted from the convolutional and pooling layers. In a dense layer, every node is connected to every node in the previous layer [33]. The final dense layer in a CNN holds one artificial neuron for each target class and uses a softmax activation function to generate probabilities from 0 to 1 for each artificial neuron, with the sum of these probabilities equaling 1 [34].
Various techniques have been developed to improve the performance of CNNs. One approach is to increase the number of layers in the network, which enhances its learning ability but also increases the number of parameters and the complexity; the LeNet [35] architecture uses this structure. Another way to enhance the performance of CNNs is to design new types of connections and create networks with a smaller number of parameters through the connections formed in the blocks. For example, the GoogLeNet architecture has more layers than the AlexNet architecture but significantly fewer parameters, at around 4 million. However, the ResNet architecture shows that a greater number of layers does not always lead to better performance: the ResNet50 architecture, which has 49 convolutional layers and one fully connected layer, performs better than GoogLeNet. This suggests that the number of layers alone is not the determining factor in the performance of a neural network [36]. Inception-ResNet modules have been found to accelerate the training process for a network due to their use of 1 × 1 convolution layers [37]. Meanwhile, the ShuffleNet V2 architecture reduces both the parameter count and the network size by utilizing 1 × 1 convolution layers and depthwise convolution layers [38]. The ImageNet dataset is a popular choice for training deep learning models; such models can also be trained on other large, general datasets and then fine-tuned for a specific task through transfer learning. However, if the images used to pre-train the network are significantly different from the images in the target application, transfer learning may not improve the performance of the model. In such cases, it may be more effective to use a model with a lower-weight structure.
In this research, a CNN model was designed to perform HAR using video frames captured by a drone. To make the model more lightweight, depthwise separable convolutions were used instead of standard convolutions. In a depthwise separable convolution, a depthwise convolution is applied to each individual channel of the input, producing one output channel per input channel; these channels are then combined using a point-wise convolution in the second stage. The use of 1 × 1 point-wise convolution layers allows for a more efficient computation. Figure 7 compares depthwise separable convolution filters with the standard convolution operation. The computational complexity of a standard convolution layer with a kernel size of D_k and input and output sizes of M × W × H and N × W × H, respectively, can be calculated using Equation (3). The computational cost of the depthwise convolution is given in Equation (4), and the computational cost of the point-wise convolution is given in Equation (5).
Standard convolution cost = D_k² × M × W × H × N
Depthwise convolution cost = D_k² × M × W × H
Pointwise convolution cost = 1 × 1 × M × W × H × N = M × W × H × N
In the above equations, D_k represents the size of the kernel, and W and H are the width and height, respectively.
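To make the savings concrete, the short sketch below evaluates Equations (3)–(5) for an illustrative layer size; the dimensions are chosen for the example and are not the exact HarNet dimensions.

```python
def conv_costs(Dk, M, N, W, H):
    # Multiplication counts from Equations (3)-(5).
    standard = Dk**2 * M * W * H * N
    depthwise = Dk**2 * M * W * H
    pointwise = M * W * H * N
    return standard, depthwise + pointwise

std, sep = conv_costs(Dk=3, M=32, N=64, W=60, H=120)
print(f"standard: {std:,}  separable: {sep:,}  ratio: {sep / std:.3f}")
# The separable version costs roughly 1/N + 1/Dk^2 of the standard convolution,
# i.e. about 0.13 of the multiplications for a 3x3 kernel and 64 output channels.
```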
In this study, we explored the use of deep CNN models for classifying a dataset of images taken by UAVs. The best results were obtained with the 31-layer HarNet-CNN model, shown in Figure 8, which proved highly effective at identifying human actions in the UAV images. The proposed model utilizes depthwise separable convolution operations rather than standard convolution layers in order to reduce the number of parameters: a standard convolution operation has D_k² × M × N parameters, while the corresponding separable block has M × (D_k² + N). This leads to a significant reduction in both model size and computational cost. The proposed HarNet model had 31 layers, which included a standard convolution layer, depthwise convolution layers, pointwise convolution layers, and fully connected layers. As Figure 8 shows, the model had one standard convolution layer, nine depthwise separable convolution layers, nine spatial dropout layers, and three fully connected layers. The standard convolution layer was used as the stem layer of the model; a stem layer acts as a compression mechanism over the initial image, rapidly reducing the spatial size of the activations and therefore the memory and computational costs. After the stem layer, zero padding was used to preserve the dimensions of the input volume. In a CNN, zero padding adds rows and columns of zeros around a matrix to preserve features at the edges and control the size of the output feature map. After zero padding, a depthwise separable convolution is performed, followed by a spatial dropout layer. In this model, the dropout layers were used to decrease the size of the data at the end of the layers and prevent the network from overfitting. The activation function used in the convolutional layers was the ReLU, formulated as follows:
h(x) = max(0, x)
In addition, the proposed architecture utilized global average pooling (GAP) in place of max-pooling and flattening in the final stage in order to decrease the number of parameters and address the issue of overfitting. As a result, only three fully connected (dense) layers were applied; the dense layers create the connections between neurons. In this method, we used a CNN model with fewer parameters and applied softmax-based classifiers to the features extracted from the layer before the fully connected layers in order to classify the type of action.
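For illustration, the following Keras sketch assembles a HarNet-style network from the blocks just described: a stem convolution, zero padding, depthwise separable convolutions each followed by spatial dropout, global average pooling, and three dense layers. The filter counts, strides, dropout rate, and the number of separable blocks shown here are illustrative assumptions, not the published HarNet configuration.

```python
from tensorflow.keras import layers, models

def harnet_like(input_shape=(120, 60, 3), num_classes=6):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)   # stem layer
    x = layers.ZeroPadding2D(padding=1)(x)
    for filters in (64, 128, 256):          # a subset of the nine separable blocks
        x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.SpatialDropout2D(0.2)(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)  # GAP instead of flattening
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```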
The classification algorithms used in this context were K-nearest neighbors (KNN), random forest (RF), support vector machine (SVM), eXtreme gradient boosting (XGBoost), and softmax. KNN is a simple classification method that uses the distance between data points and their nearest neighbors to determine the class of the data, with two main parameters: the distance metric and the number of "neighbors" to consider. RF is an ensemble learning method that combines multiple decision-tree classifiers to make predictions, with the number of trees being a key parameter. SVM is a linear classifier that represents data as N-dimensional vectors and aims to find the optimal separating hyperplane. XGBoost is another decision-tree ensemble method that generates individual trees using multiple cores and organizes data to minimize look-up time, improving classification accuracy and reducing training time. Finally, softmax is a generalization of the logistic regression algorithm used for multiclass classification [39]. The softmax function is commonly used in CNNs for multiclass classification, along with the cross-entropy loss function. It produces posterior probability outputs and is calculated using Equation (7):
f_i(z) = e^{z_i} / Σ_{j ∈ C} e^{z_j}
where z_i denotes the score inferred by the network for class i in C. The loss function then takes the form of the cross-entropy (CE) loss expressed in Equation (8):
Loss_CE = −log( e^{z_p} / Σ_{j ∈ C} e^{z_j} )
where z_p denotes the model score for the positive class.
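As a quick numerical illustration of Equations (7) and (8), the NumPy snippet below computes the softmax probabilities and the cross-entropy loss for a made-up score vector over six action classes; the values are purely illustrative.

```python
import numpy as np

z = np.array([2.0, 0.5, -1.0, 0.3, 1.2, -0.7])   # one score per action class
probs = np.exp(z) / np.sum(np.exp(z))              # softmax, Equation (7)
positive_class = 0                                 # index of the true class
ce_loss = -np.log(probs[positive_class])           # cross-entropy, Equation (8)
print(probs.round(3), round(float(ce_loss), 3))
```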
In recent years, there has been a growing trend in machine learning research of using hybrid approaches to improve performance in classification tasks. This typically involves using a CNN for feature extraction and combining the extracted features with another machine learning model for the final classification. The softmax function is a commonly used classifier, especially in CNNs, and it was used as the baseline for comparison in this study. The aim of a hybrid approach is to leverage the strengths of different models to achieve better performance than any single model alone, by using the CNN to extract relevant features and another model to make the final classification decision. For example, CNNs are good at extracting features from images, such as edges or patterns, but may be less effective at making the final decision about what an image represents; other models may be better suited to that final decision.
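A sketch of such a hybrid scheme is given below: features are taken from the layer before the final dense layer and fed to a classical classifier (here an SVM, one of the classifiers listed above). It assumes the harnet_like() model from the earlier sketch and data splits X_train, X_test, y_train, y_test with integer class labels from the preparation step; these names are illustrative placeholders, not the paper's code.

```python
from tensorflow.keras import models
from sklearn.svm import SVC

cnn = harnet_like()
# ... cnn.fit(...) would be run here before extracting features ...
feature_extractor = models.Model(cnn.input, cnn.layers[-2].output)

train_feats = feature_extractor.predict(X_train)   # CNN features for training set
test_feats = feature_extractor.predict(X_test)     # CNN features for test set

svm = SVC(kernel="linear")
svm.fit(train_feats, y_train)                      # integer labels, not one-hot
print("Hybrid CNN + SVM accuracy:", svm.score(test_feats, y_test))
```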

4. Results

This section demonstrates the effectiveness of the proposed lightweight CNN-based method for recognizing human actions from UAV frames. With this method, we classified six important human actions, covering both usual and unusual behavior in the UAV frames: boxing, throwing, walking, running, digging, and waving. The proposed HarNet model was examined and compared, in terms of accuracy, with different CNN architectures used on the same dataset, such as MobileNet, Xception, DenseNet201, Inception-ResNet-V2, and VGG. In addition, the accuracy rates of our model were compared with those of models reported in the literature.

4.1. Performance Metrics

To evaluate the performance of the proposed classification technique, we calculated several metrics derived from the confusion matrix, including accuracy, F1-score, recall, and precision. These metrics are computed from the true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN) counts. Definitions and mathematical expressions for these metrics are as follows:
  • Accuracy measures how well a model can correctly anticipate the class or label of an input. Accuracy is calculated by dividing the number of correct predictions by the total number of predictions. The calculation is shown in Equation (9).
  • Recall measures the model’s ability to correctly classify all instances of a particular class. Recall is calculated by dividing the number of correct predictions for a specific class by the total number of instances of that class. The calculation is shown in Equation (10).
  • Precision measures the model’s ability to correctly predict positive instances of a particular class. Precision is calculated by dividing the number of correct positive predictions by the total number of positive predictions. The calculation is shown in Equation (11).
  • F1-score is a weighted average of precision and recall; a higher score indicates better performance. F1-score is determined by calculating the harmonic mean of precision and recall. The calculation is shown in Equation (12).
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Recall = TP/(FN + TP)
Precision = TP/(FP + TP)
F1-score = 2 × ((Recall) × (Precision))/((Recall) + (Precision))
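These quantities can be computed directly from predicted and true labels. The sketch below uses scikit-learn on a made-up label vector, with weighted averaging as in the classification report of Table 4; the labels are illustrative only.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 1, 2, 2, 1, 0]     # illustrative class labels
y_pred = [0, 2, 2, 2, 1, 0]

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-score :", f1_score(y_true, y_pred, average="weighted"))
```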

4.2. Experimental Results for the Proposed HarNet Model

In this study, we conducted an experiment using data containing various aerial human actions and developed a lightweight CNN model to classify these actions. We evaluated the performance of the proposed HarNet model using the UCF-ARG dataset. Our model was trained on a total of 42,147 images for six-class classification. The proposed lightweight CNN model was trained using the parameters listed in Table 3, with default values used where not otherwise specified. During the backpropagation process, the weights and biases were updated using sets of 64 data points, referred to as mini-batches. It is generally recommended to select a mini-batch size that is a divisor of the total dataset size in order to balance the convergence rate and ensure an accurate estimation during the learning process. The last layer of the CNN architecture used the softmax activation function, and the number of epochs for each cross-validation run was set to 50. The Adam optimization method was used. The model was implemented using the TensorFlow and Keras libraries in Python and was run on a GPU-supported system with an NVIDIA GeForce GTX 1070 graphics card, 16 GB of RAM, and an Intel Core i7-8700 processor on a 64-bit Ubuntu 20.04 operating system.
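A minimal sketch of this training configuration is given below, assuming the harnet_like() model and the data splits from the earlier sketches, with labels one-hot encoded for the categorical cross-entropy loss; the variable names y_train_onehot and y_test_onehot are illustrative.

```python
from tensorflow.keras.optimizers import Adam

# Training configuration following Table 3: Adam, categorical cross-entropy,
# mini-batch size 64, 50 epochs, initial learning rate 1e-6.
model = harnet_like(input_shape=(120, 60, 3), num_classes=6)
model.compile(optimizer=Adam(learning_rate=1e-6),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_train, y_train_onehot,
                    validation_data=(X_test, y_test_onehot),
                    batch_size=64, epochs=50)
```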
The proposed approach improved the performance of human detection and HAR. Within this approach, depthwise separable convolutions were used to produce a CNN model for classifying human actions. The performance of the developed deep CNN model in classifying human actions in UAV frames was very high, as the results in Figure 9 show: the graph displays the accuracy and loss values over 50 epochs for the parameters listed in Table 3. The maximum accuracy achieved after the training stage was 96.15%, obtained by dividing the dataset into 80% for training and 20% for validation. The per-class results of the proposed lightweight CNN model are summarized in Table 4, and the prediction results for each class are shown in the confusion matrix in Figure 10.
Additionally, we compared the proposed HarNet model with other CNN models. In our comparison, the proposed HarNet model outperformed other CNN models, including the MobileNet, Xception, DenseNet201, VGG-16, VGG-19, and Inception-ResNet-V2 models, on the same dataset. Moreover, the proposed model had several advantages, including a high classification performance, low complexity, and a small number of parameters. Table 5 shows the comparison of the different CNN architectures with the proposed HarNet model in terms of accuracy, loss, and number of parameters.
The proposed model demonstrated a high level of success compared with other methods in the field that used the UCF-ARG dataset. This was determined by comparing the number of recognized actions, the accuracy, and the approach used; the results of this comparison are presented in Table 6. Hazar Mliki et al. [16] performed HAR on UAV sequences from the UCF-ARG dataset using the GoogLeNet architecture, reaching an accuracy of 68%. Waqas Sultani et al. [18] proposed a method for improving HAR using UAVs by combining game videos with aerial features generated by a generative adversarial network (GAN), using a DML framework to handle the different activity labels in the game and real-world UCF-ARG datasets; their experiments showed that combining video game activities and GAN-generated instances could enhance aerial recognition accuracy when used appropriately. Nouar Aldahoul et al. [40] used the UCF-ARG dataset with the EfficientNetB7 architecture and long short-term memory (LSTM) for model training, with a success rate of 80%. Peng et al. [41], using the UCF-ARG dataset with a modified Inception-ResNet-V2 model that combined a three-dimensional CNN architecture and a residual network for HAR, achieved a success rate of 85.83%. The main benefits of the proposed HarNet method were its lightweight design and superior classification performance, while it recognized more actions than most of the other methods reported in the literature. Additionally, the proposed model had a smaller network architecture with fewer trainable parameters than the other methods, and its overall size was small, making it suitable for running on embedded devices, which is useful for applications that require security monitoring in various areas.

5. Conclusions and Future Work

Intelligent applications are essential for improving UAV performance, but identifying human activity in UAV video sequences remains a challenge that needs to be resolved. It is therefore crucial to develop a solution that accurately identifies human actions within UAV video frames. In this paper, a new lightweight CNN named HarNet was proposed to detect human actions in UAV video sequences. The proposed method was evaluated on the UCF-ARG dataset and showed high accuracy when identifying six common actions. The HarNet model utilized features extracted from the global average pooling layer and a softmax classifier. This paper presents an economical solution based on deep learning that can be applied to UAVs in various industrial applications, including security, surveillance, and military operations. Specifically, the proposed HarNet model can be used in military applications for autonomous attack operations in specific dangerous areas, and in smart city applications it can be used to detect unusual human actions in real-time systems. The proposed approach increases the performance of HAR by using depthwise separable convolutions to produce a lightweight CNN model. The performance of this model in classifying human actions in UAV frames is very high, with a maximum accuracy of 96.15% achieved after training with only 2.2 million parameters. Compared with other CNN models, including MobileNet, Xception, DenseNet201, VGG-16, VGG-19, and Inception-ResNet-V2, the proposed HarNet model performs better in terms of accuracy, loss, and number of parameters. Additionally, compared with other methods in the field that used the UCF-ARG dataset, the proposed model has a higher success rate and a smaller network architecture with fewer trainable parameters, making it suitable for running on embedded devices.
In future work, we suggest applying the proposed HarNet to a more extensive database by combining the UCF-ARG dataset with a new one. We also recommend exploring the use of the proposed method in innovative UAV applications, such as detecting armed humans in hazardous areas. In addition, the detection and reporting of unusual actions in sensitive or restricted areas can be performed with CNN-based methods, and such systems could command UAVs and assign them essential tasks based on the activities recognized through the proposed HarNet model.

Author Contributions

Conceptualization, N.A.O. and I.A.; methodology, N.A.O.; software, N.A.O.; validation, N.A.O. and I.A.; formal analysis, N.A.O. and I.A.; investigation, N.A.O.; resources, N.A.O. and I.A.; data curation, N.A.O.; writing—original draft preparation, N.A.O.; writing—review and editing, N.A.O. and I.A.; visualization, N.A.O.; supervision, I.A.; project administration, N.A.O. and I.A.; funding acquisition, N.A.O. and I.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data of this paper can be obtained by contacting Nashwan Adnan Othman.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Abro, G.E.M.; Zulkifli, S.A.B.M.; Masood, R.J.; Asirvadam, V.S.; Laouti, A. Comprehensive Review of UAV Detection, Security, and Communication Advancements to Prevent Threats. Drones 2022, 6, 284.
  2. Yaacoub, J.-P.; Noura, H.; Salman, O.; Chehab, A. Security Analysis of Drones Systems: Attacks, Limitations, and Recommendations. Internet Things 2020, 11, 100218.
  3. Mohamed, N.; Al-Jaroodi, J.; Jawhar, I.; Idries, A.; Mohammed, F. Unmanned Aerial Vehicles Applications in Future Smart Cities. Technol. Forecast. Soc. Chang. 2020, 153, 119293.
  4. Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the Unmanned Aerial Vehicles (UAVs): A Comprehensive Review. Drones 2022, 6, 147.
  5. Zhang, N.; Wang, Y.; Yu, P. A Review of Human Action Recognition in Video. In Proceedings of the 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (ICIS), Singapore, 6–8 June 2018; pp. 57–62.
  6. Mottaghi, A.; Soryani, M.; Seifi, H. Action Recognition in Freestyle Wrestling Using Silhouette-Skeleton Features. Eng. Sci. Technol. Int. J. 2020, 23, 921–930.
  7. Agahian, S.; Negin, F.; Köse, C. An Efficient Human Action Recognition Framework with Pose-Based Spatiotemporal Features. Eng. Sci. Technol. Int. J. 2019, 23, 196–203.
  8. Arshad, M.H.; Bilal, M.; Gani, A. Human Activity Recognition: Review, Taxonomy and Open Challenges. Sensors 2022, 22, 6463.
  9. Aydin, I. Fuzzy Integral and Cuckoo Search Based Classifier Fusion for Human Action Recognition. Adv. Electr. Comput. Eng. 2018, 18, 3–10.
  10. Othman, N.A.; Aydin, I. Challenges and Limitations in Human Action Recognition on Unmanned Aerial Vehicles: A Comprehensive Survey. Trait. Signal 2021, 38, 1403–1411.
  11. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions; Springer International Publishing: New York, NY, USA, 2021; Volume 8, ISBN 4053702100444.
  12. Lecun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
  14. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329.
  15. Hinton, G.E. A Practical Guide to Training Restricted Boltzmann Machines. In Neural Networks: Tricks of the Trade; Montavon, G., Orr, G.B., Müller, K.R., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7700, pp. 599–619.
  16. Mliki, H.; Bouhlel, F.; Hammami, M. Human Activity Recognition from UAV-Captured Video Sequences. Pattern Recognit. 2020, 100, 107140.
  17. CRCV | Center for Research in Computer Vision at the University of Central Florida. Available online: https://www.crcv.ucf.edu/data/UCF-ARG.php (accessed on 2 July 2021).
  18. Sultani, W.; Shah, M. Human Action Recognition in Drone Videos Using a Few Aerial Training Examples. Comput. Vis. Image Underst. 2021, 206, 103186.
  19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144.
  20. Perera, A.G.; Law, Y.W.; Chahl, J. Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition. Drones 2019, 3, 82.
  21. Cheron, G.; Laptev, I.; Schmid, C. P-CNN: Pose-Based CNN Features for Action Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3218–3226.
  22. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards Understanding Action Recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 3192–3199.
  23. Kotecha, K.; Garg, D.; Mishra, B.; Narang, P.; Mishra, V.K. Background Invariant Faster Motion Modeling for Drone Action Recognition. Drones 2021, 5, 87.
  24. Liu, C.; Szirányi, T. Real-Time Human Detection and Gesture Recognition for on-Board UAV Rescue. Sensors 2021, 21, 2180.
  25. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  26. Song, W.; Shumin, F. SSD (Single Shot MultiBox Detector). Ind. Control. Comput. 2019, 32, 103–105.
  27. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37.
  28. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  29. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755.
  30. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  31. Haar, L.V.; Elvira, T.; Ochoa, O. An Analysis of Explainability Methods for Convolutional Neural Networks. Eng. Appl. Artif. Intell. 2023, 117, 105606.
  32. Alaslani, M.G.; Elrefaei, L.A. Convolutional Neural Network Based Feature Extraction for IRIS Recognition. Int. J. Comput. Sci. Inf. Technol. 2018, 10, 65–78.
  33. Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of Image Classification Algorithms Based on Convolutional Neural Networks. Remote Sens. 2021, 13, 4712.
  34. Geist, M. Soft-Max Boosting. Mach. Learn. 2015, 100, 305–332.
  35. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2323.
  36. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  37. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
  38. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11218, pp. 122–138.
  39. Wu, J.; Gao, Z.; Hu, C. An Empirical Study on Several Classification Algorithms and Their Improvements. In Advances in Computation and Intelligence; Cai, Z., Li, Z., Kang, Z., Liu, Y., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5821, pp. 276–286.
  40. A Comparison Between Various Human Detectors and CNN-Based Feature Extractors for Human Activity Recognition via Aerial Captured Video Sequences. Available online: https://www.researchgate.net/publication/361177545_A_Comparison_Between_Various_Human_Detectors_and_CNN-Based_Feature_Extractors_for_Human_Activity_Recognition_Via_Aerial_Captured_Video_Sequences (accessed on 20 December 2022).
  41. Peng, H.; Razi, A. Fully Autonomous UAV-Based Action Recognition System Using Aerial Imagery. In Advances in Visual Computing; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12509.
Figure 1. Block diagram of the proposed HAR approach.
Figure 2. Examples of aerial human actions in the UCF-ARG dataset.
Figure 3. A noisy curve versus a smoothed version.
Figure 4. A simple diagram of how the YOLO object detector works [25].
Figure 5. Human detection stage.
Figure 6. Some samples of the cropped and resized human actions.
Figure 7. Standard convolution versus depthwise separable convolutions.
Figure 8. Proposed HarNet-CNN architecture.
Figure 9. Results for training and validation of the HarNet model for HAR.
Figure 10. Prediction results for HarNet model.
Table 1. Comparison table of the literature.

Authors | Number of Actions | Method | Dataset | Accuracy | Remarks
Hazar Mliki et al. [16] | 10 human actions are recognized, including digging, boxing, and running | CNN model (GoogLeNet architecture) | UCF-ARG dataset [17] | 68% | Low accuracy is obtained
Waqas Sultani et al. [18] | 8 human actions are recognized, including biking, diving, and running | DML for HAR model generation and Wasserstein GAN to generate aerial features from ground frames | (1) Aerial-Ground game dataset; (2) YouTube-Aerial dataset; (3) UCF-ARG | 68.2% | One limitation of DML is the need for access to various labels for each task for the corresponding data
A.G. Perera et al. [19] | 13 human actions are recognized, including walking, kicking, jogging, punching, stabbing, and running | P-CNN | Their own dataset, called the Drone-Action dataset | 75.92% | The dataset was collected at low altitude and low speed
Kotecha et al. [20] | 5 human actions are recognized, including fighting, lying, shaking one hand, waving one hand, and waving both hands | FMFM and AAR | Their own dataset | 90% | Reasonable accuracy is obtained
Liu et al. [21] | 10 human actions are recognized, including walking, standing, and making a phone call | DNN model and OpenPose algorithm | Their own dataset | 99.80% | At low altitude, high accuracy can be achieved
Table 2. Description of the proposed dataset.

Class Number | Action | Number of Actors | Number of Instances per Actor | Total Videos per Action | Number of Cropped Frames after Human Detection
1 | Boxing | 12 | 4 | 48 | 5844
2 | Walking | 12 | 4 | 48 | 7112
3 | Digging | 12 | 4 | 48 | 6239
4 | Throwing | 12 | 4 | 48 | 7511
5 | Running | 12 | 4 | 48 | 1939
6 | Waving | 12 | 4 | 48 | 6689
Total | | 72 | 24 | 288 | 35,334
Table 3. Training parameters of the proposed HarNet model.

Parameter | Value
Used software | Python
Framework | Keras and TensorFlow
Image size | 120 × 60
Optimizer | Adam
Loss type | Categorical cross-entropy
Initial learning rate | 1 × 10⁻⁶
Mini-batch size | 64
Epochs | 50
Activation function | Softmax
Table 4. Classification report of the proposed HarNet model.

Action | Precision | Recall | F1-Score | Support
Boxing | 0.9765 | 0.9878 | 0.9821 | 1391
Digging | 0.9745 | 0.9778 | 0.9761 | 1485
Running | 0.9561 | 0.6602 | 0.7810 | 462
Throwing | 0.9780 | 0.9586 | 0.9682 | 1666
Walking | 0.9157 | 0.9856 | 0.9494 | 1598
Waving | 0.9703 | 0.9919 | 0.9810 | 1480
Accuracy | | | 0.9615 | 8082
Macro average | 0.9618 | 0.9270 | 0.9396 | 8082
Weighted average | 0.9621 | 0.9615 | 0.9600 | 8082
Table 5. Accuracy, loss, and number of parameters of the HarNet model compared with other CNN architectures for classification.

Model Name | Parameters (Millions) | Loss | Accuracy
MobileNet | 3.3 | 0.29 | 92.24%
VGG-16 | 14.7 | 0.39 | 86.54%
VGG-19 | 20.1 | 0.61 | 81.39%
Xception | 20.99 | 0.49 | 87.32%
Inception-ResNet-V2 | 54.43 | 0.45 | 87.26%
DenseNet201 | 18.44 | 0.30 | 91.80%
Proposed HarNet model | 2.2 | 0.07 | 96.15%
Table 6. Comparison of the proposed HarNet model with other methods in the literature that used the UCF-ARG dataset.

Model | Accuracy | Number of Actions
Hazar Mliki et al. 2020 [16]: Optical flow (stabilization) + GoogLeNet (classification) | 68% | 5
Waqas Sultani et al. 2021 [18]: GAN + DML | 68.2% | 8
Nouar Aldahoul et al. 2022 [40]: EfficientNetB7_LSTM (classification) | 80% | 5
Peng et al. 2020 [41]: SURF (stabilization) + Inception-ResNet-3D (classification) | 85.83% | 5
Proposed HarNet model | 96.15% | 6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
