Deep Learning Applications for Pose Estimation and Human Action Recognition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 20 October 2024

Special Issue Editors


Dr. Paolo Russo
Guest Editor
Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
Interests: deep learning; machine learning; computer vision; depth estimation; attitude and pose estimation

Dr. Fabiana Di Ciaccio
Guest Editor
Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4, 80143 Naples, Italy
Interests: navigation and positioning; attitude and pose estimation; 3D modeling; geomatics sciences; sensors; deep learning; computer vision

Dr. Irene Amerini
Guest Editor
Department of Computer, Control, and Management Engineering A. Ruberti, Sapienza University of Rome, 00185 Rome, Italy
Interests: multimedia forensics and security; machine learning; deep learning; computer vision

Special Issue Information

Dear Colleagues,

In the last decade, deep learning has drawn significant attention thanks to its robustness and its potential in generalization and learning. Numerous applications have been tested and successfully deployed, covering the majority of real-world tasks with the aim of improving their performance. Among others, pose estimation and human action recognition have benefited from the exceptional results achieved in the deep learning field, although wide margins for improvement remain.

This Special Issue aims to gather a significant collection of original contributions on these topics. Accurate estimation of vehicle and human pose is crucial for several applications, e.g., animal behavior research, gaming and virtual reality, medicine and biotechnology, pedestrian, aerial, and maritime navigation, robotics, and human motion tracking. Furthermore, effective human pose and action recognition makes an important contribution in many fields, such as physical therapists' diagnoses and patient rehabilitation, security and surveillance, and cashier-free store development.

The relevant topics of this issue include but are not limited to the following:

  • Single- and multi-human pose estimation, action recognition, and tracking;
  • Terrestrial, maritime, and aerial robot pose estimation and tracking;
  • Literature reviews and surveys;
  • Datasets and sensors;
  • Applications and ideas focusing on surveillance, autonomous navigation, human–robot interaction, healthcare, sports, etc.

Dr. Paolo Russo
Dr. Fabiana Di Ciaccio
Dr. Irene Amerini
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • action recognition
  • pose estimation
  • human activities
  • robotics and intelligent systems
  • navigation
  • positioning
  • control
  • datasets
  • sensors
  • embedded systems and devices

Published Papers (10 papers)


Research

21 pages, 2576 KiB  
Article
A Deep Learning Model for Markerless Pose Estimation Based on Keypoint Augmentation: What Factors Influence Errors in Biomechanical Applications?
by Ana V. Ruescas-Nicolau, Enrique Medina-Ripoll, Helios de Rosario, Joaquín Sanchiz Navarro, Eduardo Parrilla and María Carmen Juan Lizandra
Sensors 2024, 24(6), 1923; https://doi.org/10.3390/s24061923 - 17 Mar 2024
Abstract
In biomechanics, movement is typically recorded by tracking the trajectories of anatomical landmarks previously marked using passive instrumentation, which entails several inconveniences. To overcome these disadvantages, researchers are exploring different markerless methods, such as pose estimation networks, to capture movement with accuracy equivalent to marker-based photogrammetry. However, pose estimation models usually only provide joint centers, which are incomplete data for calculating joint angles in all anatomical axes. Recently, marker augmentation models based on deep learning have emerged. These models transform pose estimation data into complete anatomical data. Building on this concept, this study presents three marker augmentation models of varying complexity that were compared to a photogrammetry system. The errors in anatomical landmark positions and the derived joint angles were calculated, and a statistical analysis of the errors was performed to identify the factors that most influence their magnitude. The proposed Transformer model improved upon the errors reported in the literature, yielding position errors of less than 1.5 cm for anatomical landmarks and joint angle errors of less than 4.4 degrees for all seven movements evaluated. Anthropometric data did not influence the errors, while anatomical landmarks and movement influenced position errors, and model, rotation axis, and movement influenced joint angle errors.

15 pages, 2806 KiB  
Article
Human Pose Estimation Based on Efficient and Lightweight High-Resolution Network (EL-HRNet)
by Rui Li, An Yan, Shiqiang Yang, Duo He, Xin Zeng and Hongyan Liu
Sensors 2024, 24(2), 396; https://doi.org/10.3390/s24020396 - 09 Jan 2024
Abstract
As an important direction in computer vision, human pose estimation has received extensive attention in recent years. As a classical human pose estimation method, the High-Resolution Network (HRNet) achieves effective estimation results. However, the complex structure of the model is not conducive to deployment under limited computing resources. Therefore, an improved Efficient and Lightweight HRNet (EL-HRNet) model is proposed. In detail, point-wise and grouped convolutions were used to construct a lightweight residual module, replacing the original 3 × 3 module to reduce the parameters. To compensate for the information loss caused by the network's lightweight design, the Convolutional Block Attention Module (CBAM) is introduced after the new lightweight residual module to construct the Lightweight Attention Basicblock (LA-Basicblock) module and achieve high-precision human pose estimation. To verify the effectiveness of the proposed EL-HRNet, experiments were carried out on the COCO2017 and MPII datasets. The experimental results show that the EL-HRNet model requires only 5 million parameters and 2.0 GFLOPs of computation, and achieves an AP score of 67.1% on the COCO2017 validation set. In addition, the mean PCKh@0.5 is 87.7% on the MPII validation set, and EL-HRNet shows a good balance between model complexity and human pose estimation accuracy.
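The PCKh@0.5 score reported above has a simple definition: a predicted joint counts as correct when it falls within half of the head-segment length of its ground-truth position. A minimal NumPy sketch of the metric; the array shapes and the toy head size are illustrative assumptions, not values from the paper:

```python
import numpy as np

def pckh(pred, gt, head_sizes, thresh=0.5):
    """PCKh: fraction of predicted joints that fall within
    `thresh` * head-segment length of the ground truth."""
    # pred, gt: (num_samples, num_joints, 2); head_sizes: (num_samples,)
    dists = np.linalg.norm(pred - gt, axis=-1)       # (N, J) pixel errors
    correct = dists <= thresh * head_sizes[:, None]  # per-joint hits
    return correct.mean()

# Toy example: joints at 0, 2, 4, and 20 px error with a 10 px head
# segment; three of four fall within the 5 px threshold.
gt = np.zeros((1, 4, 2))
pred = np.array([[[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [20.0, 0.0]]])
head = np.array([10.0])  # hypothetical head-segment length in pixels
print(pckh(pred, gt, head))  # 0.75
```

Averaging the per-joint indicator over samples only, instead of over everything, yields the per-joint breakdowns commonly reported alongside the mean.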

25 pages, 1469 KiB  
Article
PosturePose: Optimized Posture Analysis for Semi-Supervised Monocular 3D Human Pose Estimation
by Lawrence Amadi and Gady Agam
Sensors 2023, 23(24), 9749; https://doi.org/10.3390/s23249749 - 11 Dec 2023
Abstract
One motivation for studying semi-supervised techniques for human pose estimation is to compensate for the lack of variety in curated 3D human pose datasets by combining labeled 3D pose data with readily available unlabeled video data—effectively, leveraging the annotations of the former and the rich variety of the latter to train more robust pose estimators. In this paper, we propose a novel, fully differentiable posture consistency loss that is unaffected by camera orientation and improves monocular human pose estimators trained with limited labeled 3D pose data. Our semi-supervised monocular 3D pose framework combines biomechanical pose regularization with a multi-view posture (and pose) consistency objective function. We show that posture optimization was effective at decreasing pose estimation errors when applied to a 2D–3D lifting network (VPose3D) and two well-studied datasets (H36M and 3DHP). Specifically, the proposed semi-supervised framework with multi-view posture and pose loss lowered the mean per-joint position error (MPJPE) of leading semi-supervised methods by up to 15% (−7.6 mm) when camera parameters of unlabeled poses were provided. Without camera parameters, our semi-supervised framework with posture loss improved semi-supervised state-of-the-art methods by 17% (−15.6 mm decrease in MPJPE). Overall, our pose models compete favorably with other high-performing pose models trained under similar conditions with limited labeled data.
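The mean per-joint position error (MPJPE) quoted above is the average Euclidean distance between predicted and ground-truth 3D joint positions, usually reported in millimetres. A minimal NumPy sketch with illustrative array shapes:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth 3D joints (same units as input)."""
    # pred, gt: (num_frames, num_joints, 3)
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: a constant 10 mm offset along x gives MPJPE = 10 mm.
gt = np.zeros((2, 17, 3))
pred = gt.copy()
pred[..., 0] += 10.0
print(mpjpe(pred, gt))  # 10.0
```

Evaluation protocols differ in how poses are aligned first (root-relative vs. Procrustes alignment); the sketch omits that step.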

14 pages, 608 KiB  
Article
GaitSG: Gait Recognition with SMPLs in Graph Structure
by Jiayi Yan, Shaohui Wang, Jing Lin, Peihao Li, Ruxin Zhang and Haoqian Wang
Sensors 2023, 23(20), 8627; https://doi.org/10.3390/s23208627 - 22 Oct 2023
Abstract
Gait recognition aims to identify a person based on their unique walking pattern. Compared with silhouettes and skeletons, skinned multi-person linear (SMPL) models can simultaneously provide human pose and shape information and are robust to viewpoint and clothing variances. However, previous approaches have only considered SMPL parameters as a whole and have yet to explore their potential for gait recognition thoroughly. To address this problem, we concentrate on SMPL representations and propose a novel SMPL-based method named GaitSG for gait recognition, which takes SMPL parameters in a graph structure as input. Specifically, we represent the SMPL model as graph nodes and employ graph convolution techniques to effectively model the human model topology and generate discriminative gait features. Further, we utilize prior knowledge of the human body and design a novel part graph pooling block, PGPB, to encode viewpoint information explicitly. The PGPB also alleviates the physical distance-unaware limitation of the graph structure. Comprehensive experiments on public gait recognition datasets, Gait3D and CASIA-B, demonstrate that GaitSG achieves better performance and faster convergence than existing model-based approaches. Specifically, compared with the baseline SMPLGait (3D only), our model achieves approximately twice the Rank-1 accuracy and requires three times fewer training iterations on Gait3D.

19 pages, 4744 KiB  
Article
DUA: A Domain-Unified Approach for Cross-Dataset 3D Human Pose Estimation
by João Renato Ribeiro Manesco, Stefano Berretti and Aparecido Nilceu Marana
Sensors 2023, 23(17), 7312; https://doi.org/10.3390/s23177312 - 22 Aug 2023
Abstract
Human pose estimation is an important computer vision problem whose goal is to estimate the human body through its joints. Currently, methods that employ deep learning techniques excel at 2D human pose estimation. However, the use of 3D poses can bring more accurate and robust results. Since 3D pose labels can only be acquired in restricted scenarios, fully convolutional methods tend to perform poorly on the task. One strategy to solve this problem is to use 2D pose estimators and estimate 3D poses in two steps from 2D pose inputs. Due to database acquisition constraints, the performance improvement of this strategy can only be observed in controlled environments; therefore, domain adaptation techniques can be used to increase the generalization capability of the system by inserting information from synthetic domains. In this work, we propose a novel method called the Domain Unified approach, aimed at solving pose misalignment problems in a cross-dataset scenario through a combination of three modules on top of the pose estimator: a pose converter, an uncertainty estimator, and a domain classifier. Our method led to a 44.1 mm (29.24%) error reduction when training with the SURREAL synthetic dataset and evaluating on Human3.6M, compared to a no-adaptation scenario, achieving state-of-the-art performance.

17 pages, 4685 KiB  
Article
Research Method of Discontinuous-Gait Image Recognition Based on Human Skeleton Keypoint Extraction
by Kun Han and Xinyu Li
Sensors 2023, 23(16), 7274; https://doi.org/10.3390/s23167274 - 19 Aug 2023
Cited by 4
Abstract
As a biometric characteristic, gait uses the posture characteristics of human walking for identification, with the advantages of a long recognition distance and no requirement for the cooperation of subjects. This paper proposes a method for recognising gait images at the frame level, even in cases of discontinuity, based on human keypoint extraction. In order to reduce the network's dependence on the temporal characteristics of the image sequence during training, a discontinuous frame screening module is added to the front end of the gait feature extraction network to restrict the image information input to the network. For gait feature extraction, a cross-stage partial connection (CSP) structure is added to the spatial–temporal graph convolutional network's bottleneck structure in the ResGCN network to effectively filter interference information, and XBNBlock is inserted on top of the CSP structure to reduce estimation errors caused by network deepening and small-batch-size training. Our model achieves an average recognition accuracy of 79.5% on the gait dataset CASIA-B. The proposed method also achieves 78.1% accuracy on the CASIA-B sample after training with a limited number of image frames, which indicates that the model is more robust.

19 pages, 4363 KiB  
Article
TFC-GCN: Lightweight Temporal Feature Cross-Extraction Graph Convolutional Network for Skeleton-Based Action Recognition
by Kaixuan Wang and Hongmin Deng
Sensors 2023, 23(12), 5593; https://doi.org/10.3390/s23125593 - 15 Jun 2023
Cited by 2
Abstract
For skeleton-based action recognition, graph convolutional networks (GCNs) have clear advantages. Existing state-of-the-art (SOTA) methods have tended to focus on extracting and identifying features from all bones and joints, ignoring many new input features that could be discovered. Moreover, many GCN-based action recognition models do not pay sufficient attention to the extraction of temporal features, and most have bloated structures due to excessive parameters. To solve these problems, a temporal feature cross-extraction graph convolutional network (TFC-GCN) with a small number of parameters is proposed. Firstly, we propose a feature extraction strategy based on the relative displacements of joints, computed between each frame's previous and subsequent frames. Then, TFC-GCN uses a temporal feature cross-extraction block with gated information filtering to excavate high-level representations of human actions. Finally, we propose a stitching spatial–temporal attention (SST-Att) block that gives different joints different weights so as to obtain favorable classification results. The FLOPs and parameter count of TFC-GCN reach 1.90 G and 0.18 M, respectively, and its superiority has been verified on three large-scale public datasets, namely NTU RGB+D60, NTU RGB+D120 and UAV-Human.
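The relative-displacement feature described in the abstract can be sketched as a central difference over each joint's trajectory, i.e. the displacement between the previous and subsequent frames. The array shapes and the zero padding at the sequence ends are illustrative assumptions, not details from the paper:

```python
import numpy as np

def relative_displacement(joints):
    """Per-joint displacement between the previous and subsequent
    frames: x_{t+1} - x_{t-1}, a central-difference motion feature."""
    # joints: (num_frames, num_joints, 3)
    disp = np.zeros_like(joints)
    disp[1:-1] = joints[2:] - joints[:-2]  # interior frames only
    return disp

# Toy example: every coordinate advances by 1 per frame, so each
# central difference spans two frames.
seq = np.arange(5)[:, None, None] * np.ones((5, 17, 3))
print(relative_displacement(seq)[2, 0])  # [2. 2. 2.]
```

Stacking this displacement alongside the raw coordinates is a common way to feed motion cues into a skeleton GCN.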

18 pages, 13751 KiB  
Article
HPnet: Hybrid Parallel Network for Human Pose Estimation
by Haoran Li, Hongxun Yao and Yuxin Hou
Sensors 2023, 23(9), 4425; https://doi.org/10.3390/s23094425 - 30 Apr 2023
Abstract
Hybrid models that combine convolution and transformer modules achieve impressive performance on human pose estimation. However, existing hybrid models for human pose estimation, which typically stack self-attention modules after convolution, are prone to mutual conflict. This conflict forces one type of module to dominate these hybrid sequential models; consequently, higher-precision keypoint localization is not consistent with overall performance. To alleviate this conflict, we developed a hybrid parallel network that parallelizes the self-attention modules and the convolution modules, which helps leverage their complementary capabilities effectively. The parallel network ensures that the self-attention branch models long-range dependencies to enhance semantic representation, while the local sensitivity of the convolution branch simultaneously contributes to high-precision localization. To further mitigate the conflict, we propose a cross-branch attention module to gate the features generated by both branches along the channel dimension. The hybrid parallel network achieves 75.6% and 75.4% AP on the COCO validation and test-dev sets, with consistent performance on both higher-precision localization and overall metrics. The experiments show that our hybrid parallel network is on par with state-of-the-art human pose estimation models.

28 pages, 22634 KiB  
Article
WiTransformer: A Novel Robust Gesture Recognition Sensing Model with WiFi
by Mingze Yang, Hai Zhu, Runzhe Zhu, Fei Wu, Ling Yin and Yuncheng Yang
Sensors 2023, 23(5), 2612; https://doi.org/10.3390/s23052612 - 27 Feb 2023
Cited by 2
Abstract
The past decade has demonstrated the potential of human activity recognition (HAR) with WiFi signals, owing to their non-invasiveness and ubiquity. Previous research has largely concentrated on enhancing precision through sophisticated models, while the complexity of recognition tasks has been largely neglected. Thus, the performance of HAR systems is markedly diminished when tasked with increasing complexity, such as a larger number of classes, confusion between similar actions, and signal distortion. To address this issue, we eliminated conventional convolutional and recurrent backbones and propose WiTransformer, a novel approach based on pure Transformers. However, Transformer-like models are typically suited to large-scale datasets as pretraining models, as the experience of the Vision Transformer shows. Therefore, we adopted the body-coordinate velocity profile, a cross-domain WiFi signal feature derived from channel state information, to lower the data threshold for the Transformers. Based on this, we propose two modified Transformer architectures, the united spatiotemporal Transformer (UST) and the separated spatiotemporal Transformer (SST), to realize WiFi-based human gesture recognition models with task robustness. SST intuitively extracts spatial and temporal data features using two separate encoders, whereas UST extracts the same three-dimensional features with only a one-dimensional encoder, owing to its well-designed structure. We evaluated SST and UST on four designed task datasets (TDSs) with varying task complexities. The experimental results demonstrate that UST achieves a recognition accuracy of 86.16% on the most complex task dataset, TDSs-22, outperforming other popular backbones. Meanwhile, its accuracy decreases by at most 3.18% as task complexity increases from TDSs-6 to TDSs-22, which is 0.14–0.2 times that of the others. As predicted and analyzed, however, SST fails because of an excessive lack of inductive bias and the limited scale of the training data.

14 pages, 2437 KiB  
Article
Pose Mask: A Model-Based Augmentation Method for 2D Pose Estimation in Classroom Scenes Using Surveillance Images
by Shichang Liu, Miao Ma, Haiyang Li, Hanyang Ning and Min Wang
Sensors 2022, 22(21), 8331; https://doi.org/10.3390/s22218331 - 30 Oct 2022
Cited by 1
Abstract
Solid developments have been seen in deep-learning-based pose estimation, but few works have explored performance in dense crowds, such as a classroom scene; furthermore, no specific knowledge is considered in the design of image augmentation for pose estimation. A masked autoencoder was shown to have a non-negligible capability in image reconstruction, where the masking mechanism that randomly drops patches forces the model to build unknown pixels from known pixels. Inspired by this self-supervised learning method, where the restoration of the feature loss induced by the mask is consistent with tackling the occlusion problem in classroom scenarios, we discovered that the transfer performance of the pre-trained weights could be used as a model-based augmentation to overcome the intractable occlusion in classroom pose estimation. In this study, we proposed a top-down pose estimation method that utilized the natural reconstruction capability of missing information of the MAE as an effective occluded image augmentation in a pose estimation task. The difference with the original MAE was that instead of using a 75% random mask ratio, we regarded the keypoint distribution probabilistic heatmap as a reference for masking, which we named Pose Mask. To test the performance of our method in heavily occluded classroom scenes, we collected a new dataset for pose estimation in classroom scenes named Class Pose and conducted many experiments, the results of which showed promising performance.
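The keypoint-guided masking described above can be sketched in two steps: build a Gaussian keypoint heatmap of the kind widely used for pose supervision, then rank image patches by their keypoint probability mass so that the most likely patches are masked first. The grid size, sigma, and patch size below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def keypoint_heatmap(centers, shape, sigma=2.0):
    """Sum of 2D Gaussians, one per keypoint, as commonly used for
    heatmap supervision in top-down pose estimation."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros(shape)
    for cx, cy in centers:
        hm += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return hm

# Rank 16x16 patches of a 64x64 heatmap by keypoint probability mass;
# masking the highest-mass patches first is the Pose Mask idea, sketched.
hm = keypoint_heatmap([(20, 20), (50, 40)], (64, 64))
patches = hm.reshape(4, 16, 4, 16).sum(axis=(1, 3))  # (4, 4) patch grid
order = np.argsort(patches.ravel())[::-1]            # most likely first
```

A mask ratio would then decide how many of the top-ranked patches are dropped before MAE-style reconstruction.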
