Article

Metric-Based Key Frame Extraction for Gait Recognition

Tuanjie Wei, Rui Li, Huimin Zhao, Rongjun Chen, Jin Zhan, Huakang Li and Jiwei Wan
1 School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China
2 School of Art Design, Guangzhou College of Commerce, Guangzhou 511363, China
3 School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
* Authors to whom correspondence should be addressed.
Electronics 2022, 11(24), 4177; https://doi.org/10.3390/electronics11244177
Submission received: 11 November 2022 / Revised: 10 December 2022 / Accepted: 10 December 2022 / Published: 14 December 2022
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, Volume II)

Abstract

Gait recognition is one of the most promising biometric technologies, as it can identify individuals at a long distance. From observation, we find that both the length of the gait cycle and the quality of individual frames vary across sequences. In this paper, we propose a novel gait recognition framework to analyze human gait. On the one hand, we design the Multi-scale Temporal Aggregation (MTA) module, which models temporal information and aggregates contextual information at different scales; on the other hand, we introduce the Metric-based Frame Attention Mechanism (MFAM), which re-weights each frame by an importance score calculated from the distance between frame-level features and sequence-level features. We evaluate our model on two of the most popular public datasets, CASIA-B and OU-MVLP. For normal walking, the rank-1 accuracies on the two datasets are 97.6% and 90.1%, respectively. In complex scenarios, the proposed method achieves accuracies of 94.8% and 84.9% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively. The results show that our method performs at the top level among state-of-the-art methods.

1. Introduction

Gait recognition plays a significant role in personal identification. Unlike other biometrics such as the face, fingerprint, and iris, the human gait can be captured at a long distance, and the recognition process does not require the subject's cooperation. Therefore, with the popularization of video surveillance equipment, gait recognition technology has broad applications in crime prevention, forensic identification, and social security. However, in real-world scenarios, the performance of gait recognition degrades under many conditions, such as clothing changes, carrying conditions, and the camera viewpoint.
Recently, many deep convolutional neural network-based methods have been proposed to address these issues. Zhang et al. [1] proposed a new auto-encoder framework to explicitly separate posture and appearance features from RGB images and then used LSTMs to model the temporal changes of gait sequences. Chao et al. [2] hypothesized that the appearance of a silhouette contains position information and that the ordering of the gait sequence is unnecessary for recognition, so they proposed a novel network named GaitSet that regards gait silhouettes as a set from which to extract identity information. Fan et al. [3] employed partial features to describe the human body and proposed a new model named GaitPart, which focuses on short-range temporal features rather than redundant long-range features of gait cycles. Li et al. [4] enhanced the fine-grained learning of human partial features by segmenting and associating adjacent body parts from top to bottom. Lin et al. [5] argued that representations based on global information often neglect the details of the gait frame, while local region-based descriptors cannot capture the relations among neighboring regions, and thus designed a new global and local feature extraction module to address this issue.
These previous methods [3,4] extract fine-grained features from human body parts and model short-term motion patterns, effectively improving the performance of gait recognition models. However, we observe that differences in pedestrian walking speed and camera frame rate lead to inconsistent gait-cycle lengths (as shown in Figure 1), so a single-scale temporal modeling approach cannot adapt to the diversity of motion.
Furthermore, unlike face recognition [6,7], fingerprint recognition [8,9], etc., which extract identity information from a single image, gait recognition is based on video sequences. Early methods [10,11,12] mostly fused sequence features by generating template images. In recent years, video sequence-based methods [2,3,4,5] have aggregated sequence-level features by simple temporal pooling of frame-level features; however, this fusion strategy ignores the differences in quality between frames (as shown in Figure 1), which affects the performance of gait recognition models in various scenarios.
To alleviate these issues, we propose a novel gait recognition framework, which consists of two well-designed novel components, namely Multi-scale Temporal Aggregation (MTA) and Metric-based Frame Attention Mechanism (MFAM). MTA aggregates multi-scale context information through gait temporal modeling. MFAM calculates the importance score of each frame according to the Euclidean distance between it and the aggregated sequence features.
In summary, the major contributions of this paper are as follows:
(1) We propose the Multi-scale Temporal Aggregation module, which models gait temporal information at multiple scales, to accommodate diverse representations of motion;
(2) We introduce the Metric-based Frame Attention Mechanism, which assigns a weight to each frame using an importance score calculated from the distance between frame-level features and sequence-level features;
(3) The proposed method is evaluated on the widely used CASIA-B [13] and OU-MVLP [14] gait benchmark datasets, and the experimental results show that it achieves high recognition accuracy under cross-view and various walking conditions.

2. Related Work

2.1. Gait Recognition

Existing gait recognition methods can be divided into two main categories: model-based [15,16] and sequence-based [17,18,19,20]. Model-based methods use the structure or motion model of the human body, such as the gait period, stride length, and joint angle trajectories. Liao et al. [21] used a pose estimation method to extract 2D pose key points and extracted spatiotemporal invariant features from the gait pose, which effectively improved network performance. Later, Liao et al. [22] assumed that the 3D pose defined by 3D coordinates is view-invariant and combined it with human pose priors, such as the motion relationship between the upper and lower limbs and motion trajectories, to extract gait features, which improved the accuracy of the algorithm under changing viewpoints. Model-based methods are not sensitive to changes in covariates such as viewing angle and clothing; however, their recognition accuracy depends on the performance of pose estimation algorithms [23,24].
Early sequence-based methods usually compress the frame sequence into a gait template image (such as the Gait Energy Image (GEI) [10], gait entropy image (GEnI) [11], or period energy image (PEI) [12]) and then extract gait features from the template image. The similarity between features is measured by machine learning algorithms, and finally a label is assigned to each template image by a classifier. Recently, owing to the good performance of deep learning in various image processing tasks, Wolf et al. [25] applied 3D CNNs to capture robust spatial-temporal gait features across multiple views; however, traditional 3D CNNs require fixed-length gait sequences for classification and thus cannot directly handle videos of different lengths. Chao et al. [2] regarded the gait silhouette sequence as a set and proposed a new network named GaitSet to learn identity information from the set; GaitSet improves the recognition rate in various scenarios while remaining flexible and effective. Masood and Farooq [26] extracted spatiotemporal power spectral gait features from gait dynamics and fed them to a quadratic support vector machine classifier for gait recognition.

2.2. Temporal Modeling

In the literature, GaitSet treats a gait sequence as an unordered set of independent frames, which ensures the flexibility of the model but limits the use of temporal information. To extract temporal gait features, 1D temporal convolutions and LSTMs are usually used for temporal modeling. Fan et al. [3] argued that each part of the human body has its own unique motion pattern and used one-dimensional convolutions to extract local short-range temporal features. Zhang et al. [1] used an LSTM network to achieve long-short-term temporal modeling of gait. Lin et al. [5] assumed that the set pooling operation [2,3] causes a loss of spatial information and thus proposed a novel local temporal aggregation operation to aggregate local temporal information.
LSTM-based methods [1,19] preserve unnecessary temporal constraints and are computationally expensive. Short-term modeling methods improve recognition accuracy to a certain extent; however, a single-scale temporal modeling approach cannot adapt to the complexity of motion and changes in realistic factors. Therefore, we propose the Multi-scale Temporal Aggregation module to aggregate temporal features at multiple scales, so that the model can adapt to the diversity of motion.

2.3. Key Frame in Sequence

Compared with images, videos are richer in spatiotemporal information; however, sequences also contain much redundant information, so extracting key-frame information is crucial for many tasks. For person re-identification, Song et al. [27] proposed the RQEN model, which judges picture quality and reduces the importance of poor-quality frames. Ding et al. [28] proposed a new key frame extraction method, frame difference and cluster (FDC), which integrates the idea of K-means clustering. To obtain more discriminative gait features, Wang et al. [29] proposed a feature extraction algorithm based on the local gait energy image (LGEI) and calculated LGEIs for each key frame. Li et al. [4] proposed a residual frame attention mechanism (RFAM) to highlight the key frames of a sequence based on the slice features of each body part.
Current gait recognition methods usually aggregate frame-level features into sequence features with simple temporal pooling, which ignores the differences in importance between frames. Therefore, we propose the Metric-based Frame Attention Mechanism (MFAM), which scores the importance of each frame by measuring the distance between sequence-level features and frame-level features, and then generates a more appropriate sequence-level representation by re-weighting and aggregating the frames to highlight the key frames within the sequence.

3. Proposed Method

In this section, we first give an overview of the framework of the proposed method. We then introduce the Multi-scale Temporal Aggregation (MTA) module and the Metric-based Frame Attention Mechanism (MFAM). Finally, we describe the details of training and testing. The framework of the proposed algorithm is shown in Figure 2.

3.1. Overview

As shown in Figure 2, the input of the framework is a sequence of gait silhouettes with dimensions C × T × H × W, where C is the number of channels of the input frames, T is the number of frames in the sequence, and H and W are the height and width, respectively. In the framework, a 3D convolution is first used to extract shallow features from the original input sequence; the extracted shallow features have dimensions C1 × T × H × W. Then, the Multi-scale Temporal Aggregation (MTA) module, which is built from several parallel temporal convolutional layers, aggregates temporal features at multiple scales; the output of the MTA module has dimensions C1 × T1 × H × W. After that, an encoder network extracts frame-level features, denoted as F_t (t = 1, 2, …, T1), each with dimensions C2 × H1 × W1. In this paper, we adopt the GLFE proposed by Lin et al. [5] as the encoder. Specifically, the encoder consists of a global feature extraction branch and a local feature extraction branch, each containing three convolutional layers, with a pooling layer after the first convolutional layer. The features of the two branches are fused using an addition or concatenation operation; the concatenation operation is only performed after the last convolutional layer.
Then, we employ two parallel branches to process the frame-level features separately. On the one hand, a temporal pooling (TP) operation is adopted to aggregate the frame-level features into a sequence-level feature, denoted as F, with dimensions C2 × H1 × W1; similar operations are commonly used in [2,3,4]. Then, in order to reduce data redundancy, Generalized-Mean pooling (GeM) [5] is used to map the sequence-level feature F into a set of 1D feature vectors, denoted as S, with dimensions C2 × K. On the other hand, the GeM operation is also applied directly to the frame-level features to reduce their redundancy.
Finally, the Metric-based Frame Attention Mechanism (MFAM) calculates an importance score for each frame and re-weights all frames accordingly; the weighted features are encoded into high-dimensional vectors by several separate FC layers and used as the gait representations.

3.2. Multi-Scale Temporal Aggregation

As discussed in Section 2.2, existing methods [2,3,4] either only focus on spatial modeling and thus ignore the inter-frame dependence, or focus on the short-term features of gait cycles, which cannot adapt to the changes of complex motion and environmental factors. Therefore, we designed the Multi-scale Temporal Aggregation (MTA) module, which aims at aggregating contextual information at different scales.
As shown in Figure 3, the input of the MTA module has dimensions C1 × T × H × W, which represent the number of channels, the sequence length, and the spatial size of each frame, respectively. In order to capture temporal features at different scales, denoted as T_s, T_m, and T_l, MTA employs three parallel convolutions with different kernel sizes. The specific parameter settings of the convolutional layers are given in Table 1; in particular, all three convolutions use a temporal stride of three.
After that, MTA allows information to flow from the small scale to the large scale among the temporal features via average pooling (avg), as follows:
T_s' = T_s
T_m' = avg(T_s, T_m)
T_l' = avg(T_s, T_m, T_l)
Finally, MTA adopts element-wise max pooling (max) to aggregate the context information at the different scales:
T = max(T_s', T_m', T_l')
With the MTA module, the model aggregates context information at different temporal scales, obtains multiple temporal receptive fields through information exchange and fusion between features, and can effectively adapt to the diverse expressions of human motion.
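To make the MTA operation concrete, the following PyTorch-style sketch shows one possible implementation under the kernel and padding settings of Table 1, with a temporal stride of three. It is an illustrative sketch based on the description above, not the authors' released code; normalization and activation layers are omitted, and the placement of such layers is an assumption left open here.

```python
import torch
import torch.nn as nn

class MTA(nn.Module):
    """Sketch of Multi-scale Temporal Aggregation.
    Input:  (N, C1, T, H, W) shallow features.
    Output: (N, C1, T1, H, W), where T1 = floor((T - 3) / 3) + 1 for all three branches."""

    def __init__(self, channels):
        super().__init__()
        # Three parallel temporal convolutions (kernel/padding as in Table 1), temporal stride 3.
        self.conv_s = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), stride=(3, 1, 1), padding=(0, 0, 0))
        self.conv_m = nn.Conv3d(channels, channels, kernel_size=(5, 1, 1), stride=(3, 1, 1), padding=(1, 0, 0))
        self.conv_l = nn.Conv3d(channels, channels, kernel_size=(7, 1, 1), stride=(3, 1, 1), padding=(2, 0, 0))

    def forward(self, x):
        t_s, t_m, t_l = self.conv_s(x), self.conv_m(x), self.conv_l(x)
        # Information flows from the small scale to the large scale by averaging.
        a_s = t_s
        a_m = (t_s + t_m) / 2
        a_l = (t_s + t_m + t_l) / 3
        # Element-wise max pooling aggregates the three scales.
        return torch.max(torch.max(a_s, a_m), a_l)


# Usage: a batch of 4 sequences, 32 channels, 30 frames of 16 x 11 feature maps.
out = MTA(32)(torch.randn(4, 32, 30, 16, 11))
print(out.shape)  # torch.Size([4, 32, 10, 16, 11])
```

Because all three branches produce the same temporal length T1, the averaging and max operations can be applied element-wise without any additional resampling.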

3.3. Metric-Based Frame Attention Mechanism

The rich spatiotemporal features in video sequences provide more discriminative information for recognition tasks; however, redundant information in videos can also hurt recognition accuracy. Therefore, researchers [4,27,28,29] have focused on extracting key information from videos to improve model performance. In this paper, we obtain the importance score of each frame by calculating the similarity between the frame-level features and the sequence-level features obtained by temporal pooling, and then weight each frame and aggregate them into sequence-level features.
As shown in Figure 4, the input of the MFAM module can be divided into frame-level feature vectors and a sequence-level feature vector, with dimensions C2 × T1 × K and C2 × K, respectively, where C2 is the number of channels, T1 is the sequence length, and K is the number of preset body parts.
In order to assess the importance of each frame, MFAM first calculates the Euclidean distance between the frame-level feature vectors and the sequence-level feature vector and applies max-min normalization to the distances. Then, we negate the normalized distances and apply the sigmoid activation function to obtain the score (denoted as w_t) of each frame. The score of each frame is calculated as follows:
D_t = ||p_t − P||_2
w_t = σ(−(D_t − min(D_t)) / (max(D_t) − min(D_t)))
where σ(·) denotes the sigmoid function, and min(D_t) and max(D_t) denote the minimum and maximum distances over all frames in the sequence, respectively.
After that, all frames are weighted by their importance scores, and the re-weighted frame-level features (denoted as p_t') are aggregated into a re-weighted sequence-level feature (denoted as P') by temporal pooling (TP). In this paper, the temporal pooling module employs a global max pooling (GAP) operation to aggregate features. The calculation process is as follows:
p_t' = w_t · p_t
P' = GAP(p_t')
Finally, MFAM fuses the initial sequence-level feature with the weighted sequence-level feature to obtain the final sequence-level feature (denoted as P_fuse):
P_fuse = P + P'
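A minimal sketch of MFAM in the same PyTorch style is given below. It assumes the distance is computed over the channel dimension for each frame and body part, and that temporal pooling is a max over frames as stated above; the small epsilon and the exact tensor layout are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MFAM(nn.Module):
    """Sketch of the Metric-based Frame Attention Mechanism.
    frame_feat: (N, C2, T1, K) frame-level part vectors p_t
    seq_feat:   (N, C2, K)     sequence-level vector P from temporal pooling"""

    def forward(self, frame_feat, seq_feat):
        # Euclidean distance between each frame and the sequence feature, per part -> (N, T1, K).
        dist = torch.norm(frame_feat - seq_feat.unsqueeze(2), dim=1)
        # Max-min normalisation over time, negation, then sigmoid: closer frames score higher.
        d_min = dist.min(dim=1, keepdim=True).values
        d_max = dist.max(dim=1, keepdim=True).values
        w = torch.sigmoid(-(dist - d_min) / (d_max - d_min + 1e-6))   # (N, T1, K)
        # Re-weight the frames and aggregate over time with max pooling.
        weighted = frame_feat * w.unsqueeze(1)                        # (N, C2, T1, K)
        seq_weighted = weighted.max(dim=2).values                     # (N, C2, K)
        # Fuse the original and re-weighted sequence-level features.
        return seq_feat + seq_weighted


# Usage: 4 sequences, 128 channels, 10 aggregated frames, 32 body parts.
fused = MFAM()(torch.randn(4, 128, 10, 32), torch.randn(4, 128, 32))
print(fused.shape)  # torch.Size([4, 128, 32])
```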

3.4. Training and Testing

During the training stage, we input gait sequences into the network and obtain the gait feature descriptors; the batch size of the training data is p × k, where p is the number of subjects and k is the number of training samples per subject in the batch. Then, the Batch All (BA+) triplet loss [30] and the cross-entropy loss are employed to optimize the model.
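For reference, a compact sketch of a Batch-All triplet loss that averages only the non-zero terms (a common reading of BA+ [30]) is shown below. Here the descriptors are flattened to one vector per sample, whereas the actual model applies the loss per body part and adds a cross-entropy term; those choices are assumptions made for illustration.

```python
import torch

def batch_all_triplet_loss(feat, labels, margin=0.2):
    """feat: (B, D) descriptors, labels: (B,) subject IDs."""
    dist = torch.cdist(feat, feat)                            # (B, B) pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)      # (B, B) same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=feat.device)
    pos_mask = same_id & ~eye                                 # positives: same ID, different sample
    neg_mask = ~same_id                                       # negatives: different ID
    # loss[a, p, n] = margin + d(a, p) - d(a, n) over every valid triplet in the batch.
    loss = torch.relu(margin + dist.unsqueeze(2) - dist.unsqueeze(1))
    loss = loss * (pos_mask.unsqueeze(2) & neg_mask.unsqueeze(1))
    return loss.sum() / (loss > 0).sum().clamp(min=1)         # average over non-zero terms
```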
During the testing stage, the test dataset is divided into a gallery set and a probe set, and we input whole gait sequences into the network to generate gait feature descriptors. To calculate rank-1 accuracy, the gallery set is regarded as the standard view to be retrieved, and the descriptors of the probe sequences are matched against the gallery descriptors based on the average Euclidean distance.
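The retrieval step can be sketched as follows; we assume descriptors of shape (num_sequences, C, K) and omit the exclusion of identical-view pairs, which the reported protocol additionally applies.

```python
import torch

def rank1_accuracy(probe_feat, probe_ids, gallery_feat, gallery_ids):
    """probe_feat: (P, C, K), gallery_feat: (G, C, K); IDs are (P,) and (G,)."""
    diff = probe_feat.unsqueeze(1) - gallery_feat.unsqueeze(0)   # (P, G, C, K)
    dist = diff.norm(dim=2).mean(dim=-1)                         # Euclidean distance per part, averaged -> (P, G)
    nearest = dist.argmin(dim=1)                                 # closest gallery sequence for each probe
    return (gallery_ids[nearest] == probe_ids).float().mean().item()
```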

4. Experimental Results

4.1. Dataset

We use two open databases, CASIA-B [13] and OU-MVLP [14], to evaluate the performance of the proposed method.
CASIA-B. CASIA-B is a widely used gait dataset containing the gait sequences of 124 subjects; each subject is captured under 3 walking conditions and 11 views (0°–180°, with an interval of 18°). The walking conditions comprise normal walking (NM) (six sequences per subject), walking with a bag (BG) (two sequences per subject), and walking while wearing a coat or jacket (CL) (two sequences per subject). In other words, each subject has 11 × (6 + 2 + 2) = 110 sequences. As there is no official partition of this dataset into training and test sets, we conduct large-sample training (LT), medium-sample training (MT), and small-sample training (ST) following Chao et al. [2]. During the testing stage, the first four sequences of the NM condition (NM#1–4) are stored in the gallery set, and the remaining six sequences (NM#5–6, BG#1–2, and CL#1–2) form the probe set.
OU-MVLP. OU-MVLP is currently the world's largest public gait dataset, containing the gait sequences of 10,307 subjects; each subject has two sequences (#00 and #01), each captured from fourteen views (0°, 15°, …, 90°, 180°, 195°, …, 270°). The sequences are divided into training and test sets by subject (5153 subjects for training and 5154 subjects for testing). During the testing stage, sequences with index #01 are kept in the gallery and those with index #00 are used as probes.

4.2. Training and Testing Details

In all experiments, the silhouettes are provided directly by the datasets, aligned by the method proposed by Takemura et al. [14], and resized to 64 × 44. In the training stage, we choose Adam [31] as the optimizer and set the margin of the Batch All (BA+) triplet loss to 0.2. For CASIA-B, the batch size parameters P and K are set to 8 and 10, respectively. In the ST, MT, and LT settings, the number of iterations is set to 60 K, 80 K, and 80 K, respectively, and the learning rate is set to 1 × 10−4. For OU-MVLP, the batch size parameters P and K are set to 32 and 10, respectively, and the number of iterations is set to 250 K. The learning rate is initially set to 1 × 10−4 and reduced to 1 × 10−5 and 5 × 10−6 after 180 K and 230 K iterations, respectively.
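For the OU-MVLP schedule described above, one way to express the step-wise learning-rate decay in PyTorch is sketched below; the model here is a stand-in for the full gait network of Section 3, and the use of LambdaLR (rather than a manual reset) is our choice for illustration.

```python
import torch

model = torch.nn.Linear(128, 128)   # placeholder for the full gait network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_factor(step):
    # 1e-4 until 180 K iterations, 1e-5 until 230 K, 5e-6 afterwards.
    if step < 180_000:
        return 1.0
    if step < 230_000:
        return 0.1
    return 0.05

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# Call optimizer.step() and then scheduler.step() once per training iteration.
```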

4.3. Comparison with State-of-the-Art Methods

4.3.1. Experimental Results on CASIA-B Dataset

In order to verify the efficacy and superiority of our method, we compare our model with other state-of-the-art gait recognition models, including CNN-LB [32], MGAN [33], GaitSet [2], GaitSlice [4], and GaitGL [5], on the CASIA-B gait dataset in terms of rank-1 accuracy. The experimental results are shown in Table 2, Table 3 and Table 4, and Figure 5. Except for ours, all results are taken directly from the original papers. All results are averaged over the 11 gallery views, excluding identical-view cases.
As can be seen from Table 2, our model obtains strong results under the LT setting. Under the three walking conditions NM, BG, and CL, the average recognition accuracies reach 97.6%, 94.8%, and 84.9%, outperforming GaitGL [5] by 0.2%, 0.3%, and 1.3%, respectively. The comparison shows that the largest gain of our method is in recognition accuracy under the CL condition. Due to the occlusion caused by clothing, the recognition rates of existing models under the CL condition are low, but in our method, the extraction of key frames reduces the interference of redundant information and thus improves recognition accuracy.
From Table 3 and Table 4, it can be seen that, under the ST and MT settings, our model does not hold an absolute advantage over existing methods in the NM and BG conditions. However, in the CL condition, our model achieves average recognition accuracies of 80.0% (MT) and 60.0% (ST); compared with the best-performing GaitGL [5], the recognition accuracy is improved by 1.7% (MT) and 2.6% (ST).
Moreover, compared with other views, existing models have the lowest recognition rates at the 0° and 180° views due to excessive interference information. Our method improves the recognition rate in most views, especially at 0° and 180°. For example, compared with GaitGL, the recognition rates under the three walking conditions are improved by 1.2%, 0.4%, and 3.9%, respectively, at the 180° view under the LT setting.
In summary, the comparative analysis shows that our model outperforms existing gait recognition models under the LT, MT, and ST settings, especially in complex application scenarios (such as the coat-wearing condition and the 0° and 180° views); this demonstrates the effectiveness and superiority of our model.

4.3.2. Experimental Results on OU-MVLP Dataset

In order to verify the generalization ability of the proposed method, we further evaluate its performance on the OU-MVLP dataset. As shown in Table 5, compared with existing models (including GEINet [34], GaitSet [2], GaitPart [3], GaitSlice [4], and GaitGL [5]), our model achieves the highest accuracy in all views, especially in views with less discriminative information and more redundant information, such as 0°, 90°, 180°, and 270°. The comparative results on the OU-MVLP dataset show that our model generalizes well.

4.4. Ablation Experiment

To verify the effectiveness of MTA and MFAM in the proposed framework, several ablation studies with various settings are conducted on CASIA-B. For MTA, we study the influence of different convolution kernel sizes on model performance. For MFAM, we study the influence of different distance metrics and different data normalization methods on the recognition rate. The experimental results and analysis are as follows.

4.4.1. Efficacy of MTA

In order to aggregate context information at different scales and adapt the model to complex human motion patterns, we introduce the Multi-scale Temporal Aggregation module into the proposed method. To analyze appropriate parameter settings for the MTA operation, three controlled experiments are conducted in experiment Group A.
As shown in Table 6, to verify the effectiveness of the MTA module, we design comparison experiments with different convolutional strategies under the LT setting. Specifically, experiments A-a, A-b, and A-c each adopt only one convolution kernel of a different size, and their comparison with experiment A-d demonstrates the effectiveness of aggregating multi-scale contextual information. In addition, the comparison among experiments A-a, A-b, and A-c shows that enlarging the convolution kernel in MTA leads to a decrease in recognition accuracy in complex scenes. Therefore, we adopt the multi-scale combination of experiment A-d (the setting in Table 1) as the convolution configuration.

4.4.2. Efficacy of MFAM

Video sequences are rich in semantic information, but they also bring many redundant features. In this paper, we introduce the Metric-based Frame Attention Mechanism to extract key-frame features from sequences. As discussed in Section 3.3, MFAM calculates the importance score of each frame by measuring the Euclidean distance between frame-level features and sequence-level features. To analyze the rationality of using the Euclidean distance in MFAM to calculate the importance score, we also test cosine similarity in place of the Euclidean distance, as follows:
D_t = (p_t · P) / (||p_t|| · ||P||)
Moreover, normalizing the data can eliminate the undesirable effects caused by outlier samples. In this section, we also consider Z-score normalization of the distances, expressed as:
D_t' = (D_t − μ) / σ
where μ and σ denote the mean and standard deviation of the distances, respectively.
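For clarity, the two ablation variants can be written as drop-in replacements for the scoring step in the MFAM sketch above (same assumed shapes); the epsilon terms guard against division by zero and are our addition.

```python
import torch
import torch.nn.functional as F

def cosine_scores(frame_feat, seq_feat, eps=1e-6):
    """Cosine similarity between each frame and the sequence feature -> (N, T1, K)."""
    expanded = seq_feat.unsqueeze(2).expand_as(frame_feat)
    return F.cosine_similarity(frame_feat, expanded, dim=1, eps=eps)

def zscore_normalize(dist, eps=1e-6):
    """Z-score normalisation of the per-frame distances over the temporal dimension."""
    mu = dist.mean(dim=1, keepdim=True)
    sigma = dist.std(dim=1, keepdim=True)
    return (dist - mu) / (sigma + eps)
```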
As shown in Table 7, to verify the effectiveness of the MFAM module, we design comparison experiments with different importance-score calculation strategies under the LT setting.
On the one hand, the comparison shows that employing the Euclidean distance to measure the similarity between features leads to better performance; this is consistent with the fact that the Euclidean distance is also used to match descriptors during evaluation. On the other hand, the comparison of data normalization methods shows that Z-score normalization is only advantageous in the NM scenario; in complex application scenarios such as BG and CL, max-min normalization yields superior recognition accuracy. Therefore, we choose the combination of Euclidean distance and max-min normalization to calculate the importance score of each frame.

4.5. Practicality Experiments

In real-world settings, it may be difficult to acquire a sufficient number of frames for gait recognition. To verify the practicability of the proposed model, we select a certain number of frames for each subject during the testing phase.
As shown in Table 8, we select different numbers of frames as input. The model achieves accuracies of 57.4%, 50.4%, and 33.2% under the three walking conditions when only 10 frames are input; because the model uses an encoder composed of 3D convolutions, it is difficult to extract gait temporal features when the number of input frames is insufficient. When the input reaches 30 frames, the recognition accuracies rise to 94.5%, 90.1%, and 76.6%. When the input exceeds 70 frames, the recognition accuracy tends to be stable.

4.6. Portability Experiments

It is worth noting that our MFAM operation can be plugged into several state-of-the-art gait recognition models [2,3,4,5]. In the GaitSlice model proposed by Li et al. [4], the RFAM module calculates the importance score of each frame with a frame attention network. To compare the performance of the two attention mechanisms, three controlled experiments are conducted. In the experiments, we use the GaitSlice [4] model with the RFAM module removed (denoted as GaitSlice*, which achieves accuracies of 96.6%, 91.7%, and 80.7% under the three walking conditions) as the baseline.
As shown in Figure 6, with the RFAM module alone, the recognition rates in the three walking scenarios increase by 0.06%, 0.71%, and 0.86%, respectively. With our MFAM module alone, the recognition rates increase by 0.16%, 0.75%, and 1.13%, outperforming RFAM by 0.1%, 0.04%, and 0.27%, respectively. With both attention modules, the recognition rates increase by 0.25%, 0.81%, and 1.28%, respectively. These results demonstrate the effectiveness and superiority of the proposed MFAM operation.

4.7. Visualization

In order to better understand the role of the MFAM module, the importance scores of several frames are shown in Figure 7. Note that the model horizontally divides the human body into 32 parts and calculates an importance score for each part. For ease of illustration, Figure 7 shows the average importance score of every eight adjacent parts; in other words, we show the weights of four body regions. It can be seen from Figure 7 that the importance scores of the frames in a sequence differ, indicating that the frames carry different amounts of semantic information.

5. Conclusions

In this paper, we propose a new gait recognition framework that models gait temporal features and highlights the key frames in a sequence to improve recognition accuracy. Specifically, the proposed MTA module extracts multi-scale context information through parallel convolutions and carries out information exchange and fusion, so that the model can adapt to changes in complex motion and realistic factors. In MFAM, the importance score of each frame is calculated from the distance between frame-level features and sequence-level features; more discriminative gait features are then extracted by re-weighting each frame to highlight the key frames. Finally, experiments conducted on the widely adopted public databases CASIA-B and OU-MVLP demonstrate the superiority of the proposed method. Nevertheless, the CASIA-B and OU-MVLP datasets were collected in indoor environments; although they include different viewing angles, clothing, and carrying conditions, they still differ from pedestrian data captured in real-world conditions. In future work, the model will be optimized for datasets collected in open environments.

Author Contributions

Conceptualization, T.W. and R.C.; methodology, T.W. and H.Z.; software, R.L. and R.C.; validation, J.Z. and H.L.; formal analysis, R.L. and J.W.; investigation, J.Z. and J.W.; resources, R.L. and H.Z.; data curation, T.W. and J.Z.; writing—original draft preparation, T.W. and H.L.; writing—review and editing, T.W. and H.Z.; supervision, H.Z.; project administration, T.W. and H.Z.; funding acquisition, H.Z. and R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No.62072122), the Scientific and Technological Planning Projects of Guangdong Province (2021A0505030074), the Scientific Research Capability Improvement Project of Guangdong Key Construction subject (2021ZDJS025), the Postgraduate Education Innovation Plan Project of Guangdong Province (2020SFKC054), the Special Projects in Key Fields of Ordinary Universities of Guangdong Province under Grant (2021ZDZX1087) and the Special Projects in key Fields of Department of Education of Guangdong Province (2022ZDZX1013).

Data Availability Statement

The data are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Zhang, Z.; Tran, L.; Liu, F.; Liu, X. On learning disentangled representations for gait recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 345–360.
2. Chao, H.; He, Y.; Zhang, J.; Feng, J. GaitSet: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8126–8133.
3. Fan, C.; Peng, Y.; Cao, C.; Liu, X.; Hou, S.; Chi, J.; Huang, Y.; Li, Q.; He, Z. GaitPart: Temporal part-based model for gait recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14225–14233.
4. Li, H.; Qiu, Y.; Zhao, H.; Zhan, J.; Chen, R.; Wei, T.; Huang, Z. GaitSlice: A gait recognition model based on spatio-temporal slice features. Pattern Recognit. 2022, 124, 108453.
5. Lin, B.; Zhang, S.; Yu, X. Gait recognition via effective global-local feature representation and local temporal aggregation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14648–14656.
6. Wang, K.; Wang, S.; Zhang, P.; Zhou, Z.; Zhu, Z.; Wang, X.; Peng, X.; Sun, B.; Li, H.; You, Y. An efficient training approach for very large scale face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4083–4092.
7. He, M.; Zhang, J.; Shan, S.; Chen, X. Enhancing Face Recognition with Self-Supervised 3D Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 4062–4071.
8. Öztürk, H.İ.; Selbes, B.; Artan, Y. MinNet: Minutia Patch Embedding Network for Automated Latent Fingerprint Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1627–1635.
9. Chen, S.; Guo, Z.; Li, X.; Yang, D. Query2Set: Single-to-Multiple Partial Fingerprint Recognition Based on Attention Mechanism. IEEE Trans. Inf. Forensics Secur. 2022, 17, 1243–1253.
10. Han, J.; Bhanu, B. Individual recognition using gait energy image. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 28, 316–322.
11. Bashir, K.; Xiang, T.; Gong, S. Gait recognition using gait entropy image. In Proceedings of the 3rd International Conference on Imaging for Crime Detection and Prevention (ICDP 2009), London, UK, 3 December 2009.
12. Wang, C.; Zhang, J.; Wang, L.; Pu, J.; Yuan, X. Human identification using temporal information preserving gait template. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 2164–2176.
13. Yu, S.; Tan, D.; Tan, T. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, China, 20–24 August 2006; Volume 4, pp. 441–444.
14. Takemura, N.; Makihara, Y.; Muramatsu, D.; Echigo, T.; Yagi, Y. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Trans. Comput. Vis. Appl. 2018, 10, 4.
15. Rong, Z.; Vogler, C.; Metaxas, D. Human gait recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, Washington, DC, USA, 27 June–2 July 2004.
16. Hong, C.; Yu, J.; Tao, D.; Wang, M. Image-based three-dimensional human pose recovery by multiview locality-sensitive sparse retrieval. IEEE Trans. Ind. Electron. 2014, 62, 3742–3751.
17. Huang, Z.; Xue, D.; Shen, X.; Tian, X.; Li, H.; Huang, J.; Hua, X.-S. 3D local convolutional neural networks for gait recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14920–14929.
18. Hou, S.; Cao, C.; Liu, X.; Huang, Y. Gait lateral network: Learning discriminative and compact representations for gait recognition. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; pp. 382–398.
19. Zhang, Z.; Tran, L.; Yin, X.; Atoum, Y.O.; Wan, J.; Wang, N.; Liu, X. Gait recognition via disentangled representation learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
20. Qin, H.; Chen, Z.; Guo, Q.; Wu, Q.J.; Lu, M. RPNet: Gait Recognition with Relationships between Each Body-Parts. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2990–3000.
21. Liao, R.; Cao, C.; Garcia, E.B.; Yu, S.; Huang, Y. Pose-based temporal-spatial network (PTSN) for gait recognition with carrying and clothing variations. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2017.
22. Liao, R.; Yu, S.; An, W.; Huang, Y. A model-based gait recognition method with body pose and human prior knowledge. Pattern Recognit. 2020, 98, 107069.
23. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
24. Guler, R.A.; Neverova, N.; Kokkinos, I. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7297–7306.
25. Wolf, T.; Babaee, M.; Rigoll, G. Multi-view gait recognition using 3D convolutional neural networks. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 4165–4169.
26. Masood, H.; Farooq, H. Utilizing Spatio Temporal Gait Pattern and Quadratic SVM for Gait Recognition. Electronics 2022, 11, 2386.
27. Song, G.; Leng, B.; Liu, Y.; Hetang, C.; Cai, S. Region-based Quality Estimation Network for Large-scale Person Re-identification. arXiv 2017, arXiv:1711.08766.
28. Ding, Y.; Hou, S.; Yang, X.; Du, W.; Wang, C.; Yin, G. Key Frame Extraction Based on Frame Difference and Cluster for Person Re-identification. In Proceedings of the IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI), Atlanta, GA, USA, 18–21 October 2021; pp. 573–578.
29. Wang, X.; Feng, S.; Yan, W.Q. Human Gait Recognition Based on Self-Adaptive Hidden Markov Model. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 18, 963–972.
30. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737.
31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
32. Wu, Z.; Huang, Y.; Wang, L.; Wang, X.; Tan, T. A comprehensive study on cross-view gait based human identification with deep CNNs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 209–226.
33. He, Y.; Zhang, J.; Shan, H.; Wang, L. Multi-task GANs for view-specific feature learning in gait recognition. IEEE Trans. Inf. Forensics Secur. 2019, 14, 102–113.
34. Shiraga, K.; Makihara, Y.; Muramatsu, D.; Echigo, T.; Yagi, Y. GEINet: View-invariant gait recognition using a convolutional neural network. In Proceedings of the 2016 International Conference on Biometrics (ICB), Halmstad, Sweden, 13–16 June 2016; pp. 1–8.
Figure 1. Two complete gait cycle sequences from subjects '01' and '10' on CASIA-B. The top periodic sequence contains 24 frames, while the bottom contains 30 frames; the quality of frames in the sequence is different, and the frames with clear outlines and rich information are regarded as key frames (such as the frames in the red box).
Figure 2. Overview of the gait recognition framework. The Multi-scale Temporal Aggregation (MTA) module extracts contextual information at different temporal scales. The Metric-based Frame Attention Mechanism (MFAM) is employed to highlight key frames in sequences. TP and GeM represent Temporal Pooling and Generalized-Mean pooling, respectively.
Figure 3. The structure of Multi-scale Temporal Aggregation (MTA). The MTA module employs several parallel convolutions that slide along the sequence dimension to obtain features at different scales.
Figure 4. The structure of the Metric-based Frame Attention Mechanism (MFAM). p_t and P represent the sets of human body part feature vectors for the frame-level features (s_t) and the sequence-level features (S), respectively.
Figure 5. Average rank-1 recognition accuracy of our method compared to other state-of-the-art gait recognition models on CASIA-B under three partition settings.
Figure 6. Performance comparison between MFAM and RFAM. Results are the improvements in rank-1 accuracy over the baseline (GaitSlice*), excluding identical-view cases. (a) Accuracy improvement from introducing RFAM; (b) accuracy improvement from introducing MFAM; (c) accuracy improvement from introducing both MFAM and RFAM.
Figure 7. Importance scores for frames. The red words are the importance score of each part, and the blue words are the importance score of each frame.
Table 1. The structure of MTA.

              kernel_1   kernel_2   kernel_3
kernel_size   (3,1,1)    (5,1,1)    (7,1,1)
padding       (0,0,0)    (1,0,0)    (2,0,0)
Table 2. Averaged rank-1 accuracies (%) on CASIA-B under the LT setting, excluding identical-view cases. Gallery: NM#1-4, views 0°–180°.

Probe     Method      0°     18°    36°    54°    72°    90°    108°   126°   144°   162°   180°   Mean
NM#5-6    CNN-LB      82.6   90.3   96.1   94.3   90.1   87.4   89.9   94.0   94.7   91.3   78.5   89.9
          GaitSet     90.8   97.9   99.4   96.9   93.6   91.7   95.0   97.8   98.9   96.8   85.8   95.0
          GaitSlice   95.5   99.2   99.6   99.0   94.4   92.5   95.0   98.1   99.7   98.3   92.9   96.7
          GaitGL      96.0   98.3   99.0   97.9   96.9   95.4   97.0   98.9   99.3   98.8   94.0   97.4
          Ours        96.7   98.6   98.9   98.0   96.7   95.4   97.2   98.7   99.2   98.7   95.2   97.6
BG#1-2    CNN-LB      64.2   80.6   82.7   76.9   64.8   63.1   68.0   76.9   82.2   75.4   61.3   72.4
          GaitSet     83.8   91.2   91.8   88.8   83.3   81.0   84.1   90.0   92.2   94.4   79.0   87.2
          GaitSlice   90.2   96.4   96.1   94.9   89.3   85.0   90.9   94.5   96.3   95.0   88.1   92.4
          GaitGL      92.6   96.6   96.8   95.5   93.5   89.3   92.2   96.5   98.2   96.9   91.5   94.5
          Ours        93.9   96.5   96.8   95.9   93.5   89.6   92.5   97.0   98.0   96.8   91.9   94.8
CL#1-2    CNN-LB      37.7   57.2   66.6   61.1   55.2   54.6   55.2   59.1   58.9   48.8   39.4   54.0
          GaitSet     61.4   75.4   80.7   77.3   72.1   70.1   71.5   73.5   73.5   68.4   50.0   70.4
          GaitSlice   75.6   87.0   88.9   86.5   80.5   77.5   79.1   84.0   84.8   83.6   70.1   81.6
          GaitGL      76.6   90.0   90.3   87.1   84.5   79.0   84.1   87.0   87.3   84.4   69.5   83.6
          Ours        78.0   90.0   91.6   88.5   84.6   79.8   84.7   88.2   88.2   86.4   73.4   84.9
Table 3. Averaged rank-1 accuracies (%) on CASIA-B under the MT setting, excluding identical-view cases. Gallery: NM#1-4, views 0°–180°.

Probe     Method      0°     18°    36°    54°    72°    90°    108°   126°   144°   162°   180°   Mean
NM#5-6    MGAN        54.9   65.9   72.1   74.8   71.1   65.7   70.0   75.6   76.2   68.6   53.8   68.1
          GaitSet     86.8   95.2   98.0   94.5   91.5   89.1   91.1   95.0   97.4   93.7   80.2   92.0
          GaitSlice   92.2   97.3   98.9   98.4   94.2   90.3   94.2   97.5   99.2   96.6   89.4   95.3
          GaitGL      93.9   97.6   98.8   97.3   95.2   92.7   95.6   98.1   98.5   96.5   91.2   95.9
          Ours        94.1   97.8   99.1   97.2   95.3   92.5   95.7   98.1   98.8   96.7   91.5   96.1
BG#1-2    MGAN        48.5   58.5   59.7   58.0   53.7   49.8   54.0   61.3   59.5   55.9   43.1   54.7
          GaitSet     79.9   89.8   91.2   86.7   81.6   76.7   81.0   88.2   90.3   88.5   73.0   84.3
          GaitSlice   85.2   92.2   95.3   94.2   87.8   83.8   87.1   93.1   93.4   91.6   80.9   89.5
          GaitGL      88.5   95.1   95.9   94.2   91.5   85.4   89.0   95.4   97.4   94.3   86.3   92.1
          Ours        88.7   94.6   96.4   94.6   90.5   86.0   89.3   95.5   97.8   95.0   86.9   92.3
CL#1-2    MGAN        23.1   34.5   36.3   33.3   32.9   32.7   34.2   37.6   33.7   26.7   21.0   31.5
          GaitSet     52.0   66.0   72.8   69.3   63.1   61.2   63.5   66.5   67.5   60.0   45.9   62.5
          GaitSlice   63.9   78.9   82.9   81.7   74.5   70.1   73.4   77.5   77.5   73.7   62.5   74.2
          GaitGL      70.7   83.2   87.1   84.7   78.2   71.3   78.0   83.7   83.6   77.1   63.1   78.3
          Ours        71.9   85.0   88.3   86.7   78.7   74.7   79.8   83.8   85.4   80.6   65.2   80.0
Table 4. Averaged rank-1 accuracies (%) on CASIA-B under the ST setting, excluding identical-view cases. Gallery: NM#1-4, views 0°–180°.

Probe     Method      0°     18°    36°    54°    72°    90°    108°   126°   144°   162°   180°   Mean
NM#5-6    CNN-LB      54.8   -      -      77.8   -      64.9   -      76.1   -      -      -      68.4
          GaitSet     64.6   83.3   90.4   86.5   80.2   75.5   80.3   86.0   87.1   81.4   59.6   79.5
          GaitSlice   75.7   84.5   92.3   91.3   82.8   77.1   83.1   89.3   91.0   86.2   71.4   84.1
          GaitGL      77.0   87.8   93.9   92.7   83.9   78.7   84.7   91.5   92.5   89.3   74.4   86.0
          Ours        77.4   87.6   94.0   92.8   83.9   78.8   84.6   91.7   92.6   89.5   74.8   86.2
BG#1-2    GaitSet     55.8   70.5   76.9   75.5   69.7   63.4   68.0   75.8   76.2   70.7   52.5   68.6
          GaitSlice   67.8   75.0   81.7   82.6   73.8   66.3   73.3   80.6   80.1   75.5   62.1   74.4
          GaitGL      68.1   81.2   87.7   84.9   76.3   70.5   76.1   84.5   87.0   83.6   65.0   78.6
          Ours        68.8   79.9   86.8   84.9   77.4   71.4   76.4   84.2   87.3   84.4   67.2   78.9
CL#1-2    GaitSet     29.4   43.1   49.5   48.7   42.3   40.3   44.9   47.4   43.0   35.7   25.6   40.9
          GaitSlice   42.9   55.7   62.2   59.1   54.9   51.3   55.6   55.9   53.6   48.4   35.4   52.3
          GaitGL      46.9   58.7   66.6   65.4   58.3   54.1   59.5   62.7   61.3   57.1   40.6   57.4
          Ours        47.9   60.2   68.7   68.2   61.8   58.0   63.4   66.3   63.6   59.4   43.0   60.0
Table 5. Averaged rank-1 accuracies (%) on OU-MVLP, excluding identical-view cases. Gallery: all 14 views.

Method      0°     15°    30°    45°    60°    75°    90°    180°   195°   210°   225°   240°   255°   270°   Mean
GEINet      11.4   29.1   41.5   45.5   39.5   41.8   38.9   14.9   33.1   43.2   45.6   39.4   40.5   36.3   35.8
GaitSet     79.5   87.9   89.9   90.2   88.1   88.7   87.8   81.7   86.7   89.0   89.3   87.2   87.8   86.2   87.1
GaitPart    82.6   88.9   90.8   91.0   89.7   89.9   89.5   85.2   88.1   90.0   90.1   89.0   89.1   88.2   88.7
GaitSlice   84.1   89.0   91.2   91.6   90.6   89.9   89.8   85.7   89.3   90.6   90.7   89.8   89.6   88.5   89.3
GaitGL      84.9   90.2   91.1   91.5   91.1   90.8   90.3   88.5   88.6   90.3   90.4   89.6   89.5   88.8   89.7
Ours        86.3   90.8   91.3   91.6   91.2   91.0   90.7   89.4   89.5   90.5   90.6   90.0   89.8   89.3   90.1
Table 6. Ablation experiments for Multi-scale Temporal Aggregation. Control condition: the different convolutional combination strategies.

Group A   kernel_1   kernel_2   kernel_3   NM     BG     CL
a         (3,1,1)    -          -          97.5   94.5   84.5
b         -          (5,1,1)    -          97.5   94.6   84.2
c         -          -          (7,1,1)    97.7   94.5   84.1
d         (3,1,1)    (5,1,1)    (7,1,1)    97.6   94.8   84.9
Table 7. Ablation experiments for the Metric-based Frame Attention Mechanism. Control condition: the different data normalization methods and distance metric strategies. D_E and D_C denote the Euclidean distance and cosine similarity, respectively; N_m and N_z denote max-min normalization and Z-score normalization, respectively.

Distance   Normalization   NM     BG     CL
D_E        N_m             97.6   94.8   84.9
D_E        N_z             97.7   94.6   84.3
D_C        N_m             97.5   94.7   84.2
D_C        N_z             97.6   94.6   84.2
Table 8. Results with different numbers of input frames. Results are rank-1 accuracies (%) averaged over all 11 views, excluding identical-view cases.

Number of Frames   NM     BG     CL
10                 57.4   50.4   33.2
20                 86.3   80.1   60.8
30                 94.5   90.1   76.6
50                 96.9   93.4   82.5
70                 97.4   94.3   83.5
100                97.5   94.5   83.9