# Attention-Based 3D Human Pose Sequence Refinement Network


## Abstract


## 1. Introduction

- We propose a novel method to refine a 3D human pose sequence consisting of 3D rotations of joints. The proposed method performs human pose refinement independently from existing 3D human pose estimation methods. It can be applied to the results of any existing method in a model-agnostic manner and is easy to use.
- The proposed method is based on a simple but effective weighted-averaging operation and generates interpretable affinity weights using a non-local attention mechanism.
- In accordance with our experimental results, the proposed method consistently improves the 3D pose estimation and mesh reconstruction performance (i.e., accuracy and smoothness of output sequences) of existing methods for various real datasets.

## 2. Related Work

**Human mesh reconstruction.** Many recent 3D human mesh reconstruction methods directly regress the parameters of statistical shape models such as SMPL [1]. These methods can be broadly classified into single-image-based approaches [7,8,9,10] and video-based approaches [2,3,4].

**Non-local attention.** Non-local attention was proposed to model long-range dependencies in natural language processing [11,12] and computer vision [13,14,15]. Vaswani et al. [12] proposed the transformer, a framework that uses only attention mechanisms to overcome the limitations of recurrent models for natural language processing tasks and successfully addresses the long-range dependency problem. The transformer architecture has recently been shown to improve image recognition performance and is actively used for various computer vision tasks [16,17,18,19,20]. Wang et al. [13] modeled long-range dependencies in image features using the non-local operations proposed in [21] and introduced a non-local block based on attention mechanisms; from this perspective, the method in [12] can be regarded as a special case of non-local neural networks. Cao et al. [14] qualitatively analyzed the position-wise attention maps of [13], observed that the attention maps of most positions are similar, and on this basis proposed a more efficient non-local attention block. Woo et al. [15] proposed a method that extracts new features by successively applying channel attention and spatial attention to input features, yielding stronger representation power than existing fully convolutional baselines. Our method generates a temporal non-local attention map inspired by [13,21]. The generated attention weights suppress features that are useless for refinement and strengthen helpful ones, which allows our method to refine noisy pose parameter sequences.

**Human pose refinement.** Human pose refinement studies mainly aim to refine an estimated sparse joint set. Existing pose refinement methods are either built into the joint regression network or used as a post-processing module on inference results. Newell et al. [22] proposed a network in which several hourglass modules are stacked. The hourglass module repeats top-down and bottom-up processing, extracts features at various scales, and is trained with intermediate supervision; each stage generates a heatmap that is used as input to the next stage for refinement. Chen et al. [23] proposed a cascaded pyramid network that combines GlobalNet, a ResNet-based pyramid network, with RefineNet, which refines the heatmap generated by GlobalNet. RefineNet considers all features obtained from each step of the pyramid to find occluded joint positions that are difficult to estimate. Moon et al. [24] proposed a model-agnostic refinement model based on the error distribution of 2D pose estimation models investigated in Ronchi et al.'s work [25]. Because it does not work in an end-to-end manner, this method is independent of the pose estimation model, and it improves pose estimation performance for various existing approaches. Mall et al. [26] proposed a method to refine noisy motion capture data: a network consisting of linear layers and bidirectional long short-term memories regresses the standard deviation of a Gaussian kernel for the pose of the current frame, and a denoised pose is obtained by using this kernel to compute a temporally weighted sum over the input noisy pose sequence. In [26], the 3D human pose is represented as 126 joint angles, and the weighted sum is computed over this joint angle sequence. Our work provides a more reliable basis for computation in the non-Euclidean space where 3D rotations actually live. Moreover, while the weight values in [26] are constrained by the Gaussian kernel, those in our method are not.

## 3. Proposed Method

#### 3.1. SMPL Module

#### 3.2. Weight-Regression Module

**Network structure.** The weight-regression module of HPR-Net generates weights for refining the target pose from an input noisy pose chunk. Figure 3 shows the detailed structure of the weight-regression module, which consists of 1D temporal convolution layers with a kernel size of 3, layer normalization [28], rectified linear unit activation, and a self-attention layer. Suppose that $\Phi ={\{{{\beta}}_{i},{{\theta}}_{\mathit{i}}\}}_{\mathit{i}=0}^{\mathit{N}-1}$, a chunk of length N taken from the noisy SMPL parameter sequence, is given. Here, ${{\beta}}_{\mathit{i}}\in {\mathbb{R}}^{10}$ and ${{\theta}}_{\mathit{i}}\in {\mathbb{R}}^{72}$ are the identity and pose parameters of the i-th frame, respectively. The pose parameter ${{\theta}}_{\mathit{i}}$ represents the 3D rotations of the 24 SMPL joints in axis-angle form. We first convert ${{\theta}}_{\mathit{i}}$ to the pose parameter ${{p}}_{i}\in {\mathbb{R}}^{96}$ in unit quaternion form. Before feeding it into the network, we apply frame-wise positional encoding to the unit quaternion pose chunk $\mathit{P}=[{{p}}_{0},\dots ,{{p}}_{\mathit{N}-1}]\in {\mathbb{R}}^{96\times \mathit{N}}$, similar to [12]. Specifically, to inject positional information into P, we concatenate the relative position index vector $[-\lfloor \frac{\mathit{N}}{2}\rfloor ,\dots ,-1,0,1,\dots ,\lfloor \frac{\mathit{N}}{2}\rfloor ]$ with P to construct $\tilde{P}\in {\mathbb{R}}^{97\times \mathit{N}}$ and feed the concatenated tensor into the weight-regression module. The weight-regression module first computes the temporal feature $\mathit{H}=[{{h}}_{0},{{h}}_{1},\dots ,{{h}}_{\mathit{N}-1}]\in {\mathbb{R}}^{24\times \mathit{N}}$ from $\tilde{P}$ through three 1D temporal convolution layers, where ${{h}}_{\mathit{i}}\in {\mathbb{R}}^{24}$ is the temporal feature of the i-th frame. A pose affinity vector ${w}\in {\mathbb{R}}^{\mathit{N}}$ is then generated through a non-local self-attention mechanism [12,13] as follows:
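As a concrete illustration, the chunk preparation and affinity-weight computation described above can be sketched in NumPy. The dot-product-softmax form of the attention, the use of the center frame as the query, and all function and variable names are our assumptions for illustration only; the convolution-layer features are replaced by a random stand-in.

```python
import numpy as np

def positional_encode(pose_chunk):
    """Concatenate a relative frame-index row to a pose chunk P (96 x N),
    producing the 97 x N input of the weight-regression module."""
    n = pose_chunk.shape[1]
    idx = np.arange(n) - n // 2              # [-floor(N/2), ..., floor(N/2)]
    return np.vstack([pose_chunk, idx[None, :]])

def affinity_weights(features, center):
    """Hypothetical non-local affinity: softmax over frames of the dot
    product between the center (target) frame's feature and every frame's
    feature. features: (24, N) temporal features H."""
    scores = features.T @ features[:, center]  # (N,) similarity scores
    e = np.exp(scores - scores.max())          # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
P = rng.standard_normal((96, 17))        # toy stand-in for a quaternion pose chunk
P_tilde = positional_encode(P)           # (97, 17)
H = rng.standard_normal((24, 17))        # stand-in for the conv features
w = affinity_weights(H, center=17 // 2)  # (17,), non-negative, sums to 1
```

The resulting `w` is exactly the kind of interpretable affinity vector that the weighted-averaging module consumes: one non-negative weight per frame.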

**Why do we use LayerNorm?** In our experiments, we observed that using layer normalization after the convolution layer yields higher performance than the commonly used batch normalization [29]. In our method, the 3D poses in an input pose chunk consist of 3D rotations represented in unit quaternion form, which lie geometrically on a 4D unit sphere. Layer normalization helps the weight-regression module learn by encouraging the features extracted through the convolution layers to lie on a unit sphere.

#### 3.3. Weighted-Averaging Module

**Pose refinement by weighted averaging.** Using ${w}$ generated by the weight-regression module, we perform weighted averaging on the input pose chunk P and obtain the refined pose ${y}\in {\mathbb{R}}^{96}$ as follows. Figure 4 shows the detailed structure of the weighted-averaging module. Weighted averaging cannot be applied directly to 3D rotations because they are defined in a non-Euclidean space. Therefore, following Gramkow's work [6], we obtain a second-order approximation of optimal rotation averaging by performing weighted averaging on unit quaternions. We first obtain $\tilde{{y}}$ by weighted averaging as follows:
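A minimal NumPy sketch of this weighted quaternion averaging is shown below. The hemisphere-alignment step (flipping each quaternion into the half-sphere of the first one, since q and −q encode the same rotation) is an assumed detail not stated in this excerpt, and the function name is ours.

```python
import numpy as np

def average_quaternions(quats, w):
    """Second-order approximation of rotation averaging in the spirit of
    Gramkow [6]: weighted sum of unit quaternions, then renormalization.
    quats: (4, N) unit quaternions for one joint; w: (N,) affinity weights."""
    # q and -q represent the same rotation; flip quaternions into the
    # hemisphere of the first one before summing (an assumed detail).
    signs = np.sign(quats[:, 0] @ quats)
    signs[signs == 0] = 1.0
    y_tilde = (quats * signs) @ w             # weighted sum, shape (4,)
    return y_tilde / np.linalg.norm(y_tilde)  # project back onto the unit sphere

# Averaging q with -q (the same rotation) recovers q:
q = np.array([1.0, 0.0, 0.0, 0.0])
quats = np.stack([q, q, -q], axis=1)          # (4, 3)
w = np.array([0.5, 0.3, 0.2])
y = average_quaternions(quats, w)
```

The final renormalization mirrors the normalization step of the weighted-averaging module, which guarantees that the refined pose again consists of valid unit quaternions.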

**Loss functions.** The refined 3D human pose ${y}$ is converted to axis-angle form and then fed into the SMPL module, along with the identity parameter ${\beta}$ estimated by other methods, to generate the refined mesh $\widehat{M}$ and 3D joints ${\widehat{X}}_{3\mathit{d}}=[{\widehat{{x}}}_{3\mathit{d},1},\dots ,{\widehat{{x}}}_{3\mathit{d},14}]$. The joint loss function ${\mathit{L}}_{\mathrm{joint}}$ for training the proposed network is defined as follows:
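Although Equation (4) is not reproduced in this excerpt, a joint supervision term of this kind typically reduces to a mean per-element distance between the regressed and ground-truth 3D joints; the L1 norm in the sketch below is an assumption, not the paper's stated choice.

```python
import numpy as np

def joint_loss(pred_joints, gt_joints):
    """Sketch of an L_joint-style term: mean L1 distance over the 14
    regressed 3D joints. pred_joints, gt_joints: (14, 3) arrays."""
    return float(np.mean(np.abs(pred_joints - gt_joints)))
```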

## 4. Experimental Results

#### 4.1. Datasets and Evaluation Metrics

#### 4.2. Implementation Details

#### 4.3. Ablation Study

**Pose chunk length.** To determine the optimal length N of the pose chunk, we train with various lengths and analyze the results. Table 1 shows the performance as a function of the input chunk length. HPR-Net performs best with length 17 on all metrics except PA-MPJPE; thus, we set the pose chunk length to 17.

**Various loss combinations.** In the proposed method, only 3D joints are supervised when training HPR-Net, using the joint loss function in Equation (4). To justify this choice, we investigate how various combinations of loss functions affect the performance of HPR-Net. Specifically, we perform direct supervision with the joint loss function ${\mathit{L}}_{\mathrm{joint}}$ together with losses that can be defined using the other outputs of HPR-Net. The mesh loss function ${\mathit{L}}_{\mathrm{mesh}}$ and the pose loss function ${\mathit{L}}_{\mathrm{pose}}$ are additionally defined as follows:

**Positional encoding.** Most non-local attention-based methods inject positional information into their input. HPR-Net performs positional encoding, which helps distinguish the pose of each frame in the input pose chunk. We investigate the effect of positional encoding, and of the encoding method, on the performance of HPR-Net. Table 3 shows the performance of HPR-Net for each positional encoding method. We train and evaluate three models: one without positional encoding (None), one with sinusoidal positional encoding following [12] (Sinusoidal), and one with the positional encoding used in the proposed method (Ours). Without positional encoding, HPR-Net shows decreased PA-MPJPE performance compared with VIBE, although the other metrics improve. Sinusoidal positional encoding improves the results and achieves the best PA-MPJPE. Our encoding method shows slightly worse PA-MPJPE than sinusoidal encoding but the best performance on all other metrics.
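The two encodings compared above can be sketched as follows. The shapes and the sinusoidal base of 10,000 follow the standard transformer formulation [12]; the function names and the channel dimension chosen for the sinusoidal variant are our assumptions.

```python
import numpy as np

def relative_index_encoding(n):
    """'Ours': a single channel of relative frame indices, centered on
    the target frame, concatenated to the pose chunk."""
    return (np.arange(n) - n // 2)[None, :].astype(float)   # (1, n)

def sinusoidal_encoding(n, d):
    """Transformer-style sinusoidal positional encoding [12]."""
    pos = np.arange(n)[:, None]                    # frame index
    i = np.arange(d)[None, :]                      # channel index
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles)).T  # (d, n)
```

The relative-index variant costs a single extra channel and makes the target (center) frame explicit with a zero, which may explain its behavior on the non-Procrustes metrics.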

**Layer normalization.** The weight-regression module is composed of simple 1D temporal convolution layers, with layer normalization adopted as the feature normalization layer. To justify this choice for HPR-Net, we train three models: one without feature normalization, one with batch normalization, and one with layer normalization. Table 4 compares the performance for each normalization method. With layer normalization, HPR-Net achieves the best performance on all metrics, indicating that layer normalization helps the weight-regression module learn.

#### 4.4. Refinement on State-of-the-Art Methods

#### 4.5. Comparison with Other Pose Refinement Methods

#### 4.6. Network Design Based on Non-Local Attention

#### 4.7. Qualitative Results

**Acceleration error improvement.** The quantitative results show that HPR-Net consistently and significantly improves the acceleration error across all methods and datasets. We also present this improvement qualitatively with a graph. Figure 6 shows the acceleration error of VIBE, SPIN, and MEVA, and of their refined results after applying HPR-Net to each method. The acceleration errors are calculated over every three consecutive frames of a video from 3DPW. Compared with the existing methods' results, HPR-Net effectively reduces the acceleration error for all methods. In particular, the acceleration error is significantly reduced in frames with high peaks, where the errors are most noticeable.
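The per-triplet acceleration error used above can be computed as the distance between second finite differences of the predicted and ground-truth joint trajectories. A sketch (any mm scaling or frame-rate factor is omitted, and the function name is ours):

```python
import numpy as np

def accel_error(pred, gt):
    """Mean acceleration error: second finite difference over every three
    consecutive frames, compared between prediction and ground truth.
    pred, gt: (T, J, 3) joint trajectories."""
    a_pred = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    a_gt = gt[2:] - 2 * gt[1:-1] + gt[:-2]
    return float(np.mean(np.linalg.norm(a_pred - a_gt, axis=-1)))
```

Because the metric depends only on second differences, a constant positional offset contributes nothing; it penalizes jitter rather than absolute position error.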

**Refinement result.** We present qualitative results showing that HPR-Net substantially refines 3D human pose sequences estimated by existing methods. Figure 7 and Figure 8 show the refined results for VIBE and SPIN, respectively. For each example, the top, middle, and bottom rows show the input image sequence, the estimation result of the existing method, and the refinement result of the proposed HPR-Net, respectively. We do not report qualitative results for MEVA because the SMPL estimates produced by MEVA's official code are projected incorrectly onto the image. In the topmost example of Figure 7, a pedestrian causes occlusion, so the pose of the target subject is estimated incorrectly. HPR-Net refines the result by reconstructing an appropriate pose using information from nearby frames. In the top-left example of Figure 8, SPIN predicts the global orientation incorrectly due to challenging illumination; this incorrect global orientation is well corrected by HPR-Net. In the remaining results, HPR-Net refines incorrect estimates of the arms and legs.

#### 4.8. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. ACM Trans. Graph. **2015**, 34, 1–16.
2. Kanazawa, A.; Zhang, J.Y.; Felsen, P.; Malik, J. Learning 3D human dynamics from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5614–5623.
3. Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5253–5263.
4. Luo, Z.; Golestaneh, S.A.; Kitani, K.M. 3D human motion estimation via motion compression and refinement. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
5. Hartley, R.; Trumpf, J.; Dai, Y.; Li, H. Rotation averaging. Int. J. Comput. Vis. **2013**, 103, 267–305.
6. Gramkow, C. On averaging rotations. J. Math. Imaging Vis. **2001**, 15, 7–16.
7. Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 561–578.
8. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985.
9. Kanazawa, A.; Black, M.J.; Jacobs, D.W.; Malik, J. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7122–7131.
10. Kolotouros, N.; Pavlakos, G.; Black, M.J.; Daniilidis, K. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 2252–2261.
11. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv **2014**, arXiv:1409.0473.
12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv **2017**, arXiv:1706.03762.
13. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
14. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27–28 October 2019.
15. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
16. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv **2020**, arXiv:2010.04159.
17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
18. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. arXiv **2020**, arXiv:2012.00364.
19. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 1691–1703.
20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv **2020**, arXiv:2010.11929.
21. Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 60–65.
22. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 483–499.
23. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112.
24. Moon, G.; Chang, J.Y.; Lee, K.M. PoseFix: Model-agnostic general human pose refinement network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7773–7781.
25. Ruggero Ronchi, M.; Perona, P. Benchmarking and error diagnosis in multi-instance pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 369–378.
26. Mall, U.; Lal, G.R.; Chaudhuri, S.; Chaudhuri, P. A deep recurrent framework for cleaning motion capture data. arXiv **2017**, arXiv:1712.03380.
27. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 36, 1325–1339.
28. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv **2016**, arXiv:1607.06450.
29. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 448–456.
30. Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 601–617.
31. Gower, J.C. Generalized procrustes analysis. Psychometrika **1975**, 40, 33–51.
32. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
33. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.

**Figure 1.** A 3D human mesh sequence estimated by VIBE (**top row**) and its refined result by our proposed method (**bottom row**). In the 3rd frame, VIBE fails to estimate the correct pose of the target person due to severe occlusion. Our method effectively refines the incorrectly estimated results.

**Figure 2.** Overall framework of the proposed method. The input to our model is a noisy 3D human pose sequence estimated by existing 3D human pose estimation methods. Our proposed HPR-Net refines the noisy 3D human pose sequence and generates a refined human pose sequence.

**Figure 3.** Detailed pipeline of the weight-regression module. ⊗ denotes matrix multiplication. First, the weight-regression module concatenates positional information to an input pose chunk. Second, the positionally encoded input chunk is fed into three 1D temporal convolution layers. Finally, a pose affinity vector is generated from the output temporal features of the convolution layers.

**Figure 4.** Detailed pipeline of the weighted-averaging module. ⊙ denotes element-wise multiplication with broadcasting, and $\Sigma$ denotes summation across the time dimension. The input pose vectors P are multiplied by the pose affinity weights ${w}$ generated by the weight-regression module, and the weighted pose vectors are summed to produce a refined pose vector $\tilde{{y}}$. To ensure that the refined pose parameters consist of unit quaternions, we additionally normalize $\tilde{{y}}$ to output a valid pose vector ${y}$.

**Figure 5.** Detailed pipelines of the multi-head structure (**a**), the linear projection structure (**b**), and our proposed HPR-Net's structure (**c**) for the network design experiment. We did not apply linear projection to the input pose chunk P in (**a**–**c**) because it should be averaged with the affinity weights. The attention head contains the affinity vector generation by self-attention and the weighted-averaging processes.

**Figure 6.** Comparison of acceleration error between HPR-Net and previous methods (VIBE, SPIN, and MEVA). HPR-Net effectively suppresses the acceleration error for all methods, even when there are very high error peaks.

**Figure 7.** Input images (**top**) and reconstruction results of VIBE (**middle**, gray SMPL mesh) and HPR-Net (**bottom**, yellow SMPL mesh) on the 3DPW dataset.

**Figure 8.** Input images (**top**) and reconstruction results of SPIN (**middle**, gray SMPL mesh) and HPR-Net (**bottom**, yellow SMPL mesh) on the 3DPW dataset.

**Table 1.** Performance comparison of HPR-Net according to different pose chunk lengths. Bold values indicate best results.

| Length | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|
| 9 | 82.14 | 51.82 | 98.25 | 7.31 |
| 17 | **81.10** | 51.26 | **97.13** | **6.94** |
| 33 | 82.11 | 51.63 | 98.26 | 18.36 |
| 65 | 81.23 | **50.97** | 97.24 | 8.19 |
| 129 | 81.81 | 51.35 | 97.89 | 11.69 |

**Table 2.** Performance comparison of HPR-Net according to various combinations of loss functions (✓ = $1.0$, blank = $0.0$). Bold values indicate best results.

| ${\mathit{\lambda}}_{\mathit{j}}$ | ${\mathit{\lambda}}_{\mathit{m}}$ | ${\mathit{\lambda}}_{\mathit{p}}$ | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|---|---|
|  |  | ✓ | 85.60 | 55.03 | 102.07 | 12.39 |
|  | ✓ |  | 81.29 | **51.04** | 97.33 | 7.72 |
| ✓ |  |  | **81.10** | 51.26 | **97.13** | **6.94** |
|  | ✓ | ✓ | 84.37 | 54.12 | 100.76 | 10.38 |
| ✓ | ✓ |  | 81.46 | 51.16 | 97.48 | 9.79 |
| ✓ |  | ✓ | 85.64 | 55.20 | 102.12 | 12.11 |
| ✓ | ✓ | ✓ | 83.76 | 53.39 | 100.06 | 14.12 |

**Table 3.** Comparison of refinement performance of HPR-Net according to the positional encoding method. Bold values indicate best results.

| Methods | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|
| None | 81.63 | 52.00 | 97.72 | 6.97 |
| Sinusoidal | 81.53 | **51.15** | 97.58 | 8.42 |
| Ours | **81.10** | 51.26 | **97.13** | **6.94** |

**Table 4.** Comparison of refinement performance of HPR-Net according to the feature normalization method. Bold values indicate best results.

| Methods | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|
| None | 82.12 | 51.84 | 98.17 | 7.76 |
| BatchNorm | 82.66 | 52.07 | 98.81 | 12.93 |
| LayerNorm | **81.10** | **51.26** | **97.13** | **6.94** |

**Table 5.** HPR-Net's pose refinement performance for various existing methods on 3DPW test data. Bold values indicate performance improvements.

| Methods | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|
| VIBE | 82.28 | 51.72 | 98.42 | 20.69 |
| VIBE + HPR-Net | **81.10** | **51.26** | **97.13** | **6.94** |
| SPIN | 102.46 | 60.05 | 129.22 | 29.78 |
| SPIN + HPR-Net | **100.95** | **59.30** | **127.58** | **8.19** |
| MEVA | 85.81 | 53.54 | 102.18 | 14.37 |
| MEVA + HPR-Net | **85.43** | **53.50** | **101.79** | **6.63** |

**Table 6.** HPR-Net's pose refinement performance for various existing methods on Human3.6M test data. Bold values indicate performance improvements.

| Methods | MPJPE ↓ | PA-MPJPE ↓ | Accel-Error ↓ |
|---|---|---|---|
| VIBE | 78.35 | 53.58 | 9.76 |
| VIBE + HPR-Net | **77.77** | **53.17** | **2.13** |
| SPIN | 68.22 | 46.16 | 14.21 |
| SPIN + HPR-Net | **67.35** | **45.53** | **2.74** |
| MEVA | 73.64 | 48.48 | 7.22 |
| MEVA + HPR-Net | **73.06** | **48.06** | **1.83** |

**Table 7.** Comparison of refinement performance between HPR-Net and other pose sequence refinement methods on the 3DPW dataset. Bold values indicate best results.

| Methods | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|
| SLERP | 82.72 | 52.13 | 99.88 | 12.38 |
| HPR-Gaussian | 82.15 | 51.58 | 98.30 | 18.04 |
| HPR-DR | 183.01 | 102.79 | 223.20 | 14.28 |
| HPR-Net | **81.10** | **51.26** | **97.13** | **6.94** |

**Table 8.** Comparison of refinement performance according to the network design of HPR-Net. Bold values indicate best results.

| Methods | MPJPE ↓ | PA-MPJPE ↓ | MPVE ↓ | Accel-Error ↓ |
|---|---|---|---|---|
| MHA | 84.00 | 53.20 | 99.94 | 7.71 |
| SHA | 84.13 | 53.52 | 100.44 | 7.49 |
| HPR-Net | **81.10** | **51.26** | **97.13** | **6.94** |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kim, D.-Y.; Chang, J.-Y.
Attention-Based 3D Human Pose Sequence Refinement Network. *Sensors* **2021**, *21*, 4572.
https://doi.org/10.3390/s21134572
