Article

Deep Spatial-Temporal Neural Network for Dense Non-Rigid Structure from Motion

Yaming Wang, Minjie Wang, Wenqing Huang, Xiaoping Ye and Mingfeng Jiang
1 Pattern Recognition and Computer Vision Lab, Zhejiang Sci-Tech University, Hangzhou 310000, China
2 Key Laboratory of Digital Design and Intelligent Manufacture in Culture & Creativity Product of Zhejiang Province, Lishui University, Lishui 323000, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(20), 3794; https://doi.org/10.3390/math10203794
Submission received: 6 September 2022 / Revised: 28 September 2022 / Accepted: 1 October 2022 / Published: 14 October 2022
(This article belongs to the Special Issue Computer Vision and Pattern Recognition with Applications)

Abstract: Dense non-rigid structure from motion (NRSfM) has long been a challenge in computer vision because of the vast number of feature points. As neural networks develop rapidly, a novel solution is emerging. However, existing methods ignore the significance of spatial–temporal data and the strong capacity of neural networks for learning. This study proposes a deep spatial–temporal NRSfM framework (DST-NRSfM) and introduces a weighted spatial constraint to further optimize the 3D reconstruction results. Layer normalization layers are applied in dense NRSfM tasks to stop gradient disappearance and hasten neural network convergence. Our DST-NRSfM framework outperforms both classical approaches and recent advancements. It achieves state-of-the-art performance across commonly used synthetic and real benchmark datasets.

1. Introduction

Non-rigid structure from motion (NRSfM) recovers the non-rigid 3D structure and camera pose from given monocular 2D point trajectories. It is ill-posed because of its inherent ambiguities and requires motion and deformation cues as well as prior assumptions to achieve 3D reconstruction. Moreover, it can be further subdivided into sparse NRSfM and dense NRSfM. Dense NRSfM is more complex and advances more slowly because it must reconstruct thousands of times more feature points than sparse NRSfM. Most algorithms that address the dense NRSfM issue [1,2,3,4,5] extend sparse reconstruction approaches and generally achieve unsatisfactory results. Recent developments have been made in dense NRSfM-specific algorithms [6,7,8,9]. Additionally, the gradual introduction of neural networks into the NRSfM area has provided new insights for resolving the dense NRSfM issue, although current work focuses primarily on sparse NRSfM.
The first 3D reconstruction of dense non-rigid objects from motion using an unsupervised neural network was completed by Sidhu et al. [9] in 2020, who named their framework neural non-rigid structure from motion (N-NRSfM). They developed a neural network framework for fitting deformations, employed latent space constraints for dense NRSfM, and obtained excellent reconstruction results. However, the final results rely heavily on conventional algorithms that must first derive a shape prior, because the reconstruction requires a rigid three-dimensional prior. This severely restricts the neural network's ability to learn, thereby rendering the approach less robust.
This study presents a deep spatial–temporal NRSfM framework (DST-NRSfM) to address the dense NRSfM problem. The proposed DST-NRSfM is trained in an entirely unsupervised manner, drawing on the auto-encoder [10,11] principle; it uses layer normalization layers to prevent gradient disappearance and accelerate neural network convergence, and it applies temporal–spatial constraints to improve the accuracy of the 3D reconstruction output. The complete network model is divided into two modules: shape estimation and rotation estimation. These two modules, respectively, derive the 3D structure and the camera pose of the reconstructed object from an input 2D point trajectory matrix. The shape-estimation module abandons the shape prior and builds the most appropriate end-to-end network model for dense objects based on the properties of distinct datasets. In the rotation-estimation module, considering the particularities of the NRSfM problem, we implemented a rotation matrix corrector after the rotation estimation network. This corrector further constrains the resulting rotation matrix to be orthogonal and eliminates its potential reflection ambiguity. Experiments demonstrated that our proposed method outperforms rival methods and achieves state-of-the-art results on the most popular dense datasets. What follows is a summary of the primary contributions of this study.
  • We propose a novel deep neural network framework called DST-NRSfM to address the dense NRSfM challenge, which is superior for reconstructing objects with significant deformations because it does not require shape priors.
  • We introduce the weighted spatial constraint to minimize the reconstruction error, improve mesh quality, and ensure that the reconstructed item is closer to the original shape. The second set of comparison experiments in Section 4.2 demonstrates the success of the weighted spatial constraint.
  • The proposed technique achieves state-of-the-art performance on the most popular dense benchmark datasets. Section 4.3 provides evidence to support this claim.
The remainder of this study is organized as follows. First, Section 2 discusses related work. Section 3 introduces problem settings, reviews the NRSfM problem and subsequently provides a detailed description of our network architecture. Section 4 details the experiments, presents the results, and analyzes the model’s overall performance. Section 5 summarizes the proposed approach.

2. Related Works

NRSfM. Utilizing prior knowledge of shape and camera motion is a popular way of addressing the inherently ambiguous NRSfM problem. Typically, each frame's 3D shape is assumed to be composed of a few shape bases [12] or trajectory bases [13,14], and methods such as probabilistic principal component analysis (PPCA) [15], metric constraints [16], complementary space models [17], block sparse dictionary learning [18], or force-based models [19] are employed to enhance the outcomes of the non-rigid 3D reconstruction produced via factorization. However, because the ideal number of bases is typically unpredictable (different bases occur in various sequences), this strategy is not robust. Recently, NRSfM has advanced significantly as the BMM algorithm [20], isometric priors [21], elastic priors [22], local-rigidity priors [23,24], space–time smoothness constraints [25], variational methods [7], and tensor-based models [26] have been successively proposed.
Dense NRSfM. Reconstruction is more challenging for dense NRSfM, an extension of the NRSfM problem, owing to the significantly greater number of key points that must be recovered. To address this issue, Collins et al. [27] suggested a segmentation approach that first uses a straightforward local model to reconstruct local patches and then stitches all local reconstruction results together to create a smooth surface; however, stitching the local reconstructions together is itself a hurdle. A template-based approach for reconstructing dense surfaces was proposed by Bartoli et al. [28] in 2012; however, in this approach, an appropriate 3D template must first be selected. With the proliferation of NRSfM problem-solving techniques, the initial sparse reconstruction methods [1,2,3,4,5] were extended to address the dense NRSfM problem; however, the outcomes remain subpar. Currently, an increasing number of studies focus on solving the dense NRSfM problem directly; these include the following: (1) Agudo et al. [6] proposed modeling time-varying shapes using a probabilistic linear subspace of modal shapes obtained from continuum mechanics. (2) Garg et al. [7] treated the dense NRSfM problem as a global variational energy minimization problem by estimating the per-frame shapes and camera motion matrices. (3) Kumar et al. [8,29,30] assumed that non-rigid deformations span a union of local linear subspaces in space and time, and used Grassmann manifolds to represent shapes as a set of smooth low-dimensional surfaces embedded in a higher-dimensional Euclidean space, thus providing a practical solution and new theoretical insights for dense NRSfM. (4) Sidhu et al. [9] completed 3D reconstruction of dense non-rigid motion with an unsupervised neural network; excellent reconstruction results were obtained with the auto-decoder deformation model and latent-space constraints, but their method is not robust because it strongly depends on the mean shape.
NRSfM based on neural networks. Although the application of neural networks to dense NRSfM is still new, they have become crucial in solving numerous NRSfM problems. In 2018, Cha et al. [31] developed an unsupervised 3D reconstruction network that could accurately predict the 3D results of omitted instances from the same category, and successfully reconstructed instances of a given object category from 2D features using an orthographic camera model. In 2019, Facebook presented the C3DPO algorithm [32], which introduced a fresh regularization technique and substituted neural networks for traditional matrix decomposition. The deep NRSfM and deep NRSfM++ algorithms [33,34] interpret NRSfM as multilayer sparse coding, replacing the traditional low-rank assumption [18,35] with hierarchical block sparsity, which is more robust and interpretable than single-level block sparsity. In 2020, Park et al. [36,37], inspired by the Procrustean alignment scheme [38], created a Procrustean regression network with a loss function that can automatically identify the correct rotation matrix. In 2021, Zeng et al. [39] first introduced a residual-recursive network (RRN) and two new loss compositions: pairwise contrast and pairwise consistency losses; the RRN alone can accurately reconstruct non-rigid shapes, and the two new losses further improve the reconstruction results. Wang et al. [5] used a 3D deep auto-encoder framework as the NRSfM prior and successfully recovered the 3D shape from a group of 2D images with no temporal ordering and large deformations. Recently, Deng et al. [40] introduced a new technique for exploiting subspace joint structures, thereby enabling end-to-end batch training, and proposed a sequence-to-sequence translation perspective for deep NRSfM.
Although neural-network-based NRSfM has advanced significantly in recent years, it has focused primarily on the 3D reconstruction of sparse objects. There is still room for the development of dense NRSfM based on neural networks.

3. Methods

This section presents the problem setting of this study, reviews the classical formulation of the NRSfM problem to better explain the design rationale, and subsequently describes our method in detail.

3.1. Preliminary

Problem setup. A network framework capable of solving dense NRSfM with deep neural networks and without shape priors is needed. Specifically, given a dense non-rigid 2D trajectory matrix $\mathbf{W} \in \mathbb{R}^{2F \times P}$, which contains the 2D coordinates of $P$ key-points over $F$ frames, we want to recover the dense non-rigid 3D structure $\mathbf{S} \in \mathbb{R}^{3F \times P}$.
Revisiting NRSfM. NRSfM aims to recover the 3D structure of a non-rigid object and the camera pose from a monocular 2D image sequence. For orthographic cameras, it can be represented by the following linear equation:
$$\mathbf{W} = \mathbf{R}_{xy}\mathbf{S} + \mathbf{t}_{xy},$$
where $\mathbf{W} \in \mathbb{R}^{2F \times P}$ represents a 2D image sequence with $P$ key-points in $F$ frames and stacks the following $2 \times P$ matrices for $i = 1, \dots, F$:
$$\mathbf{W}_i = \begin{bmatrix} x_{i1} & \cdots & x_{iP} \\ y_{i1} & \cdots & y_{iP} \end{bmatrix},$$
where $(x_{ij}, y_{ij})^T$ denotes the 2D coordinates of the $j$-th key-point captured in the $i$-th frame; $\mathbf{S} \in \mathbb{R}^{3F \times P}$ is the stack of the 3D shapes $\mathbf{S}_i$ of the reconstructed object for each frame, where $\mathbf{S}_i$ contains the 3D coordinates of every key-point in the $i$-th frame,
$$\mathbf{S}_i = \begin{bmatrix} x_{i1} & \cdots & x_{iP} \\ y_{i1} & \cdots & y_{iP} \\ z_{i1} & \cdots & z_{iP} \end{bmatrix};$$
and $\mathbf{R}_{xy} \in \mathbb{R}^{2F \times 3F}$ and $\mathbf{t}_{xy} \in \mathbb{R}^{2F}$ are the $xy$ components of a rigid transformation, with
$$\mathbf{R}_{xy} = \begin{bmatrix} \mathbf{R}_1 & & \\ & \ddots & \\ & & \mathbf{R}_F \end{bmatrix},$$
where $\mathbf{R}_i$ consists of the first two rows of the $3 \times 3$ rotation matrix of the $i$-th frame. The translation parameter $\mathbf{t}_{xy}$ of the camera can be eliminated by centering when all feature points are visible; the experiments in this study assume that the camera translation component has been eliminated.
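As a concrete illustration of the centering step, the following minimal Python sketch (assuming $\mathbf{W}$ is stored as a $2F \times P$ NumPy array; the function name is ours) subtracts each frame's centroid, which removes $\mathbf{t}_{xy}$ when all key-points are visible:

```python
import numpy as np

def center_measurements(W):
    """Subtract the per-frame centroid from a (2F, P) measurement matrix.

    When every key-point is visible, this removes the translation t_xy,
    leaving the purely rotational orthographic model W = R_xy @ S.
    """
    # Each row holds the x- or y-coordinates of one frame, so centering
    # rows centers each frame's 2D point cloud.
    return W - W.mean(axis=1, keepdims=True)

# Toy usage with F = 2 frames and P = 4 points.
W = np.random.rand(4, 4)
W_centered = center_measurements(W)
assert np.allclose(W_centered.mean(axis=1), 0.0)
```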
When the reconstruction target of the NRSfM problem is a dense non-rigid object, both global and local features must be considered, and the required computation increases dramatically, thereby complicating dense NRSfM. Kumar et al. [8,29,30] introduced the Grassmann manifold to address this problem, whereas Sidhu et al. [9] combined a mean shape with a neural network. This study develops an unsupervised deep neural network based on the basic theory of NRSfM. This network can deeply mine the relationships between frames and between points in a dense video stream and leverage the learning capability of the neural network to achieve better reconstruction results.

3.2. Deep Spatial-Temporal NRSfM

The architecture of the DST-NRSfM model, which operates end-to-end without requiring any ground-truth data for supervision, is shown in Figure 1. It takes the 2D feature point trajectory matrix $\mathbf{W}$ acquired by video optical flow processing as input and outputs the 3D structure $\mathbf{S}$ and the camera rotation matrix $\mathbf{R} \in \mathbb{R}^{3F \times 3F}$ (shown in Figure 2). First, the reconstructor provides the initial reconstruction result; subsequently, we further optimize the reconstruction in time and space, exploiting the fact that adjacent feature points of dense reconstruction objects are highly coupled and that neighboring frames change only modestly. The outcomes of the second experiment in Section 4.2 indicate the effectiveness of the proposed spatial and temporal constraints.

3.2.1. Shape Estimator in Reconstructor

An illustration of the proposed shape estimator is shown in Figure 2a; it is inspired by the typical auto-encoder neural network architecture and N-NRSfM [9]. However, unlike N-NRSfM, our network structure fully utilizes the neural network's inherent learning capacity. With the addition of a feature encoder $f_e$ and the removal of the shape prior that leads to over-dependence issues, the complete network can learn how to reconstruct the 3D structure of a non-rigid object from its 2D trajectory matrix $\mathbf{W}$. The number $B$ of neurons in the middlemost hidden layer of the whole network framework (i.e., the output layer of $f_e$) is set significantly lower than that of the input and output layers, so that the shape estimator first obtains a low-dimensional expression of the input data through $f_e$ and then uses the feature decoder $f_d$ to reconstruct the 3D structure from this low-dimensional expression. This successfully reduces the difficulty of the solution caused by the dense input data and gives the entire shape-estimation network a certain level of noise resistance. The entire shape-estimation process is summarized as follows:
$$\mathbf{S}_i = f_d\left(f_e(\mathbf{W}_i)\right),$$
where $\mathbf{W}_i$ is reshaped into a $1 \times 2P$ feature tensor before being fed into the fully connected layers.
Network structure of the feature decoder $f_d$. To improve the ability of the neural network to fit complex functions and better reconstruct the 3D structure of dense objects, we expanded the number of network layers of the deformation auto-decoder in [9], which is a series of nine fully connected layers with small hidden dimensions (2, 8, 8, 8, 16, 32, 32, 32, $S_i$). We set the maximum number of nodes per layer (except for the output layer) to 1024, used exponential linear units (ELU) as the activation units (except for the last layer), and added a layer normalization (LN) layer between the output of each layer and its activation function. To alleviate the network degradation induced by deepening the network, we set the network to its residual form: when the number of neurons in the middle layer is $B = 1$, the entire network can be divided into four residual blocks, and when $B = 64$, the entire network forms one large residual block. The main architecture of $f_d$ is shown in Figure 3.
Network structure of the feature encoder $f_e$. The feature encoder $f_e$ is comparable to the feature decoder $f_d$ in that it uses a fully connected residual network with an LN layer added before the output of each layer enters the ELU activation function. The layer widths taper in a cone-like fashion throughout the encoder: the input layers can have up to 1024 nodes, whereas the output layer narrows down to $B$, the number of neurons in the middlemost layer, which is determined by the features of the various datasets. $f_e$ substitutes for the original shape prior and enhances the neural network's ability to recover highly deformed objects, as demonstrated by the experimental results on the pants dataset in Section 4.3. The architecture of $f_e$ is shown in Figure 3.
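To make the encoder–decoder structure concrete, the following PyTorch sketch shows a Linear → LN → ELU residual block and a simplified shape estimator built from such blocks. The class names, layer widths, and number of blocks here are illustrative only; the exact widths and residual-block layout follow Figure 3 and depend on $B$.

```python
import torch
import torch.nn as nn

class ResidualFC(nn.Module):
    """Fully connected block (Linear -> LayerNorm -> ELU) with a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.ln = nn.LayerNorm(dim)
        self.act = nn.ELU()

    def forward(self, x):
        return x + self.act(self.ln(self.fc(x)))

class ShapeEstimator(nn.Module):
    """Encoder f_e compresses a 1 x 2P trajectory to B values; decoder f_d lifts it to 3P."""
    def __init__(self, num_points, bottleneck=64, width=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * num_points, width), nn.LayerNorm(width), nn.ELU(),
            ResidualFC(width),
            nn.Linear(width, bottleneck),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, width), nn.LayerNorm(width), nn.ELU(),
            ResidualFC(width),
            nn.Linear(width, 3 * num_points),   # last layer: no LN or activation
        )

    def forward(self, W):                        # W: (F, 2P), the whole sequence
        return self.decoder(self.encoder(W))     # S: (F, 3P)
```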
Configuration for the middlemost hidden layer. The size of the number B of neurons in the middlemost hidden layer represents the low-dimensional expression of the reconstructed object, and modifying B influences the function fitting ability of the neural network as the number of neural network parameters changes. A reasonable selection of B allows us to better reconstruct the target object. When B is extremely low, the original information is lost, whereas if B is set extremely high, the reconstruction result is easily disrupted by noise in the input data and severely deformed objects cannot be recovered.
However, the number of neurons in the middlemost hidden layer is not a fixed value; it changes with the characteristics of different datasets, and a suitable value can only be obtained from experience or trials. In this study, B is chosen for each type of dataset based on the features of the various datasets and on trials. The settings of B for each dataset are presented in Section 4.
Layer normalization layers. The 2D trajectory matrix of the entire sequence serves as the input to our network to maintain the temporal information of the input sequence, which makes neural network training more challenging. Additionally, although choosing the ELU activation function eliminates the issue of network mean shift, soft saturation in the negative range also contributes to gradient disappearance and sluggish convergence during model training. To address the aforementioned issues, we introduced an LN layer [41] according to the network topology and properties of the incoming data. After every layer of the shape-estimation network (except the last layer), we added LN layers for data normalization. The output of each layer of the shape-estimation network, following the addition of the LN layer, can be stated as follows, assuming that the activation function is f.
$$y = f\!\left(\frac{x - \mathrm{E}(x)}{\sqrt{\mathrm{var}(x) + \varepsilon}} \times g + b\right),$$
where $\varepsilon$ is a small constant that prevents the denominator from being zero. Here, the gain $g$ and bias $b$ ensure that the normalization process does not wash out the preceding information; both are learned by the neural network during model training.
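The following short check (a sketch, not part of the framework) confirms that the formula above matches PyTorch's nn.LayerNorm, whose learnable weight and bias play the roles of the gain g and bias b:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)                 # a batch of hidden activations
ln = nn.LayerNorm(16)                  # learnable gain (weight) and bias, initialized to 1 and 0

manual = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
    x.var(-1, unbiased=False, keepdim=True) + ln.eps)
manual = manual * ln.weight + ln.bias  # apply gain g and bias b

assert torch.allclose(ln(x), manual, atol=1e-5)
```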

3.2.2. Rotation Estimation Network in Reconstructor

Figure 2b shows the rotation estimator module in the DST-NRSfM model, which can be divided into two parts: a rotation estimation network and a rotation matrix corrector. The two parts collaborate to deduce the rotation matrix $\mathbf{R}$. The rotation estimation network consists of only eight fully connected layers, each followed by an ELU (except the last layer), and can be represented by the following formula:
$$\hat{\mathbf{R}}_i = g_\omega(\mathbf{W}_i).$$
Because the input data are captured by an orthographic camera, a reflection ambiguity exists: if the rotation matrix of the $i$-th frame is assumed to be $\mathbf{R}_i$, then both $\mathbf{R}_i$ and $-\mathbf{R}_i$ are valid rotation matrices, but different choices of sign result in different recovered shapes; that is, both $\mathbf{S}_f = [\mathbf{S}_f^x; \mathbf{S}_f^y; \mathbf{S}_f^z]$ and $\mathbf{S}_f = [\mathbf{S}_f^x; \mathbf{S}_f^y; -\mathbf{S}_f^z]$ may appear when reconstructing with the estimated rotation matrix. Therefore, we address these problems by adding a rotation matrix corrector after the rotation estimation network. The entire correction process is as follows.
In Figure 4, ① represents the singular value decomposition (SVD) of $\hat{\mathbf{R}}_i$:
$$\hat{\mathbf{R}}_i = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T;$$
② imposes an orthogonality constraint, that is, $\tilde{\mathbf{R}}_i \tilde{\mathbf{R}}_i^T = \mathbf{I}$, on the rotation matrix:
$$\tilde{\mathbf{R}}_i \leftarrow \mathbf{U}\mathbf{V}^T;$$
③ resolves the reflection ambiguity by projecting the estimated rotation matrix onto the special orthogonal group with $\det(\mathbf{R}_i) = 1$:
$$\mathbf{R}_i \leftarrow \mathbf{U} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det(\tilde{\mathbf{R}}_i) \end{bmatrix} \mathbf{V}^T.$$
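A minimal PyTorch sketch of the three corrector steps is given below, assuming the raw output of the rotation estimation network has been reshaped into an (F, 3, 3) tensor; the function name is ours:

```python
import torch

def correct_rotations(R_hat):
    """Project raw 3x3 estimates onto SO(3): SVD, orthogonalize, then force det = +1.

    R_hat: (F, 3, 3) raw per-frame rotation estimates from the rotation network.
    """
    U, _, Vt = torch.linalg.svd(R_hat)          # step 1: singular value decomposition
    R_tilde = U @ Vt                            # step 2: nearest orthogonal matrix
    det = torch.det(R_tilde)                    # step 3: remove the reflection ambiguity
    D = torch.diag_embed(torch.stack(
        [torch.ones_like(det), torch.ones_like(det), det], dim=-1))
    return U @ D @ Vt

R = correct_rotations(torch.randn(5, 3, 3))
assert torch.allclose(torch.det(R), torch.ones(5), atol=1e-4)
```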
Finally, the 3D structure $\mathbf{S}_i$ estimated by the shape estimator network is reprojected to a 2D shape using the rotation matrix $\mathbf{R}_i$ to measure the agreement between the projection of the 3D reconstruction result and the given 2D feature points. The reprojection loss is calculated as follows:
$$L_{reproj} = \sum_{i=1}^{F} \left\| \mathbf{W}_i - \mathbf{P} \mathbf{R}_i \mathbf{S}_i \right\|_{\varepsilon},$$
where $\|\cdot\|_{\varepsilon}$ denotes the Huber loss:
$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2, & \text{if } |e| \le \delta, \\ \delta \cdot \left(|e| - \frac{1}{2}\delta\right), & \text{otherwise}, \end{cases}$$
where $\delta = 1$ and $e$ usually stands for the residual. Because we performed the experiments using the orthographic camera model, the projection matrix $\mathbf{P}$ can be expressed as:
$$\mathbf{P} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$
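A hedged sketch of this loss is shown below, assuming $\mathbf{W}$, $\mathbf{R}$, and $\mathbf{S}$ are stored as (F, 2, P), (F, 3, 3), and (F, 3, P) tensors and that the Huber penalty is applied element-wise and summed; the function name is ours:

```python
import torch
import torch.nn.functional as F

# Orthographic projection: keep the x and y rows of each rotated shape.
P_proj = torch.tensor([[1., 0., 0.],
                       [0., 1., 0.]])

def reprojection_loss(W, R, S, delta=1.0):
    """Huber reprojection loss between observed 2D tracks and reprojected 3D shapes.

    W: (F, 2, P) observed 2D key-points, R: (F, 3, 3) rotations, S: (F, 3, P) shapes.
    """
    W_hat = P_proj @ R @ S                      # project each reconstructed frame to 2D
    return F.huber_loss(W_hat, W, delta=delta, reduction='sum')
```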

3.3. Temporal Constraint and Spatial Constraint

Because the object in any dense NRSfM problem consists of dense feature points, we can assume that points in a local neighborhood move in the same direction and that most points do not change significantly between adjacent frames. Hence, we introduce temporal and spatial constraints to further improve the reconstruction outcomes.
Temporal constraint. The temporal constraint, also known as time-smoothing loss, limits the 3D trajectory smoothness of dense non-rigid reconstructions from the temporal dimension, as follows.
$$L_{temp} = \sum_{f=1}^{F-1} \left\| \mathbf{S}_{f+1} - \mathbf{S}_f \right\|_{\varepsilon}.$$
Weighted spatial constraint. The spatial constraint aims to enforce the smoothness of the 3D structure in each frame along the spatial dimension. Although the spatial smoothing described by Sidhu et al. [9] can achieve this effect, their formulation causes undesirable deformations. Herein, we present a weighted spatial constraint to improve the reconstruction performance. It guarantees that the points in the local neighborhood of any vertex have the same motion trend and that the local details of the reconstructed object are preserved; both are crucial for dense NRSfM, as they bring the reconstruction result closer to the true shape without deformation while ensuring smoothness. $L_{w\_spat}$ is given as:
$$L_{w\_spat} = \sum_{f=1}^{F} \sum_{i \in \mathbf{S}_f} \left\| i - \frac{\sum_{j \in N(i)} \omega_{ij} \cdot j}{\sum_{j \in N(i)} \omega_{ij}} \right\|_{\varepsilon},$$
$$\omega_{ij} = \frac{1}{2}\left(\cot \alpha_{ij} + \cot \beta_{ij}\right),$$
where $N(i)$ denotes the set of all vertices in the 1-ring neighborhood of vertex $i$, $i \in \mathbf{S}_f$; and $\alpha_{ij}$ and $\beta_{ij}$ are the angles opposite edge $ij$ in the two adjoining triangular patches.
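A minimal sketch of both constraints is given below. It assumes each frame's shape is stored as a (P, 3) tensor, the Huber penalty is applied element-wise, and the cotangent weights $\omega_{ij}$ have been precomputed from the mesh template into a dense (P, P) matrix that is zero outside each vertex's 1-ring neighborhood; the function names are ours.

```python
import torch
import torch.nn.functional as F

def temporal_loss(S, delta=1.0):
    """Time-smoothing term: penalize shape changes between consecutive frames. S: (F, P, 3)."""
    diff = S[1:] - S[:-1]
    return F.huber_loss(diff, torch.zeros_like(diff), delta=delta, reduction='sum')

def weighted_spatial_loss(S, W_cot, delta=1.0):
    """Weighted spatial term: pull each vertex toward the cotangent-weighted mean of its 1-ring.

    S:     (F, P, 3) per-frame vertex positions.
    W_cot: (P, P) precomputed cotangent weights, zero outside the 1-ring neighborhoods.
    """
    neighbor_avg = (W_cot @ S) / W_cot.sum(dim=1).clamp(min=1e-8).view(1, -1, 1)
    resid = S - neighbor_avg
    return F.huber_loss(resid, torch.zeros_like(resid), delta=delta, reduction='sum')
```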

3.4. Alternative Training

The final training loss function L of DST-NRSfM is:
$$L = \lambda_1 L_{reproj} + \lambda_2 L_{temp} + \lambda_3 L_{w\_spat},$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weight coefficients that balance the influence of each loss term on the final structure. We found that alternately training the networks with $L_{temp}$ and $L_{w\_spat}$ yields superior results.
Finally, the priors required to solve the dense NRSfM problem are significantly reduced, and a suitable shape-estimation network also reduces the required complexity of the rotation estimation network.

4. Experiments

4.1. Implementation Details

Datasets. The following three types of datasets were used. (1) Widely used dense benchmarks with 3D ground truth $\mathbf{S}_{gt}$: actor mocap (100 frames with 36,349 points) [42], synthetic faces syn3 and syn4 with different camera trajectories (99 frames with 28,000 points) [7], and the Kinect paper (193 frames with 58,000 points) and t-shirt (313 frames with 77,000 points) sequences [43]. (2) Relatively sparse datasets with 3D ground-truth geometry: expressions (384 frames with 997 points) [44] and pants (291 frames with 2859 points) [45], which are significantly deformed. (3) Real video sequences: back (150 frames with 20,474 points) [46], heart surgery (201 frames with 55,115 points) [47], and real face (120 frames with 28,332 points) [7].
Evaluation indicator. We utilized the normalized error (NE) to assess the algorithm's performance on datasets with 3D ground truth, defined as
$$e_{3D} = \frac{1}{F}\sum_{f=1}^{F} \frac{\left\| \mathbf{S}_{est}^{f} - \mathbf{S}_{gt}^{f} \right\|_F}{\left\| \mathbf{S}_{gt}^{f} \right\|_F},$$
where $\mathbf{S}_{gt}^{f}$ denotes the actual 3D structure of the $f$-th frame of the reconstructed object, and $\mathbf{S}_{est}^{f}$ is the $f$-th frame of our estimated 3D structure, aligned to the ground truth using the Procrustes algorithm.
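A sketch of this metric is shown below, with each frame stored as a (P, 3) NumPy array. It aligns the estimate to the ground truth with an orthogonal Procrustes step (rotation and translation only; whether a scale is also estimated in the original evaluation is not stated, so this is an assumption), then averages the relative Frobenius errors:

```python
import numpy as np

def procrustes_align(S_est, S_gt):
    """Rigidly align S_est (P, 3) to S_gt (P, 3) via orthogonal Procrustes."""
    A, B = S_est - S_est.mean(0), S_gt - S_gt.mean(0)
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    return A @ R + S_gt.mean(0)

def normalized_error(S_est_seq, S_gt_seq):
    """Mean relative Frobenius error e_3D over all frames."""
    errs = []
    for S_est, S_gt in zip(S_est_seq, S_gt_seq):
        S_aligned = procrustes_align(S_est, S_gt)
        errs.append(np.linalg.norm(S_aligned - S_gt) / np.linalg.norm(S_gt))
    return float(np.mean(errs))
```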
Training Details. We employed Rprop [48] as the optimizer with a learning rate of 0.0001 because we input the entire dataset during training to effectively retain the spatio-temporal information. All proposed network frameworks were initialized with He initialization [49]. The experiments set the weights to $\lambda_1 = 1000$, $\lambda_2 = 1$, and $\lambda_3 = 1 \times 10^{-5}$ and trained for 60,000 epochs. The number $B$ of neurons in the middlemost layer of the shape-estimation network, which is affected by the essential characteristics of the given dataset, the extent of deformation, the amount of noise, and the degree of density, was selected empirically within the range $[1, 64]$ throughout Section 4.2 and Section 4.3. In general, the larger the noise and deformation of a dataset, the closer B is to 1, and the sparser the dataset, the smaller the value of B.
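A sketch of the corresponding training loop is shown below, using the stated Rprop optimizer, He initialization, loss weights, and epoch count. Here `model` is a placeholder that maps the 2D trajectory matrix to (S, R), and the three loss callables stand for the reprojection, temporal, and weighted spatial terms sketched earlier:

```python
import torch
import torch.nn as nn

def he_init(module):
    """Apply He (Kaiming) initialization to every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        nn.init.zeros_(module.bias)

def train(model, W_2d, losses, epochs=60_000, lr=1e-4, lam=(1000.0, 1.0, 1e-5)):
    """Whole-sequence training: Rprop on a weighted sum of the three loss terms."""
    model.apply(he_init)
    optimizer = torch.optim.Rprop(model.parameters(), lr=lr)
    l_reproj, l_temp, l_wspat = losses
    for _ in range(epochs):
        optimizer.zero_grad()
        S, R = model(W_2d)                       # the entire sequence forms one batch
        loss = (lam[0] * l_reproj(W_2d, R, S)
                + lam[1] * l_temp(S)
                + lam[2] * l_wspat(S))
        loss.backward()
        optimizer.step()
    return model
```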
Experimental environment. DST-NRSfM was implemented in PyTorch and trained on a server with 12 GB RAM, an RTX 2080 Ti graphics card, and the Ubuntu operating system.

4.2. Network Structure Analysis

The necessity for layer normalization layers. To demonstrate the importance of LN layers in the shape-estimation network, we designed a series of controlled trials for shape-estimation networks with and without LN layers using the same loss function on the synthetic faces syn3, the real dataset actor mocap, expressions (sparser), and the Kinect dataset paper. The convergence curves of each dataset are shown in Figure 5.
Figure 5 shows that the LN layers enable the actor mocap and paper models to converge to a lower loss while improving the stability and convergence speed of the expressions and syn3 models. Incorporating the LN layers into the proposed shape-estimation network thus improved the model's problem-solving ability, increased its stability, and hastened convergence.
Effectiveness of temporal constraint and weighted spatial constraint. We conducted ablation experiments on the loss function of the DST-NRSfM network to confirm the effectiveness of the temporal constraint and the newly introduced weighted spatial constraint. Table 1 reports the experimental results. They show that, although the reconstruction results are considerably improved by the temporal constraint alone, adding the weighted spatial constraint further enhances the reconstruction accuracy. Figure 6a–d visualize the reconstruction results of actor mocap, expressions, syn3, and syn4 with and without the spatial constraint in the presence of the $L_{reproj}$ and $L_{temp}$ losses. The results obtained with the spatial constraint are more accurate and smoother, as can be seen in Figure 6a,c,d. In contrast, Figure 6b does not exhibit a notable qualitative difference because the expressions dataset is sparser; however, careful inspection of the reconstructed point cloud (reconstructed points are yellow, and flipped reconstructed points are black) reveals fewer flipped points in the reconstruction obtained with the weighted spatial constraint. This suggests that the weighted spatial constraint aids mesh quality.
Analysis of model computational complexity. For smoother deployment in practical applications, we analyzed the computational complexity of the DST-NRSfM model. As the complexity of a neural network model is mainly determined by two quantities, the number of parameters (Params) and the number of floating-point operations (FLOPs), we report both for the shape estimator and the rotation estimator, as shown in Table 2. It is clearly visible that the computational complexity of the model differs depending on the number of neurons in the middle layer and the input dimensions. We found that model complexity is strongly influenced by dataset density and increases as dataset density increases. Therefore, the framework we designed still has considerable room for improvement. A rough way to estimate both quantities for our fully connected sub-networks is sketched below.
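As a rough guide (the exact counting tool used for Table 2 is not stated, so the numbers below are only indicative), the two quantities for a stack of fully connected layers can be estimated as follows:

```python
import torch.nn as nn

def count_params(model):
    """Number of trainable parameters (the 'Params' rows of Table 2)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def count_fc_flops(model, batch_size=1):
    """Approximate floating-point operations for fully connected layers:
    roughly one multiply and one add per weight and per sample."""
    return sum(2 * m.in_features * m.out_features * batch_size
               for m in model.modules() if isinstance(m, nn.Linear))
```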

4.3. Results and Analysis

Actor mocap. The actor mocap sequence, a real face dataset with 3D ground truth $\mathbf{S}_{gt}$, allows us to analyze the reconstruction results of our method qualitatively and quantitatively. The visual results are shown in the third image of Figure 6a. Compared with the actual images, our method successfully reconstructed the 3D structure of the actor mocap sequence. Furthermore, we compared our approach with face model learning (FML) [50], scalable monocular surface reconstruction (SMSR) [51], consolidating monocular dynamic reconstruction (CMDR) [52,53], the optimization neural network (RONN) [54], and N-NRSfM [9]. The $e_{3D}$ values for actor mocap are listed in Table 3, which reveals that our method is significantly superior to the other methods.
Synthetic Faces. The wide usage of synthetic faces allows us to compare DST-NRSfM with more methods. To understand the efficacy of our approach more intuitively, we compared the experimental outcomes of DST-NRSfM with classical sparse NRSfM methods, such as metric projections (MP) [55], complementary rank-3 spaces (CSF2) [17], the block matrix method (BMM) [20], and the organic-priors-based approach (OP) [3]; traditional dense NRSfM methods, such as the variational approach (VA) [7], the dense spatio-temporal approach (DSTA) [25], CMDR [52,53], the Grassmannian manifold approach (GM) [8], jumping manifolds (JM) [29], SMSR [51], and the probabilistic point trajectory approach (PPTA) [14]; and the latest neural-based dense NRSfM approaches, such as N-NRSfM [9] and RONN [54]. Table 4 presents the final comparative results. OP is the newest method that solves the dense NRSfM problem by extending a sparse approach to the dense domain; however, our proposed framework remains nearly twice as accurate. Although DSTA also employs space–time constraints to enhance reconstruction accuracy, the reconstruction accuracy of the proposed framework is substantially higher because our spatial constraint fundamentally differs from that of DSTA and our solution is more flexible. Additionally, our approach performed better than state-of-the-art learning-based dense NRSfM approaches, despite syn4 being more challenging to reconstruct than syn3. The visual reconstruction outcomes for syn3 and syn4 are shown in Figure 6c,d, respectively.
Kinect Sequences. In contrast to the face sequences, the Kinect sequences exhibit larger deformations, more key points, and a simpler camera trajectory. To ensure a fair evaluation, we preprocessed the Kinect sequences as described by Kumar et al. [8] and ran multi-frame optical flow [56] with default parameters to obtain dense correspondences. The $e_{3D}$ values for the Kinect sequences are presented in Table 5, and visualizations are shown in the second row of Figure 7. Our approach outperformed all competing methods on the Kinect paper sequence. On the t-shirt sequence, the weighted spatial constraint could not be applied owing to insufficient device memory, leading to a suboptimal reconstruction, but the result remains competitive.
Expressions. As the expressions dataset is horizontally shifted and relatively sparse, we set $B = 2$. Using this dataset, we can compare our method with two further algorithms: the kernel shape trajectory approach (KSTA) [57] and the global model with local interpretation (GMLI) [44]. The $e_{3D}$ values for expressions are presented in Table 6, and the reconstruction results are shown in Figure 6b. We achieved $e_{3D} = 0.023$, which is better than the previous best result of 0.026 for this sequence. These experiments demonstrate that DST-NRSfM can also be applied to sparser datasets with significant horizontal shifts.
Pants. The pants dataset collects sequences that document how pants deform when a person jumps and rotates. The extremely violent deformation makes reconstruction very challenging. To highlight the benefits of DST-NRSfM in reconstructing severely deformed objects, we used it to reconstruct the pants dataset. Table 7 presents the quantitative reconstruction results, and the visualization is shown in the last row of Figure 7. The experiments demonstrate that the proposed framework achieves the best reconstruction on this dataset, with accuracy enhanced by nearly 20% over the second-best method. Note that these results were produced without the spatial constraint because the dataset lacks a mesh template.
Real video sequences. We reconstructed several real video sequences: back, heart surgery, and real face. We adjusted the size of B according to the features of each dataset and our experience. The real face sequence contains a significant amount of noise; therefore, we set $B = 1$. Figure 6e shows the effect of the size of B on the reconstruction results. For the back and heart surgery sequences, we assumed that their essential dimensionality was 2; hence, $B = 2$. The reconstructions of the back and heart surgery sequences are shown in the first row of Figure 7. Evidently, DST-NRSfM generates highly realistic results.

5. Conclusions

This study presented a novel end-to-end unsupervised framework, DST-NRSfM, for solving the challenging dense NRSfM problem. We discarded the shape prior and designed a network architecture that leverages the neural network's capacity for learning according to the properties of various datasets. Therefore, even for objects with significant deformations, the best reconstruction results can be achieved with only the reprojection loss and the temporal constraint. To further enhance the reconstruction results, we introduced a new spatial constraint, the weighted spatial constraint, which improves the reconstruction accuracy and mesh quality and smooths the reconstruction. To address the gradient disappearance and model instability observed in our experiments, we applied layer normalization layers to dense NRSfM. Quantitative and qualitative results demonstrate that our method achieves competitive performance with state-of-the-art dense NRSfM methods.
In DST-NRSfM, the number of neurons in the middlemost layer of the shape-estimation network can only be hypothesized based on the features of different datasets, similar to the number of bases in conventional NRSfM techniques. The model framework consists of fully connected networks, resulting in redundant parameters and high computational complexity. Moreover, because the weighted spatial constraint involves matrices of size $P \times P$, where P is the number of feature points, and requires a mesh template for computation, it cannot be applied to arbitrary datasets.
DST-NRSfM substantially enhances the reconstruction accuracy of dense NRSfM and provides a broader perspective of using neural networks to address such issues. Future research will focus on enabling a neural network with low computational complexity to automatically select the most appropriate structure based on the dataset features.

Author Contributions

Conceptualization, Y.W., W.H. and M.J.; methodology, Y.W., W.H. and M.W.; software, M.W.; validation, Y.W., W.H. and M.J.; formal analysis, Y.W. and M.W.; investigation, M.W. and X.Y.; data curation, M.W. and X.Y.; writing—original draft preparation, M.W. and W.H.; writing—review and editing, Y.W., W.H. and M.W.; visualization, Y.W. and M.W.; supervision, Y.W. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Natural Science Foundation of Zhejiang Province (LZ20F020003, LZ21F020003, LSZ19F010001, LY17F020003), and the National Natural Science Foundation of China (61272311, 61672466).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Russell, C.; Fayad, J.; Agapito, L. Dense non-rigid structure from motion. In Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, Zurich, Switzerland, 13–15 October 2012; pp. 509–516. [Google Scholar]
  2. Golyanik, V.; Stricker, D. Dense batch non-rigid structure from motion in a second. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 254–263. [Google Scholar]
  3. Kumar, S.; Van Gool, L. Organic Priors in Non-Rigid Structure from Motion. arXiv 2022, arXiv:2207.06262. [Google Scholar]
  4. Song, J.; Patel, M.; Jasour, A.; Ghaffari, M. A Closed-Form Uncertainty Propagation in Non-Rigid Structure From Motion. IEEE Robot. Autom. Lett. 2022, 7, 6479–6486. [Google Scholar] [CrossRef]
  5. Wang, C.; Lucey, S. PAUL: Procrustean Autoencoder for Unsupervised Lifting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 434–443. [Google Scholar]
  6. Agudo, A.; Montiel, J.; Agapito, L.; Calvo, B. Online Dense Non-Rigid 3D Shape and Camera Motion Recovery. In Proceedings of the BMVC, Nottingham, UK, 1–5 September 2014. [Google Scholar]
  7. Garg, R.; Roussos, A.; Agapito, L. Dense variational reconstruction of non-rigid surfaces from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1272–1279. [Google Scholar]
  8. Kumar, S.; Cherian, A.; Dai, Y.; Li, H. Scalable dense non-rigid structure-from-motion: A grassmannian perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 254–263. [Google Scholar]
  9. Sidhu, V.; Tretschk, E.; Golyanik, V.; Agudo, A.; Theobalt, C. Neural dense non-rigid structure from motion with latent space constraints. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 204–222. [Google Scholar]
  10. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  11. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef] [PubMed]
  12. Zhou, Z.; Shi, F.; Xiao, J.; Wu, W. Non-rigid structure-from-motion on degenerate deformations with low-rank shape deformation model. IEEE Trans. Multimed. 2014, 17, 171–185. [Google Scholar] [CrossRef]
  13. Wang, Y.M.; Zheng, J.B.; Jiang, M.F.; Xiong, Y.L.; Huang, W.Q. A Trajectory Basis Selection Method for Non-Rigid Structure from Motion. In Applied Mechanics and Materials; Trans Tech Publications Ltd.: Bach, Switzerland, 2014; Volume 644, pp. 1396–1399. [Google Scholar]
  14. Agudo, A.; Moreno-Noguer, F. A scalable, efficient, and accurate solution to non-rigid structure from motion. Comput. Vis. Image Underst. 2018, 167, 121–133. [Google Scholar] [CrossRef] [Green Version]
  15. Torresani, L.; Hertzmann, A.; Bregler, C. Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 878–892. [Google Scholar] [CrossRef]
  16. Paladini, M.; Del Bue, A.; Stosic, M.; Dodig, M.; Xavier, J.; Agapito, L. Factorization for non-rigid and articulated structure using metric projections. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2898–2905. [Google Scholar]
  17. Gotardo, P.F.; Martinez, A.M. Non-rigid structure from motion with complementary rank-3 spaces. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3065–3072. [Google Scholar]
  18. Kong, C.; Lucey, S. Prior-less compressible structure from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4123–4131. [Google Scholar]
  19. Agudo, A.; Moreno-Noguer, F. Force-based representation for non-rigid shape and elastic model estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2137–2150. [Google Scholar] [CrossRef] [Green Version]
  20. Dai, Y.; Li, H.; He, M. A simple prior-free method for non-rigid structure-from-motion factorization. Int. J. Comput. Vis. 2014, 107, 101–122. [Google Scholar] [CrossRef] [Green Version]
  21. Parashar, S.; Pizarro, D.; Bartoli, A. Isometric non-rigid shape-from-motion in linear time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4679–4687. [Google Scholar]
  22. Agudo, A.; Moreno-Noguer, F.; Calvo, B.; Montiel, J.M.M. Sequential non-rigid structure from motion using physical priors. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 979–994. [Google Scholar] [CrossRef] [Green Version]
  23. Cha, G.; Lee, M.; Cho, J.; Oh, S. Non-rigid surface recovery with a robust local-rigidity prior. Pattern Recognit. Lett. 2018, 110, 51–57. [Google Scholar] [CrossRef]
  24. Li, X.; Li, H.; Joo, H.; Liu, Y.; Sheikh, Y. Structure from recurrent motion: From rigidity to recurrency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3032–3040. [Google Scholar]
  25. Dai, Y.; Deng, H.; He, M. Dense non-rigid structure-from-motion made easy—A spatial-temporal smoothness based solution. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 4532–4536. [Google Scholar]
  26. Graßhof, S.; Brandt, S.S. Tensor-Based Non-Rigid Structure From Motion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3011–3020. [Google Scholar]
  27. Collins, T.; Bartoli, A. Locally affine and planar deformable surface reconstruction from video. In Proceedings of the International Workshop on Vision, Modeling and Visualization, Siegen, Germany, 15–17 November 2010; pp. 339–346. [Google Scholar]
  28. Bartoli, A.; Gérard, Y.; Chadebecq, F.; Collins, T. On template-based reconstruction from a single view: Analytical solutions and proofs of well-posedness for developable, isometric and conformal surfaces. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2026–2033. [Google Scholar]
  29. Kumar, S. Jumping manifolds: Geometry aware dense non-rigid structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5346–5355. [Google Scholar]
  30. Kumar, S.; Van Gool, L.; de Oliveira, C.E.; Cherian, A.; Dai, Y.; Li, H. Dense Non-Rigid Structure from Motion: A Manifold Viewpoint. arXiv 2020, arXiv:2006.09197. [Google Scholar]
  31. Cha, G.; Lee, M.; Oh, S. Unsupervised 3d reconstruction networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3849–3858. [Google Scholar]
  32. Novotny, D.; Ravi, N.; Graham, B.; Neverova, N.; Vedaldi, A. C3dpo: Canonical 3d pose networks for non-rigid structure from motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7688–7697. [Google Scholar]
  33. Kong, C.; Lucey, S. Deep non-rigid structure from motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1558–1567. [Google Scholar]
  34. Wang, C.; Lin, C.H.; Lucey, S. Deep NRSfM++: Towards Unsupervised 2D-3D Lifting in the Wild. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020. [Google Scholar]
  35. Kumar, S. Non-rigid structure from motion: Prior-free factorization method revisited. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2020; pp. 51–60. [Google Scholar]
  36. Park, S.; Lee, M.; Kwak, N. Procrustean regression: A flexible alignment-based framework for nonrigid structure estimation. IEEE Trans. Image Process. 2017, 27, 249–264. [Google Scholar] [CrossRef] [PubMed]
  37. Park, S.; Lee, M.; Kwak, N. Procrustean regression networks: Learning 3d structure of non-rigid objects from 2d annotations. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–18. [Google Scholar]
  38. Lee, M.; Cho, J.; Choi, C.H.; Oh, S. Procrustean normal distribution for non-rigid structure from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1280–1287. [Google Scholar]
  39. Zeng, H.; Dai, Y.; Yu, X.; Wang, X.; Yang, Y. PR-RRN: Pairwise-Regularized Residual-Recursive Networks for Non-rigid Structure-from-Motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5600–5609. [Google Scholar]
  40. Deng, H.; Zhang, T.; Dai, Y.; Shi, J.; Zhong, Y.; Li, H. Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective. arXiv 2022, arXiv:2204.04730. [Google Scholar]
  41. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  42. Valgaerts, L.; Wu, C.; Bruhn, A.; Seidel, H.P.; Theobalt, C. Lightweight binocular facial performance capture under uncontrolled lighting. ACM Trans. Graph. 2012, 31, 1–11. [Google Scholar] [CrossRef] [Green Version]
  43. Varol, A.; Salzmann, M.; Fua, P.; Urtasun, R. A constrained latent variable model. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2248–2255. [Google Scholar]
  44. Agudo, A.; Moreno-Noguer, F. Global model with local interpretation for dynamic shape reconstruction. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 264–272. [Google Scholar]
  45. White, R.; Crane, K.; Forsyth, D.A. Capturing and animating occluded cloth. ACM Trans. Graph. 2007, 26, 34–es. [Google Scholar] [CrossRef]
  46. Russell, C.; Fayad, J.; Agapito, L. Energy based multiple model fitting for non-rigid structure from motion. In Proceedings of the CVPR 2011, Providence, RI, USA, 20–25 June 2011; pp. 3009–3016. [Google Scholar]
  47. Stoyanov, D. Stereoscopic scene flow for robotic assisted minimally invasive surgery. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2012; pp. 479–486. [Google Scholar]
  48. Riedmiller, M.; Braun, H. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, CA, USA, 28 March–1 April 1993; pp. 586–591. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 1–13 December 2015; pp. 1026–1034. [Google Scholar]
  50. Tewari, A.; Bernard, F.; Garrido, P.; Bharaj, G.; Elgharib, M.; Seidel, H.P.; Pérez, P.; Zollhofer, M.; Theobalt, C. Fml: Face model learning from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10812–10822. [Google Scholar]
  51. Ansari, M.D.; Golyanik, V.; Stricker, D. Scalable dense monocular surface reconstruction. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 78–87. [Google Scholar]
  52. Golyanik, V.; Jonas, A.; Stricker, D. Consolidating segmentwise non-rigid structure from motion. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–6. [Google Scholar]
  53. Golyanik, V.; Jonas, A.; Stricker, D.; Theobalt, C. Intrinsic dynamic shape prior for fast, sequential and dense non-rigid structure from motion with detection of temporally-disjoint rigidity. arXiv 2019, arXiv:1909.02468. [Google Scholar]
  54. Wang, Y.; Peng, X.; Huang, W.; Wang, M. A Convolutional Neural Network for Nonrigid Structure from Motion. Int. J. Digit. Multimed. Broadcast. 2022, 2022, 3582037. [Google Scholar] [CrossRef]
  55. Paladini, M.; Del Bue, A.; Xavier, J.; Agapito, L.; Stošić, M.; Dodig, M. Optimal metric projections for deformable and articulated structure-from-motion. Int. J. Comput. Vis. 2012, 96, 252–276. [Google Scholar] [CrossRef]
  56. Garg, R.; Roussos, A.; Agapito, L. A variational approach to video registration with subspace constraints. Int. J. Comput. Vis. 2013, 104, 286–314. [Google Scholar] [CrossRef] [PubMed]
  57. Gotardo, P.F.; Martinez, A.M. Kernel non-rigid structure from motion. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 802–809. [Google Scholar]
Figure 1. Overall framework of the deep spatial–temporal non-rigid structure from motion (DST-NRSfM) model.
Figure 2. Schematic of the reconstructor: (a) Shape estimator for predicting the 3D structure of the reconstructed object. (b) Rotation estimator for estimating camera poses in each frame.
Figure 3. The architecture of the feature encoder $f_e$ and feature decoder $f_d$. All layer widths increase or decrease exponentially. $B$ is determined by the characteristics of the reconstructed object, $k = \log_2 B$, and $n \in (10, k)$. $N$ and $M$ mainly depend on the size of $B$, with $N \in [2, 4]$ and $M \in [0, 3]$.
Figure 4. Rotation matrix corrector workflow.
Figure 5. Loss decline curves with and without layer normalization (LN) layers in the proposed network architecture, trained with the reprojection loss and the temporal constraint, on expressions, actor mocap, paper, and syn3. (a) expressions; (b) actor mocap; (c) paper; (d) syn3.
Figure 6. Visualized ablation and contrast experiment results. From left to right, genuine pictures, and 3D reconstruction without and with the weighted spatial constraint are displayed in (ad). (ad) depict the ablation experiment results of actor mocap, expressions, syn3, and syn4, respectively; the yellow points in (b) are the reconstructed point clouds, whereas the black points are the inverted 3D points. From left to right, (e) displays the true images of the real face dataset and the front and side views of the reconstruction results when the number of neurons in the middlemost layer of the shape-estimation network is B = 1 and B = 64 .
Figure 7. Visualizations of real sequence reconstruction results. The first row shows the reconstructed results of back and heart surgery sequences. The second row is the visualization of Kinect sequences’ reconstruction results. The third row shows the reconstruction effect of the heavily deformed sparse pants sequence, where DST-NRSfM achieves the lowest e 3 D among all examined approaches. The reconstruction result of the real face sequence is shown in Figure 6e.
Table 1. Results of ablation experiments.
Loss | syn3 | syn4 | Expressions | Actor Mocap | Paper
$L_{reproj}$ | 0.1196 | 0.1201 | 0.2205 | 0.2624 | 0.0376
$L_{reproj} + L_{temp}$ | 0.0149 | 0.0255 | 0.0273 | 0.0174 | 0.0342
$L_{reproj} + L_{temp} + L_{w\_spat}$ | 0.0134 | 0.0241 | 0.0226 | 0.0163 | 0.0296
Table 2. Computational complexity analysis of DST-NRSfM model.
Dataset | Actor Mocap (B = 64) | syn3 (B = 64) | Paper (B = 2) | Expressions (B = 2) | Heart Surgery (B = 2) | Back (B = 2) | Real Face (B = 1)
Shape Estimator FLOPs | 8.06 × 10^10 | 6.40 × 10^9 | 2.51 × 10^10 | 1.88 × 10^9 | 2.43 × 10^10 | 7.00 × 10^9 | 7.62 × 10^9
Shape Estimator Params | 8.07 × 10^7 | 6.47 × 10^7 | 1.31 × 10^8 | 4.90 × 10^7 | 1.21 × 10^8 | 4.67 × 10^7 | 6.35 × 10^7
Rotation Estimator FLOPs | 7.51 × 10^9 | 5.93 × 10^9 | 2.36 × 10^10 | 1.05 × 10^9 | 2.28 × 10^10 | 6.39 × 10^9 | 7.05 × 10^9
Rotation Estimator Params | 7.51 × 10^7 | 5.99 × 10^7 | 1.22 × 10^8 | 2.74 × 10^6 | 1.14 × 10^8 | 4.26 × 10^7 | 5.87 × 10^7
Table 3. $e_{3D}$ on the actor mocap sequence.
Dataset | FML [50] | SMSR [51] | CMDR [52] | RONN [54] | N-NRSfM [9] | Ours (B = 64)
actor mocap | 0.092 | 0.054 | 0.0257 | 0.0226 | 0.0181 | 0.0163
Table 4. $e_{3D}$ on the synthetic face sequences (syn3 and syn4).
Classical sparse NRSfM methods and neural dense NRSfM methods:
Dataset | MP [55] | CSF2 [17] | BMM [20] | OP [3] | N-NRSfM [9] | RONN [54] | Ours (B = 64)
syn3 | 0.0611 | 0.5474 | 0.0784 | 0.0279 | 0.0320 | 0.0309 | 0.0134
syn4 | 0.0762 | 0.5292 | 0.0918 | 0.0419 | 0.0389 | 0.0359 | 0.0263
Conventional dense NRSfM methods:
Dataset | VA [7] | DSTA [25] | GM [8] | JM [29] | PPTA [14] | CMDR [52] | SMSR [51] | Ours (B = 64)
syn3 | 0.0346 | 0.0374 | 0.0294 | 0.0280 | 0.0390 | 0.0324 | 0.0304 | 0.0134
syn4 | 0.0379 | 0.0428 | 0.0309 | 0.0327 | 0.052 | 0.0369 | 0.0319 | 0.0263
Table 5. $e_{3D}$ on the Kinect paper and t-shirt sequences.
Dataset | MP [55] | DSTA [25] | GM [8] | JM [29] | N-NRSfM [9] | Ours
paper | 0.0827 | 0.0612 | 0.0394 | 0.0338 | 0.0332 | 0.0296 (B = 2)
t-shirt | 0.0741 | 0.0636 | 0.0362 | 0.0386 | 0.0309 | 0.0396 (B = 2)
Table 6. $e_{3D}$ on the expressions dataset.
Dataset | CSF2 [17] | KSTA [57] | GMLI [44] | N-NRSfM [9] | RONN [54] | Ours (B = 2)
expressions | 0.030 | 0.035 | 0.026 | 0.026 | 0.026 | 0.023
Table 7. $e_{3D}$ on the pants dataset.
Dataset | KSTA [57] | BMM [20] | PPTA [14] | Ours (B = 1)
pants | 0.220 | 0.183 | 0.203 | 0.148
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
