In this section, we first present our approach to modeling identities under noisy environments and a method for constructing stable representations; additionally, we describe the sparse dictionary learning structure for feature selection. Finally, we present our overall framework for jointly learning effective discriminative representations.
3.1. The Stochastic Process Model on Point Cloud Faces
Let one identity be denoted as
${X}_{i}$, with
i ranging through the different identities; we consider point cloud faces as realizations of a modeled noisy observation process as follows:
where each sample only provides spatial coordinates and can be expressed as a point set
${x}_{i}=\{{\mathrm{r}}_{k}\in {\mathbb{R}}^{3}:k=1\dots K\}$, where
K is usually on the order of ${10}^{4}$. The
${L}_{\theta}$ is a function that models the above geometric transformation, illumination, and occlusion variances. In addition, the
$\theta $ can be seen as a low-dimensional random vector encoding the global illumination and rigid affine transformations [33], though, in general, the ergodicity of ${L}_{\theta}$ is hard to satisfy.
As a possible solution, a graph-based method [14] applied a dynamic approximation procedure to transform raw point clouds into uniform point/vertex sequences with lengths in the thousands; these were then used to compute the corresponding embedded features in order to imitate a universal characteristic representation.
This method shed light on defining signals with near-independent distributions between global and local variables; however, it required heavy training to achieve asymptotic stability, which is neither available nor necessary when modeling more consistent structures, e.g., faces, where geometric deformation and/or illumination/pose variances do not lead to large universal interferences.
Based on these observations, we built Euclidean lattices and learned spatially aware features using solid scattering transform-based local operators; by applying subsequent sparse dictionary learning in the scattering domain, the uncorrelated signal components of the identity in question were jointly learned and inhibited.
The local 3D lattice operator: We used
${p}_{0}$ to denote the centroid of
${x}_{i}$, which was easy to obtain, and from
${p}_{0}$ we constructed the 3D global coordinate system,
${M}_{xyz}$. The succeeding step applied farthest point (FP) sampling [10] on ${x}_{i}$; note that we only needed to draw $C<200$ points as query points, ${\mathrm{P}}_{0}={\left\{{p}_{c}\right\}}_{c\le C}$, for the subsequent spatially thresholded nearest-neighbor search. From each
${p}_{c}$, we drew the N nearest points from their ambient spaces to form a leaf subset:
where we picked a threshold radius, ${R}_{c}=\underset{p\in {P}_{0}\setminus \left\{{p}_{c}\right\}}{\min}\|{p}_{c}-p\|$, to dynamically assure coverage. Furthermore, within each ambient space, we associated a 3D local lattice coordinate system ${\mu}_{xyz}\subset {M}_{xyz}$ and defined the overall density estimation function as the concatenation of C local areas:
where each
${\widehat{\rho}}_{c}$ was parameterized by
which is a sum of the Gaussian densities centered on each
${r}_{c,n}$. This spatial construction sliced each
${x}_{i}$ into
C local receptive fields; we set the width parameter, $\sigma $, of each Gaussian to the distance from the nearest alternative entry point, i.e., ${\sigma}_{c}\to \sup \left(\|{r}_{c}-{r}_{c}^{\prime}\|\right)\ \forall r\in {N}_{c}$. By renormalizing each ${\rho}_{c}$ with the indicator function ${\mathrm{I}}_{{\rho}_{c}}$, a raw point was transformed into a naive Borel set with a uniform probability measure, so we defined the global piecewise density function as ${\rho}_{\Omega}={\bigcup}_{c\le C}{\rho}_{c}$.
This approach encoded a raw point cloud face into a more regular, continuous probability density representation, with local fields being invariant to permutations of the input order; each characteristic vector also had a fixed length, which enabled windowed operations (see Figure 4 for an illustration).
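The sampling and grouping steps above can be sketched in NumPy as follows; the helper names `farthest_point_sampling` and `leaf_subsets` are ours, and seeding the FP sampling from the centroid is an assumption based on the description above, not the authors' implementation:

```python
import numpy as np

def farthest_point_sampling(points, C, start=None):
    """Greedily draw C query points: each new point maximizes the
    distance to all points already chosen."""
    if start is None:
        # seed from the point nearest the centroid p_0 (assumed heuristic)
        start = int(np.argmin(np.linalg.norm(points - points.mean(0), axis=1)))
    idx = np.zeros(C, dtype=int)
    idx[0] = start
    dist = np.linalg.norm(points - points[start], axis=1)
    for c in range(1, C):
        idx[c] = int(np.argmax(dist))
        dist = np.minimum(dist, np.linalg.norm(points - points[idx[c]], axis=1))
    return idx

def leaf_subsets(points, query_idx, N):
    """For each query point p_c, gather its N nearest neighbours and the
    threshold radius R_c = min distance to the other query points."""
    P0 = points[query_idx]
    leaves, radii = [], []
    for c, pc in enumerate(P0):
        d = np.linalg.norm(points - pc, axis=1)
        leaves.append(points[np.argsort(d)[:N]])
        others = np.delete(P0, c, axis=0)
        radii.append(np.linalg.norm(others - pc, axis=1).min())
    return np.stack(leaves), np.array(radii)
```

Each leaf subset holds the N nearest neighbors of a query point, and the radius corresponds to the thresholding rule $R_c$ above.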
However, the above isometric deformations not only broke the order consistency but also gave rise to mixed deformations that polluted the geometric features; therefore, we needed to add a stabilizer to obtain rotation and translation invariance.
To illustrate the operation, a piece of pseudocode is given below as Algorithm 1:
Algorithm 1 The Local Lattice Operation
Require: ${p}_{0}=({x}_{0},{y}_{0},{z}_{0})$ (the centroid of a raw face scan ${x}_{i}$), ${x}_{i}=\{{\mathrm{r}}_{k}\in {\mathbb{R}}^{3}:k=1\dots K\}$
1: Set ${p}_{0}$ as the initial point of farthest point sampling and
2: Draw C points ${\left\{{p}_{c}\right\}}_{c\le C}$ from ${x}_{i}$
3: for ${p}_{c}$ in ${\left\{{p}_{c}\right\}}_{c\le C}$ do
4:  Set ${p}_{c}$ as the origin, and
5:  Compute ${P}_{c}=KNN({p}_{c},N)$
6:  Compute the local lattice $\mu =\left\{m\cdot dx,n\cdot dy,o\cdot dz\right\}$
7:  Compute the local density estimation ${\widehat{\rho}}_{c}\leftarrow {\sum}_{n=1}^{N}G(\mu -{r}_{c,n})$
8: end for
9: Concatenate the local densities to form the overall function ${\widehat{\rho}}_{\Omega}\left(\mu \right)=({\widehat{\rho}}_{1},\dots ,{\widehat{\rho}}_{c},\dots ,{\widehat{\rho}}_{C})$
10: return ${\widehat{\rho}}_{\Omega}$
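Steps 6–7 of Algorithm 1 (the local lattice and the Gaussian density estimate) might look as follows in NumPy; the lattice half-width heuristic and the name `local_density` are our assumptions:

```python
import numpy as np

def local_density(leaf, center, grid_size=8, sigma=0.05):
    """rho_c evaluated on a grid_size^3 local lattice centred on p_c:
    a sum of isotropic Gaussians G(mu - r_{c,n}) over the N leaf points."""
    half = grid_size * sigma / 2          # lattice half-width (heuristic choice)
    axis = np.linspace(-half, half, grid_size)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1) + center       # (g, g, g, 3)
    diff = grid[..., None, :] - leaf                      # (g, g, g, N, 3)
    sq_dist = (diff ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2)).sum(axis=-1)  # (g, g, g)
```

Concatenating the C resulting grids yields the overall piecewise density $\widehat{\rho}_{\Omega}$ of step 9.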

Note that since the direction of the local lattice $\mu $ was exactly covariant with the global coordinate system of the scanned face, the local density feature obtained above was at risk of being sensitive to rigid rotation, as well as to the order permutation of the grid position points (see Figure 5 for an illustration). Therefore, we needed to construct a stabilizer to eliminate such isometry.
Windowed solid harmonic wavelet scattering: A scattering transform is a geometric deep learning framework that replaces learning-based cross-correlation kernels with predefined filter banks. Induced stability to multiple kinds of isometries, together with translation invariance, can be prescribed with a group-invariant structure built from a deliberately configured wavelet filter bank [19,20]. For 2D signals (e.g., images), the constitutive substructure in a scattering network comprises wavelet filters with zero integrals and fast decay along $\|\mu \|$; each can be parameterized by a rotation parameter, $\theta $, and a dilation parameter, j, as
where $r\in G$ belongs to a finite rotation group of ${\mathbb{R}}^{d}$.
For 3D signals, as in 3D face recognition with point cloud samples, 3D rotation invariance is crucial, since random pose variation may alias the local density feature obtained by our local lattice operator (see Figure 5). Accordingly, we built a stabilizer with the solid harmonic scattering approach from [8]: by solving the Laplacian equation in 3D spherical coordinates and replacing the exponent term in the spherical harmonic function, ${Y}_{\ell}^{m}$, the solid harmonic wavelet can be expressed as follows:
In addition, by summing up the energies over
m, a 3D covariant modulus operator can be defined as
In short, a solid harmonic scattering transform is defined as the operation of summing the above modulus coefficients over $\mu $ to produce a translation/rotation-invariant representation within each local field (see Figure 6 for an illustration). Furthermore, by raising $Ux[j,\ell ]\rho \left(\mu \right)$ to the exponent q and then subsampling $\mu $ at ${2}^{j\alpha}$ with an oversampling factor $\alpha =1$ to avoid aliasing, the first-order solid scattering coefficients are
Then, by iterating the subsampling at intervals ${2}^{{j}_{2}\alpha}$ with ${j}_{2}>{j}_{1}$ and recomputing the scattering coefficients on the first-order output, we obtained the following second-order scattering transform:
These representations can hold local invariant spatial information up to a predefined scale, ${2}^{J}$; in our case, this was adjusted to match the local threshold diameter, ${N}_{{R}_{c}}$. Furthermore, we needed to extend this operation to a universal representation; here, we defined the windowed first- and second-order solid harmonic wavelet scattering as follows:
Figure 6.
Illustration of the localized solid harmonic scattering transformation; the dashed blocks represent the extracted invariant representations from each local ambient space.
For a better illustration, a brief pseudocode is stated below as Algorithm 2:
Algorithm 2 Windowed Solid Harmonic Wavelet Scattering
Require: ${\widehat{\rho}}_{\Omega}$ (local density features), J (scale parameter), L (rotation phase parameter), Q (exponential parameter)
1: Set the wavelet ${\psi}_{\ell ,m}(r,\theta ,\phi )$ according to the predefined parameters $(J,L)$ with Equation (6)
2: for ${\rho}_{c}$ in ${\left\{{\rho}_{c}\right\}}_{c\le C}$ do
3:  for $0\le j\le J$ do
4:   Compute the dilated modulus operation on the scattering convolution ${\rho}_{c}\ast {\psi}_{\ell ,m,j}\left(\mu \right)$, e.g., Equation (7)
5:  end for
6:  Compute the first-order coefficients $S[{j}_{1},\ell ,q]\rho $ as in Equation (8)
7:  Compute the second-order coefficients $S[{j}_{1},{j}_{2},\ell ,q]\rho $ as in Equation (9)
8:  Concatenate the first- and second-order coefficients as the local invariant representation ${S}_{{\rho}_{c}}$
9: end for
10: Concatenate ${\left\{{S}_{{\rho}_{c}}\right\}}_{c\le C}$ as the global invariant representation ${S}_{{\rho}_{\Omega}}$
11: return ${S}_{{\rho}_{\Omega}}$
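As a rough illustration of the modulus operator and first-order coefficients on a voxelized local density, the following NumPy sketch handles only $\ell \in \{0,1\}$ with a real solid-harmonic basis ({x, y, z} spans the same space as ${Y}_{1}^{m}$ up to constants); it is a toy stand-in under these assumptions, not the paper's implementation (which uses Kymatio):

```python
import numpy as np
from numpy.fft import fftn, ifftn

def solid_harmonic_modulus(rho, j, ell, sigma0=1.0):
    """Covariant modulus U[j, ell]rho: convolve rho with Gaussian-windowed
    solid harmonics dilated by 2**j and sum the energies over m."""
    g = rho.shape[0]
    ax = np.arange(g) - g // 2
    X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
    s = sigma0 * 2 ** j
    gauss = np.exp(-(X**2 + Y**2 + Z**2) / (2 * s**2))
    # ell = 0: the Gaussian itself; ell = 1: real basis {x, y, z} * Gaussian
    filters = [gauss] if ell == 0 else [X * gauss, Y * gauss, Z * gauss]
    Fr = fftn(rho)
    energy = np.zeros(rho.shape)
    for f in filters:
        conv = ifftn(Fr * fftn(np.fft.ifftshift(f))).real  # FFT convolution
        energy += conv ** 2
    return np.sqrt(energy)

def first_order_coeffs(rho, J=3, ells=(0, 1), qs=(0.5, 1.0, 2.0)):
    """S[j, ell, q]: integrate U[j, ell]rho(mu)**q over the local lattice."""
    return np.array([[[(solid_harmonic_modulus(rho, j, ell) ** q).sum()
                       for q in qs] for ell in ells] for j in range(J)])
```

Summing the modulus over the lattice removes the dependence on translation within each local field, which is the invariance the algorithm above relies on.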

3.2. Piece-Wise Smoothed Solid Harmonic Scattering Coefficient Representation
The above strategy makes the representation stable to local deformations, and since face point clouds share a largely consistent global structure, it allows us to represent them even if no effective global embedding exists.
To balance the computation complexity and resolution in our experiments, we chose
$C=128$,
$J=7$, and
$q\in Q=\{1/2,1,2,3\}$; here, the above windowed operation and scattering coefficients were implemented with the Kymatio software package [
34]. To simplify our notation, we wrote the scattering representation in shorthand as follows:
where
p is the union of the first and second indices
$\{({j}_{1},\ell ,q)$ and
$({j}_{1},{j}_{2},\ell ,q)\}$, respectively, and the overall scattering coefficients of a point cloud face are
To give an illustration of this representation, we mapped the firstorder scattering coefficients of two identities (bs001 and bs070) onto the scattering indices shown in
Figure 7; identity bs001 from Bosphorus had two realizations and we could see that, although there was a significant visual difference between them, their scattering coefficients had a similar appearance.
To enhance the discrimination of this representation, and inspired by [7], in the next section, we construct a “facial scattering coefficients dictionary” from the above representation to associate multiscale properties for 3D facial recognition. Specifically, based on the good results achieved with 2D scattering coefficients in [9], we follow their idea of utilizing supervised dictionary learning to select the most relevant classification features from the 3D scattering coefficients.
3.3. Constructing a Local Dictionary with Semi-Supervised Sparse Coding on Scattering Coefficients
The scattering representation brings desired properties including Lipschitz continuity to translations and deformations [
19]; however, the overall structure, constructed using FPS, local nearest-neighbor search, and normalization, assumes a uniform energy distribution among the realizations of each identity. In real scan scenarios, this assumption can be violated when perturbations, e.g., occlusion or rigid overall rotation, break the integrity of the face samples; in some severe cases, a certain portion of points is missing from face samples in Bosphorus. To reduce this category of intra-class variance, we adopted the homotopy dictionary learning framework presented in [9] to build a local coefficient dictionary. Then, we trained the network to select the most relevant classification features from the scattering coefficients.
Supervised Sparse Dictionary Learning: The idea of selecting the sparse combination of functions from redundant/overcomplete dictionaries to match the random signal structure was presented by [
35] and flourishes in multiple fields related to signal processing. Supervised dictionary learning was first presented by [
36] to solve the following optimization problem:
where
$\Theta $ indicates a simple classifier’s parameters;
ℓ is the loss function for computing the penalization on the prediction
$\left({x}_{j},{y}_{j}\right)$;
${\alpha}^{*}$ is the sparse code of the input signal,
${x}_{j}$, with the learned dictionary,
D.
In our problem, the input signal,
$S{\rho}_{\Omega}$, had a union form; hence, we constructed a global dictionary with structured local dictionaries defined as:
where
${\left\{{D}_{c}\right\}}_{c=1}^{C}$ are C sub-dictionaries with a certain structure, $D\in {\mathbb{R}}^{K\times C\times N}$. Here, K indicates the length of the local pseudo coordinate, p, of $S{\rho}_{\Omega}$; the aim was to represent
B input samples (
B—batch size) as linear combinations of
D’s elements. Each
${D}_{c}$ had
$N=512$ normalized atoms/columns—
${\left\{{d}_{n}\right\}}_{n=1}^{N}\in {\mathbb{R}}^{K}$. Then, the sparse approximation optimization was used to solve
where
${\alpha}_{i}$ is the concatenated sparse code ${\alpha}_{i}=\left[{\alpha}_{i,1},\dots ,{\alpha}_{i,c},\dots ,{\alpha}_{i,C}\right]$. Suppose the optimized sparse coefficient matrix is
$A\in {\mathbb{R}}^{K\times N\times C\times B}$ for a batch of input signals,
${\left\{S{\rho}_{i}\right\}}_{i=1}^{B}\in {\mathbb{R}}^{K\times C\times B}$, where each subdictionary has a local code
${\alpha}_{c}\in {\mathbb{R}}^{N\times B}$.
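The structured shapes above can be made concrete with a toy NumPy example; the sizes are illustrative (the paper uses N = 512), and the random codes merely stand in for the optimized sparse codes:

```python
import numpy as np

K, C, N, B = 24, 8, 64, 4                  # toy sizes; the paper uses N = 512
rng = np.random.default_rng(0)

# C sub-dictionaries D_c, each with N unit-norm atoms/columns d_n in R^K
D = rng.standard_normal((C, K, N))
D /= np.linalg.norm(D, axis=1, keepdims=True)

S_rho = rng.standard_normal((C, K, B))     # a batch of scattering features
alpha = rng.standard_normal((C, N, B))     # local codes alpha_c in R^{N x B}

# each local field is approximated by its own sub-dictionary: D_c @ alpha_c
recon = np.einsum("ckn,cnb->ckb", D, alpha)
```

The block structure keeps each local field tied to its own sub-dictionary, so occlusion in one field perturbs only the corresponding local code.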
Expected Scattering Operation: Since we regrouped the raw point clouds and individually computed the invariant representations, $S{\rho}_{c}$, the windowed representation also had a non-expansive property; within each local field, the translation term becomes negligible as $J\to \infty $.
In practice, this can introduce ambiguity: with a small C, each field becomes too large, resulting in the loss of higher-frequency components; yet, for a larger C, the computational complexity amounts to $O\left(CS\right)$, and the optimization of such a concatenation leads to additional complications, e.g., vanishing gradients.
Thanks to the integral normalized scattering transform [
20], which preserves a uniform norm by utilizing the non-expansive operator ${\overline{S}}_{{}_{C}}$, we cast our question as structured learning for random processes under the assumed condition. Since the underlying distributions of point cloud faces have yet to be established in practice, we followed the above intuition and tentatively assumed our representation to be a stationary process up to negligible higher-order components; thus, the metric among (a batch of) spatial realizations reduces to a summation of the mean-square distances:
This definition is simple but effective as a regression term with a forward–backward approximation, which is based on an operation called proximal projection [
37].
where the solution comprises a forward step, computing a reconstruction $\tilde{S}\rho $, and a backward step, feeding it back into the proximal projection operator and updating $\lambda $ and D. Since our aim was to implement an efficient classification model, the sparse code should preserve the principal components of the input signal; additionally, from experimental observation of the point cloud faces’ solid scattering coefficients, we saw that most of the energy was carried by a few lower-frequency components characterized by larger-magnitude coefficients. Therefore, we picked the recent generalized ISTC algorithm [9], which adopts an auxiliary dictionary to accelerate convergence. Here, the ReLU function acts as a positive soft-thresholding implementation of the proximal projection. Then, the optimization can be reached in an unsupervised $n\le N$ iteration-updating scheme, expressed as follows:
The overall architecture is shown in
Figure 8.
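A minimal NumPy sketch of such an iteration scheme follows, with ReLU soft thresholding and a geometrically decreasing threshold in the spirit of ISTC [9]; the 1/L step-size safeguard is our addition, and the auxiliary dictionary of the full algorithm is omitted:

```python
import numpy as np

def relu_threshold(x, lam):
    """Positive soft thresholding implemented with a ReLU."""
    return np.maximum(x - lam, 0.0)

def istc_encode(D, beta, n_iters=50, lam_max=2.0, lam_star=0.1):
    """Sparse-code beta on dictionary D: a gradient step on the residual
    followed by ReLU thresholding, with the threshold decreasing
    geometrically from lam_max to lam_star over n_iters iterations."""
    L = np.linalg.norm(D, 2) ** 2  # step-size safeguard (Lipschitz constant)
    alpha = np.zeros(D.shape[1])
    for n in range(1, n_iters + 1):
        lam = lam_max * (lam_star / lam_max) ** (n / n_iters)
        alpha = relu_threshold(alpha + D.T @ (beta - D @ alpha) / L, lam / L)
    return alpha
```

The decreasing schedule lets early iterations suppress spurious atoms aggressively while late iterations refine the surviving nonnegative coefficients.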
To effectively demonstrate our methods, a pseudocode is given in Algorithm 3:
Algorithm 3 Dictionary learning on local facial coefficients
Require: $\left\{(y,x)\right\}$ (training set), D (initial dictionary), ${\lambda}_{1}$ (initial Lagrange multiplier, e.g., thresholding bias), $\theta $ (classifier parameters), N (number of iterations), $\tau $ (learning rate), v (regularization parameter)
1: Draw a batch of samples ${\left\{({y}_{j},{x}_{j})\right\}}_{j\le Batchsize}$ and compute the scattering coefficients using the methods in Section 3.2; denote the coefficient vectors as ${\left\{{\beta}_{j}\right\}}_{j\le Batchsize}={\left\{{S}_{{\rho}_{\Omega}}\right\}}_{j\le Batchsize}$
2: for $1\le j\le Batchsize$ do
3:  for $1\le n\le N$ do
4:   Compute ${\alpha}_{n}={\alpha}_{n-1}+{D}^{T}({\beta}_{j}-D{\alpha}_{n-1})-{\lambda}_{n}$ where ${\alpha}_{0}=0$
5:   Compute ${\tilde{\alpha}}_{n}=ReL{U}_{{\lambda}_{n}}\left({\alpha}_{n}\right)$
6:   Update ${\lambda}_{n}={\lambda}_{max}{\left(\frac{{\lambda}_{max}}{{\lambda}_{*}}\right)}^{-n/N}$
7:  end for
8:  Compute ${\lambda}_{N}={\lambda}_{*}$ and ${\alpha}_{N}=ReL{U}_{{\lambda}_{N}}\left({\alpha}_{N-1}\right)$
9: end for
10: Compute the classification loss ${\sum}_{j}Loss(D,{\lambda}_{N},\theta ,{\beta}_{j},{y}_{j})$
11: Update the parameters by a projected gradient step [36]:
12:  $\theta \leftarrow {\prod}_{\theta}\left[\theta -\tau \left({\nabla}_{\theta}Loss(D,{\lambda}_{N},\theta ,{\beta}_{j},{y}_{j})+v\theta \right)\right]$
13:  $D\leftarrow {\prod}_{D}\left[D-\tau \left({W}^{T}D{\alpha}_{N}+\Delta (S\rho -DA)\right)\right]$
14: return Learned Dictionary D
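The projected gradient update of the dictionary D can be illustrated with a simplified stand-in that descends the plain reconstruction residual and re-projects every atom onto the unit sphere; the name `dictionary_step` and the substitution of the residual gradient for the W/Δ terms are our assumptions:

```python
import numpy as np

def dictionary_step(D, beta, alpha, tau=0.1):
    """One projected gradient step: descend the gradient of
    0.5 * ||beta - D @ alpha||^2 w.r.t. D, then renormalize each
    atom/column back to unit norm (the projection Pi_D)."""
    residual = beta - D @ alpha
    grad = -np.outer(residual, alpha)      # d/dD of the reconstruction term
    D = D - tau * grad
    return D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
```

Renormalizing the columns after each step keeps the atoms on the unit sphere, so the thresholding bias retains a consistent scale across iterations.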
