1. Introduction
Three-dimensional (3D) face recognition (FR) has been studied for several decades, with a wide variety of methods proposed [1,2,3,4,5]. It is believed that 3D face data have intrinsic advantages over 2D face images in detecting presentation attacks and in providing additional discriminative features for FR [6,7]. Yet, 3D FR had not gained popularity in real-world applications until Apple Inc. released its iPhone X [8] with the TrueDepth camera and Face ID in 2017. One reason is that the scanners used to acquire 3D faces in previous studies are often bulky and expensive, and thus infeasible in practical scenarios, even though those studies [1,2,3] achieved very high recognition accuracy using the captured high-quality 3D face data (see Table 1). We categorize these methods as high-quality depth-based FR.
The emergence of low-cost RGB-D sensors, such as Kinect [9] and RealSense [10], makes it possible to capture 3D faces more efficiently and more cost-effectively. Many attempts [11,12,13,14,15,16] have been made in recent years to develop practical FR systems based on RGB-D sensors. As shown in Table 1, in some RGB-D FR scenarios, with depth images as auxiliary information, researchers [13,15] have shown that FR accuracy can be improved compared with using only RGB images. However, the accuracy achieved with depth images captured by low-cost RGB-D sensors [11,12,13,15] is still much lower than that achieved with 3D faces captured by 3D scanners [1,3]. This can be attributed to the generally poor quality of the depth images captured by low-cost RGB-D sensors; we call such data low-quality depth data (see Figure 1). In contrast to the aforementioned high-quality depth-based FR, we categorize these methods as low-quality depth-based FR. Figure 1 shows depth images captured by different sensors, which naturally differ in resolution and precision. Here, resolution (also known as density) refers to the density of the 3D face point cloud, defined by the number of points used to represent the 3D face; higher resolution means more detail captured on the 3D face. Precision refers to the minimum measurement error of depth values in millimeters; smaller values thus indicate higher precision.
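To make these two notions concrete, the following pure-Python sketch (with entirely hypothetical point values) simulates lowering resolution by subsampling a point cloud and lowering precision by quantizing depth values to a coarser step:

```python
# Toy illustration (hypothetical values): lower "resolution" by subsampling
# a point cloud, and lower "precision" by quantizing depth to a coarser step.

def subsample(points, keep_every):
    """Reduce resolution: keep one point out of every `keep_every`."""
    return points[::keep_every]

def quantize_depth(points, step_mm):
    """Reduce precision: round each z (depth, in mm) to the nearest step."""
    return [(x, y, round(z / step_mm) * step_mm) for (x, y, z) in points]

# A real high-quality scan would have ~100K points; 12 points are enough
# to show the mechanics here.
cloud = [(i, i % 3, 500.0 + 0.37 * i) for i in range(12)]

low_res = subsample(cloud, 3)          # 12 -> 4 points
low_prec = quantize_depth(cloud, 1.0)  # depth snapped to 1 mm steps

print(len(cloud), len(low_res))     # 12 4
print(cloud[1][2], low_prec[1][2])  # 500.37 500.0
```

Both degradations are lossy in different ways: subsampling discards surface detail, while quantization blurs fine depth differences even when every point is kept.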
In the past, low-quality depth data were usually treated as an auxiliary to 2D FR. Most researchers focused on designing a feature extractor or network to obtain discriminative features complementary to the color images, while very few works considered how the data quality and feature representation of such low-quality data could be enhanced to improve FR accuracy. This is due to two reasons: (1) a database containing both high- and low-quality depth data is lacking; (2) a reasonable and quantitative evaluation of how depth data quality influences FR performance is underexplored. Note that the former essentially constrains the latter. Therefore, the purpose of our work is to solve these two problems and then to propose strategies that boost the performance of low-quality depth-based FR by improving data quality and feature representation.
In this paper, to address the data limitation, we extend our Multi-Dim database [21] to a large-scale face database called the Extended-Multi-Dim database, which consists of: (1) subjects' color images, (2) the corresponding low-quality depth images captured by RealSense, and (3) the corresponding high-quality 3D shapes captured by a 3D scanner. The data were captured under varying pose, illumination, and expression. We believe that such a database can boost research not only on depth-based FR but also on other face-related tasks, including RGB-D face recognition and 3D face reconstruction. The details of the database are introduced in Section 3.
Before this work, we carried out a related evaluation study, which was accepted at the CVPR 2019 Biometrics Workshop [22]. In [22], we investigated how depth data quality influences depth-based face recognition, focusing on two aspects: precision and resolution. We conducted the evaluation on high-quality depth images generated from existing datasets, including FRGC V2 [18], BU3D-FE [19], Lock3DFace [20], and RGBD-W [14], as well as part of the Extended-Multi-Dim database introduced in this paper. Several significant observations were obtained in [22], demonstrating that precision and resolution are indeed two important factors influencing the recognition accuracy of depth-based FR. Motivated by these observations, this paper further investigates how to improve the quality of low-quality depth data and the identity feature representation with the assistance of high-quality data.
As mentioned above, with the extended database and a reasonable evaluation in place, we can focus on improving the quality of low-quality depth data, which should in turn improve the performance of depth-based FR. Here, rather than enhancing data quality through data preprocessing as in [23,24], we aim to extract more discriminative identity features from low-quality depth face images using models that are guided by constraints from high-quality depth data during training. This is because preprocessing enhances data quality visually without necessarily preserving the identity information needed for recognition. In contrast, we focus on using the guidance of high-quality data to train a better model for low-quality depth-based FR. In this paper, three strategies are proposed in which the high-quality depth data participate in and guide the training of low-quality depth-based FR models: an image-based strategy, a feature-based strategy, and a fusion of the two.
The image-based strategy can be formulated as Equation (1), where $x^L$ and $x^H$ represent a pair of low- and high-quality depth images of the same person, $E^L$ represents a low-quality depth-based extractor, $G$ represents the image generator whose input is the identity feature of a low-quality image and whose output is a produced high-quality image, and $E^H$ is an extractor for generated or true high-quality images. In this scheme, the high-quality depth images guide the training of the low-quality depth-based models. Equation (2) formulates the feature-based strategy, where $E^H$ represents an identity feature extractor for high-quality depth images, and the other symbols have the same meanings as in Equation (1). In this scheme, the identity features of the high-quality images guide the model training. Finally, the fusion strategy means that both the high-quality depth image and its corresponding identity features guide the training of the low-quality depth-based FR model. The specific methods proposed under the three strategies are introduced in Section 4.
To sum up, the contributions of this paper are as follows:
- (1)
We present Extended-Multi-Dim, a large-scale, multi-modality database for FR. With 902 subjects, it is, to our knowledge, the largest public RGB-D face database, and it additionally provides high-quality 3D depth data.
- (2)
We adopt a series of preprocessing steps for the collected database, including labeling 51 landmarks on the 3D shapes and 5 landmarks on the RGB-D images.
- (3)
We design a standard experimental protocol for the collected database. Motivated by the conclusions of our previous evaluation, we propose methods based on three strategies that use the information of high-quality depth data to train a better network for low-quality depth-based FR. The results can serve as benchmarks for other researchers.
The rest of this paper is organized as follows. Section 2 reviews related work, including public databases and approaches. Section 3 introduces the Extended-Multi-Dim database in detail. Section 4 presents the proposed methods based on the three strategies. Section 5 reports the experimental results and the corresponding analysis of low-quality depth-based FR. Section 6 concludes the work.
3. Extended-Multi-Dim Database
As mentioned above, there is no large-scale public database containing both high- and low-quality depth data, which limits the development of depth-based FR, so we extended the Multi-Dim database into a multi-modality face database, namely the Extended-Multi-Dim database. To the best of our knowledge, it is currently the first public database containing color images, corresponding depth images captured by RealSense, and high-quality 3D face shapes scanned by a high-quality 3D scanner. Another motivation for creating this database is to support cross-modality FR, and with 902 subjects it is the largest database for RGB-D FR. Next, we introduce the proposed database in detail in terms of acquisition, data processing, and statistics.
3.1. Acquisition Details
When capturing RGB and low-quality depth data, an Intel RealSense SR300 was used; the low-quality 3D faces it captures have a resolution of 45K and a precision of mm. Rather than the released SDK, we used the librealsense dynamic link library [31] to capture RGB and depth videos simultaneously. The RealSense recorded videos of the subjects, which librealsense parsed into images during capture. To align the color faces and their corresponding depth faces, the capture rate was 22 frames per second, and the resolution of all the images is . The SCU 3D scanner [17] was used to scan the 3D faces, which have a resolution of 100K and a precision of mm. The data acquisition procedure is shown in Figure 2, which illustrates how the multi-modal data are recorded via the Intel RealSense SR300 camera and the SCU scanner. The extended database has two versions, captured in two different places. Version I consists of 228 subjects, and Version II has 705 subjects, with 31 subjects overlapping between the two versions. In previous work [22], when evaluating how depth quality influences depth-based FR, we first extended Multi-Dim to 228 subjects and captured RGB-D data with RealSense covering three expression variations and yaw-direction pose variation. Later, in Version II, we further increased the complexity of the pose and expression variations and enlarged the scale of the dataset to better simulate real scenes.
To comprehensively evaluate FR methods, and especially to simulate complex real-world conditions, volunteers capturing RGB-D data were required to present different expressions and poses under different illumination conditions, forming four categories: frontal neutral, expression, pose, and illumination. The four parts are introduced in detail as follows:
- (1)
The illumination variations are shown in Table 2.
- (2)
The volunteers were scanned in the frontal pose without any expression (referred to as NU for short) for a few seconds in both versions.
- (3)
In Version I, the subjects were asked to rotate their heads in the yaw direction (referred to as P1). In Version II, in addition to these actions, the subject's head was rotated clockwise and then in the reverse direction (referred to as P2).
- (4)
In Version I, the participants were asked to perform neutral, happy, and surprise expressions in the frontal pose, while in Version II, the volunteers were asked to perform eyebrow lifting, eye closing, mouth opening, nose wrinkling, and teeth baring (referred to as FE).
When scanning 3D shapes, the subjects only sat still about m from the 3D camera under natural light (all lamps off); no actions were needed. Table 3 displays the overall basic information of the database, and Figure 3 shows some visual examples.
3.2. Data Processing
After the original data were collected, we processed them, including labeling landmarks and aligning images for FR and other face-based tasks. Regarding the RGB-D data, faces and landmarks are difficult to detect in depth images with open-source methods such as MTCNN [32]; therefore, landmarks were detected on the aligned color images automatically using MTCNN, or marked manually if MTCNN failed. For the 3D shapes, we first used a commercial application, Geomagic Studio [33], to crop the face region manually; then, with the open-source tool CloudCompare [34], we manually marked 51 landmarks on the cropped shapes, whose resolutions range from 38K to 89K. Next, we used the 5 landmarks (left and right eye centers, left and right mouth corners, and nose tip) of the low-quality depth images and the corresponding five 3D landmarks to compute a transfer matrix, with which the cropped shapes were rotated to the requested pose. Finally, the rotated faces were projected onto 2D planes via weak perspective projection, resulting in high-quality depth images aligned to the low-quality depth images, which creates pairs of depth data of different quality for later training of FR models. Figure 4 shows the procedure of using the original shapes and the low-quality depth images with five landmarks to generate the corresponding high-quality depth images, and Figure 3 shows examples of aligned high- and low-quality depth images with different pose variations.
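The final projection step can be sketched as follows. This is a simplified pure-Python illustration (the transfer-matrix estimation from the five landmark pairs is omitted, and all coordinates are hypothetical): under weak perspective, every point shares one global scale, and each pixel of the resulting depth image keeps the nearest surface point that lands on it.

```python
def weak_perspective_depth(points, scale, tx, ty, width, height):
    """Project rotated 3D points (x, y, z) to a depth image.

    Weak perspective: u = scale*x + tx, v = scale*y + ty (one global scale).
    Each pixel stores the smallest z (closest surface) that lands on it;
    empty pixels stay at 0 (background).
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        u = int(round(scale * x + tx))
        v = int(round(scale * y + ty))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth

# Two points mapping to the same pixel: the nearer one (smaller z) wins.
pts = [(1.0, 1.0, 40.0), (1.01, 1.0, 35.0), (3.0, 2.0, 50.0)]
img = weak_perspective_depth(pts, scale=2.0, tx=0.0, ty=0.0, width=8, height=8)
print(img[2][2], img[4][6])  # 35.0 50.0
```

The z-buffer test (keeping the smallest depth per pixel) is what makes self-occluded parts of the rotated shape disappear correctly in the projected depth image.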
3.3. Statistics and Protocol
To allow other researchers to conveniently use the database and compare performance, we designed a standard experimental protocol for it. Table 3 presents the main statistics of the Extended-Multi-Dim database. In this paper, we focus on how depth data quality influences depth-based FR performance and how to improve the identification rate by enhancing the depth data quality. Therefore, the whole database is divided into two parts: a Training set and a Testing set. The former includes pairs of depth images for training FR models, while the latter is for the identification (1-to-N) FR task, so the Testing set consists of a Gallery and a Probe. Since we also care about how depth quality affects FR performance under different external challenges, including pose and facial expression variation, the probe is divided into four categories: NU, PS1, PS2, and FE. Details are given below:
- (1)
Training set: The training data all come from Version II; excluding the 31 subjects shared with Version I, Version II has 674 subjects, of which we randomly selected 430 as the training set. When training models, after shuffling the training images, the first 20% of the images are separated out as the validation set.
- (2)
The Testing set is divided into parts A and B, where the remaining 275 subjects in Version II make up Testing set A and all the data in Version I make up Testing set B. The specific division into galleries and probes for each experiment is given in Section 5.
- (3)
After parsing the original videos and cropping faces, there are about 299K, 80K, and 318K frames in total for the training, validation, and Testing sets, respectively. Owing to the huge amount of data, and especially the similarity between adjacent frames, at test time we select one frame out of every 10 in Testing set A and one out of every 6 in Testing set B.
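The frame selection in (3) amounts to simple stride-based subsampling; a minimal sketch (with hypothetical frame counts standing in for the real videos) is:

```python
def select_frames(frames, every):
    """Keep one frame out of every `every` consecutive frames."""
    return frames[::every]

# Hypothetical per-video frame index lists for the two Testing sets.
test_a_frames = list(range(100))
test_b_frames = list(range(60))

kept_a = select_frames(test_a_frames, 10)  # every 10th frame for Testing set A
kept_b = select_frames(test_b_frames, 6)   # every 6th frame for Testing set B
print(len(kept_a), len(kept_b))  # 10 10
```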
All subjects are Chinese. In Version I, 28.1% (64 of 228) are female and 71.9% (164 of 228) are male; in Version II, 43.3% (305 of 705) are female and 56.7% (400 of 705) are male. In addition, since the database was collected on a campus, the ages of the subjects range from 18 to 24 years.
4. Proposed Approaches
The purposes of this work are to further analyze the influence of depth data quality on depth-based FR, building on our previous work, and to improve recognition performance by enhancing the data quality and feature representation. Therefore, we propose three strategies: image-based, feature-based, and fusion-based. With the guidance of high-quality data, knowledge can be transferred to train a better low-quality depth-based FR model. In this section, we first present the proposed methods based on the different strategies in detail, and then introduce the backbone models used in the methods.
4.1. Image-Based Boosting Strategy
Figure 5 shows the workflow of the image-based boosting approach. The basic purpose of the strategy is to obtain the identity feature (ID Feature) of a low-quality depth image $x^L$ through a feature extractor $E^L$, which is a convolutional network parameterized by $\theta_E$. Generally, the ID Feature, the output of $E^L$, is used for the classification task with a cross-entropy loss.
To make the ID Feature more discriminative, we generate a fake depth face image $x^F$ from this ID Feature: the more similar the produced image is to the corresponding high-quality image, the more discriminative the ID Feature is. So, during training, a generator $G$ is applied to the ID Feature. The generator is a deconvolutional network parameterized by $\theta_G$ and trained with a reconstruction constraint. Following the experience of [35], we concatenate a random noise vector with the identity feature as the input of $G$; the noise models facial appearance variations other than identity or data quality.
In addition, if the generated fake images also preserve the identity information of the corresponding ground truth $x^H$, the probability distribution of the ID Feature becomes closer to that of the high-quality image, and the feature becomes more discriminative. We therefore take two measures: (1) as Equation (3) shows, after generating the images, we use a pretrained high-quality depth-based model to extract the identity features of the pair $x^F$ and $x^H$, and then use an L2 loss as a constraint to make the two features similar; (2) as in Equation (4), we directly add another extractor $E^H$ and another cross-entropy loss after $G$.
The network parameters $\theta_E$, $\theta_G$, and the parameters of $E^H$ are optimized by minimizing the aforementioned losses jointly. For a Training set with N training pairs $(x^L, x^H)$, the optimization problem minimizes a weighted combination of these losses, as given in Equation (3) or Equation (4), where the $\lambda$s are weighting parameters, the reconstruction term is defined as an L1 loss that constrains a produced image to be similar to the high-quality one, and the superscripts L, H, and F denote the low-quality, high-quality, and fake produced images, respectively. The feature constraint is a Euclidean distance (L2) loss. We postpone the detailed description of all the individual loss functions to Section 5.1.
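As a schematic sketch only (the exact formulation is given by Equations (3) and (4)), the weighted combination of a cross-entropy term, an L1 reconstruction term, and an L2 feature term can be written in pure Python as follows; the weights and toy inputs are hypothetical:

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample given its softmax probabilities."""
    return -math.log(probs[label])

def l1_loss(fake, real):
    """Pixel-wise L1 between a produced and a true high-quality image."""
    return sum(abs(f - r) for f, r in zip(fake, real)) / len(real)

def l2_loss(feat_a, feat_b):
    """Euclidean (L2) distance between two identity features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))

def image_based_loss(probs, label, fake_img, real_img,
                     fake_feat, real_feat, lam1=1.0, lam2=0.1):
    """Weighted sum guiding the low-quality extractor and the generator."""
    return (cross_entropy(probs, label)
            + lam1 * l1_loss(fake_img, real_img)
            + lam2 * l2_loss(fake_feat, real_feat))

# Identical fake/real image and features: only the CE term remains.
loss = image_based_loss([0.2, 0.8], 1, [1.0, 2.0], [1.0, 2.0],
                        [0.5, 0.5], [0.5, 0.5])
print(round(loss, 4))  # 0.2231
```

In the actual training, the images are 2D tensors and the features come from $E^L$ and the pretrained high-quality model, but the structure of the objective is the same weighted sum.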
4.2. Feature-Based Boosting Strategy
Figure 6 shows the workflow of the feature-based boosting approach. The basic purpose of the strategy is the same as that of the image-based boosting strategy. In this strategy, we aim to transfer the knowledge of the high-quality depth-based extractor $E^H$ to learn a corresponding low-quality extractor $E^L$. We expect $E^L$ to extract ID Features whose probability distribution is similar to that of the high-quality images. Inspired by ideas from transfer learning [36], we use both direct and indirect constraints to make the two probability distributions similar.
Generally, the last two outputs of a deep FR or classification model are the logits and the ID Feature; here we denote the final score output as Z. The extractors $E^L$ and $E^H$ have the same structure, and $E^H$ is first pretrained on high-quality data. When training $E^L$, as shown in Figure 6, the input is a pair of images $(x^L, x^H)$, and the knowledge of the pretrained model is transferred through several losses.
For the direct constraint, formulated by Equation (5), we treat the features from the two models as two distributions and use the multi-kernel maximum mean discrepancy (MK-MMD) loss, which is widely used in transfer learning [37], to make the two features similar.
Regarding the indirect constraints, we adopt two methods: (1) formulated by Equation (6), we apply the MK-MMD loss to the marginal distribution (Z) and the conditional distribution of the two models as hints that encourage the features to be similar; (2) based on feature space transformation, as Equation (7) shows, we transform the ID Feature from the low-quality model into the high-quality feature space with a simple converter $T$ consisting of two fully connected layers with ELU, and then add an L2 loss on the two features. Finally, the parameters of $E^L$ are optimized by minimizing the overall loss in Equation (5), (6), or (7).
Equations (5)–(7) represent the losses for the direct and indirect constraints, respectively, where $f$ represents the ID Feature, the $\lambda$s are weighting parameters, $T$ is the feature converter, and the subscripts L and H denote vectors from the low- and high-quality models. We postpone the detailed description of all the individual loss functions to Section 5.1. MMD is widely used as a distribution distance to measure the discrepancy between two domains; it compares the distributions in a Reproducing Kernel Hilbert Space (RKHS) [38]. MMD can be formulated as:

$$\mathrm{MMD}^2(X, Y) = \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{M}\sum_{j=1}^{M}\phi(y_j) \right\|_{\mathcal{H}}^{2}$$
In Equation (8), $\phi$ is an explicit mapping function into the RKHS, and $x_i$ and $y_j$ represent samples from the distributions of the high- and low-quality models. N and M are the total numbers of samples; in our experiments they are equal. Expanding Equation (8), it can be reformulated as:

$$\mathrm{MMD}^2(X, Y) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} k(x_i, x_{i'}) + \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M} k(y_j, y_{j'}) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} k(x_i, y_j)$$

From Equation (9), we can see that the loss uses the kernel trick to project the sample vectors into a higher-dimensional space. In our experiments, we choose the Gaussian RBF kernel, which is considered a universal approximator, with kernel function $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, where $\sigma$ is the bandwidth.
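A single-kernel sketch of this loss in pure Python follows the expanded Equation (9) directly (the MK-MMD used in our methods averages several Gaussian bandwidths; the bandwidth value here is a hypothetical choice):

```python
import math

def rbf(x, y, bandwidth):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Squared MMD between sample sets xs, ys via the expanded form:
    (1/N^2) sum k(x,x') + (1/M^2) sum k(y,y') - (2/NM) sum k(x,y)."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

# Identical sample sets give zero discrepancy; well-separated sets give
# a clearly positive value.
same = mmd_squared([[0.0], [1.0]], [[0.0], [1.0]])
diff = mmd_squared([[0.0], [1.0]], [[5.0], [6.0]])
print(round(same, 6), diff > 0.1)  # 0.0 True
```

Minimizing this quantity over the low-quality features (with the high-quality features fixed) pulls the two feature distributions together in the RKHS.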
4.3. Fusion-Based Boosting Strategy
In this part, the main idea is to use the information of both the high-quality image and its feature to guide the training of the low-quality depth-based model simultaneously. Concretely, we combine losses of the image-based and feature-based methods under a simple principle: the combination should noticeably improve FR accuracy. Therefore, according to the results of the image-based and feature-based methods, we select those with good performance and combine them.
Accordingly, with the simple aim of combining the outstanding methods from the two strategies to achieve the best performance, we finally combine three groups: (1) the methods represented by Equations (4) and (6); (2) the methods represented by Equations (3) and (6); (3) the methods represented by Equations (4) and (7). Here, before adding constraints between two identity features, normalization is applied.
In all combinations, the feature extractor $E^L$ is shared by the image-based and feature-based boosting strategies to produce the identity feature for matching, and the other components of the image-based and feature-based boosting modules are fused to guide $E^L$ to extract more discriminative features.
4.4. Backbone Models
In our experiments, the base network serves two functions: (1) the performance of the models trained directly on it serves as the baseline for the models trained with the other strategies; (2) the network structure is used as a building block assembled into the overall structures of the proposed methods.
Here, two deep face recognition models, CASIA-Net [39] and ResNet [40], are considered as base networks. Both are relatively light-weight models. This enables us not only to assemble the overall structures easily but also to train them from scratch using the relatively small facial depth image datasets available to us. Therefore, we do not employ complex or very deep models such as VGG [41] and GoogLeNet [42].
For CASIA-Net, motivated by [35], we add batch normalization [43] and an exponential linear unit [44] after each convolutional layer. The input image size is changed from to , and the 320-dimensional output of the layer is taken as the extracted feature.
For ResNet, we employ ResNet-18 as defined in [40]. Its input image size is changed from to , and we also add batch normalization and an exponential linear unit after each convolutional layer. Finally, the 512-dimensional output of is taken as the extracted feature.
In the experiments, either of the two networks is used as the feature extractor, and the symmetric structure of CASIA-Net is employed as the generator in all image-based schemes. Table 4 shows the specific structures of the networks. For all deep models, cosine similarity is employed to measure the similarity between the extracted features of different facial depth images.
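The matching step can be sketched as follows (the feature dimensions and gallery entries are hypothetical stand-ins for the 320- or 512-dimensional features): each probe feature is compared against every gallery feature by cosine similarity, and the best-scoring identity is returned.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(probe_feat, gallery):
    """Rank-1 identification: return the gallery ID with highest similarity."""
    return max(gallery, key=lambda gid: cosine_similarity(probe_feat, gallery[gid]))

# Toy 2-D gallery; real gallery features would come from the extractor.
gallery = {"subj_a": [1.0, 0.0], "subj_b": [0.0, 1.0]}
print(identify([0.9, 0.1], gallery))  # subj_a
```

Because cosine similarity ignores vector magnitude, the comparison depends only on the direction of the extracted features, which matches the normalization applied before the feature constraints in the fusion strategy.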