Article

Tightly-Coupled Data Compression for Efficient Face Alignment

1 College of Mechanical Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
2 Suzhou Key Laboratory of Precision and Efficient Machining Technology, Suzhou 215009, China
3 Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2018, 8(11), 2284; https://doi.org/10.3390/app8112284
Submission received: 12 September 2018 / Revised: 2 November 2018 / Accepted: 15 November 2018 / Published: 19 November 2018

Featured Application

The proposed method in this paper is suitable for resource restricted environments such as mobile face alignment applications.

Abstract

Face alignment is a key component of applications such as face and expression recognition and face-based AR (Augmented Reality). Among existing algorithms, cascaded-regression-based methods have become popular in recent years for their low computational cost and satisfactory performance in uncontrolled environments. However, the trained models of cascaded-regression-based methods are large, which makes them difficult to apply in resource-restricted scenarios such as mobile phone applications. In this paper, a data compression method for the trained model of the supervised descent method (SDM) is proposed. First, based on the distribution of the model data estimated with a non-parametric method, a K-means-based data quantization algorithm with probability-density-aware initialization is proposed to quantize the model data efficiently. Then, a tightly-coupled SDM training algorithm is proposed so that the training process reduces the errors caused by data quantization. Quantitative experimental results show that our proposed method compresses the trained model to less than 19% of its original size with very similar feature localization performance. The proposed method opens the gates to efficient mobile face alignment applications based on SDM.


1. Introduction

Face alignment is an important part of facial image analysis. It automatically localizes facial feature points such as eyes, nose, eyebrows, mouth, etc. from a face image. It plays an important role in popular applications such as face recognition [1,2], attribute computing [3,4], and expression recognition [5]. Face alignment technology is usually applied to get some anchor points for affine warping so that the face recognition procedure is robust against pose variations. In [3], facial landmarks were used to outline a fetus’s face. In [4], facial landmarks helped to localize and represent salient regions of the face. Figure 1 shows an example of a face alignment algorithm with the supervised descent method (SDM) [6].
According to a recent survey [7], face alignment methods can be divided into two categories: generative methods and discriminative methods.
Generative methods explicitly construct generative models for the shape and/or appearance of the face. Feature locations are derived from the best fit of the model to the test image. Cootes et al. proposed the well-known active shape models (ASM), which fit a model for every facial part separately [8]. In [9], the authors proposed the Gauss-Newton Deformable Part Model (GN-DPM), which constructs generative models for all facial parts simultaneously. The classical active appearance models (AAM) [10], which combine a shape model, an appearance model, and a motion model, also belong to this category. A drawback of AAM is that it is not robust against occlusions.
Different from generative methods, discriminative methods aim to estimate the mapping between facial appearance and feature locations directly. Constrained local models (CLM) learn an independent local detector for each feature point [13]; a shape model is then used to regularize these local detectors. In contrast to CLM, cascaded regression methods directly learn a vectorial regression function to calculate the face shape stage by stage. Explicit shape regression (ESR) [14], a two-level boosted regression framework, was one of the first algorithms in this category. Burgos-Artizzu et al. introduced occlusion information into the regression process [15] in order to improve robustness. Kazemi and Josephine proposed using regression trees instead of random ferns and achieved super-fast speed [16]. Besides the above-mentioned two-level boosted regression frameworks, Xiong and De la Torre presented a cascaded linear regression method with hand-crafted features [6]. The contribution of [6] is a provable supervised descent method (SDM). The authors also extended SDM to Global SDM in order to cope with the problem of conflicting gradient directions [17]. SDM is a popular face alignment method, especially for resource-restricted applications, since it achieves state-of-the-art results in real 2D scenarios while retaining real-time performance.
With the advent of the deep learning era, deep neural networks have been successfully applied in many computer vision tasks in recent years. Sun et al. were the first to use a deep convolutional network cascade for face alignment [18]. Reference [19] proposed a recurrent neural network approach. Recently, 3D face alignment has been achieved by fitting a 3D Morphable Model (3DMM) with convolutional neural networks (CNN) [20,21]. Reference [22] proposed a 3D face alignment network (3D-FAN) by stacking four hourglass networks. Reference [23] proposed a two-stage method built on a deep residual network [24]: heat-maps of 2D landmarks are first calculated using convolutional part heat-map regression, and these heat-maps, together with the original RGB image, are then used to regress the depth information with a very deep residual network. Although deep learning methods, especially 3D ones, perform better on images with large head poses than traditional methods do, they are not easily applied on mobile platforms, for the following reasons: (1) a deep learning model is usually on the order of 100 MB, which is too big for mobile applications; (2) the computational cost is still quite high, and real-time performance can hardly be achieved on mobile platforms; (3) deep learning models need huge amounts of training data, which are not easy to collect for the face alignment task; and (4) the training process of deep learning models is tricky without open source implementations from the authors.
As mentioned above, SDM achieves satisfactory results at a relatively low computational cost. However, the trained model for SDM can exceed 80 MB, which is still too large for a commercial mobile application. Unfortunately, traditional lossless compression technology such as entropy coding [25] cannot achieve a compression rate high enough for mobile applications. Lossy compression technology is widely used in video encoding. State-of-the-art methods such as HEVC (High Efficiency Video Coding) reach very high compression rates with good visual quality [26]. Unfortunately, this kind of technology depends heavily on block motion estimation between consecutive frames in the time domain, which is obviously unavailable in our work.
Recently, research on deep learning network compression has gradually emerged. Reference [27] proposed new kinds of convolutional operations to reduce parameters. Reference [28] compressed the network by pruning unimportant filters according to weight analysis. References [29,30] converted the weights to binary values in order to reduce the size of the model. Instead of binary values, Zhu et al. proposed a method that reduces the precision of the weights to ternary values [31], which avoids most of the accuracy degradation. Howard et al. proposed a depth-wise separable convolution architecture so that traditional 3D convolutions can be decomposed into 2D convolutions [32], which makes it suitable for mobile applications. Based on [32], shortcut connections were introduced in [33]; furthermore, linear bottlenecks were used instead of ReLU [34] in order to preserve the features. Most of the above methods target specific tasks or network structures, such as image classification and image segmentation. To the best of our knowledge, there is no compression architecture that can be applied directly to face alignment networks.
In this paper, a tightly-coupled data compression method for the trained model of the supervised descent method (SDM) is proposed; it reduces the model to less than 1/5 of its original size without obvious performance loss. This method opens the gates to mobile applications using SDM-based face alignment technology.
The remainder of the paper is organized as follows: Section 2 briefly describes the main procedure of the SDM algorithm [6] for the sake of completeness and clarity; Section 3 explains our proposed method in detail; Section 4 demonstrates some detailed algorithm implementations in the proposed method; Section 5 shows both qualitative and quantitative experimental results; Section 6 draws conclusions.

2. Basics of SDM

This section briefly introduces the main workflow of SDM. Please refer to [6] for details.
We assume that the face feature points are represented by N 2D landmarks s = [x1, y1, …, xN, yN]T. Usually, N = 68. Figure 2 shows the definition of the 68 face landmarks. Given a face image I and the initial 2D landmarks s0 estimated from the detected face region, our aim is to find a series of regressors:
R = r1 ∙∙∙ rD
where rd = {Ad, bd}(d = 1 ∙∙∙ D), Ad is the projection matrix that can also be called the descent direction and bd is the bias term. In SDM, D is usually chosen between 4 and 6, and the estimated 2D landmarks at dth step are calculated according to the following equation:
sd = sd−1 + Adf(I, sd−1) + bd
f(I, sd−1) are the shape-related features, which can be SIFT (Scale-Invariant Feature Transform) features [35] or HOG (Histograms of Oriented Gradients) features [36] for better performance [37]. These features are calculated at the landmarks sd−1 in image I. The final estimated facial landmarks are sD.
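In code, the cascade of Equation (1) amounts to a short loop. A minimal Python sketch follows; the names `extract` and `regressors` are ours, not the paper's, and `extract(s)` stands in for the feature extractor f(I, s):

```python
import numpy as np

def sdm_align(extract, regressors, s0):
    """Run the SDM cascade: s_d = s_{d-1} + A_d f(I, s_{d-1}) + b_d.

    `extract(s)` is a stand-in for f(I, s) (e.g. HOG at the current
    landmarks); `regressors` is the list of (A_d, b_d) pairs.
    """
    s = np.asarray(s0, dtype=float).copy()
    for A, b in regressors:
        s = s + A @ extract(s) + b
    return s
```

With D regressors, this is D feature extractions and D matrix-vector products, which is what makes SDM cheap at test time.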
Given M training images, SDM aims to minimize a series of the following equations and get rd sequentially:
$$\arg\min_{A_d, b_d} \sum_{i=1}^{M} \left\| \Delta s_{d-1}^{i} - A_d f\!\left(I^{i}, s_{d-1}^{i}\right) - b_d \right\|^{2}$$
where $\Delta s_{d}^{i}$ is the shape residual of the ith training image at the dth regression step:
$$\Delta s_{d}^{i} = s_{*}^{i} - s_{d}^{i}$$
$s_{*}^{i}$ are the ground truth landmark locations of the ith training image. Equation (2) is a standard linear least-squares problem and can be solved in closed form.
Figure 3 shows the flow chart of the SDM training algorithm.
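The closed-form solution of each step's least-squares problem can be sketched with NumPy's least-squares solver. The row-stacking convention and the helper name `train_step` are our own, not the paper's:

```python
import numpy as np

def train_step(F, dS):
    """Closed-form solution of one regression step (Eq. 2, sketch).

    F:  M x K matrix, row i = f(I^i, s_{d-1}^i)
    dS: M x 2N matrix, row i = shape residual of training image i
    Returns A_d (2N x K) and b_d (2N,) minimising ||dS - F A_d^T - b_d||^2.
    """
    Fa = np.hstack([F, np.ones((F.shape[0], 1))])  # absorb the bias term b_d
    W, *_ = np.linalg.lstsq(Fa, dS, rcond=None)    # (K+1) x 2N
    return W[:-1].T, W[-1]
```

Appending a column of ones lets a single `lstsq` call recover both the projection matrix and the bias term.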

3. Tightly-Coupled Data Compression Algorithm

As shown in Section 2, the key components of SDM are the regressors rd for all D steps, so the trained model stores all the information of rd. Since the features calculated at every landmark are concatenated into one large feature vector, the feature dimension can easily reach 27,200 when HoG features are used. The dimension of Ad is therefore 136 × 27,200, and bd is a 136-dimensional vector. Typically, each component of Ad and bd is represented by a single-precision floating point number. Table 1 shows the data ranges of Ad and bd for a typical HoG-based 6-step regressor.
From the table it can be concluded that the data ranges vary across regression steps, so in this paper the data are compressed separately for each step. Furthermore, the data range of Ad is very different from that of bd, and bd contains only 136 floating point numbers per step (about 3 KB for all 6 steps), which is much smaller than Ad. Therefore, only the data in Ad are compressed.
Since the data distribution of Ad cannot be described by a parametric model, a non-parametric method is applied to estimate it. The number of elements in Ad is large, so for the sake of computational efficiency the Parzen window method [39] is used here. Assuming that the number of elements in Ad is T and the window size is h, the probability density function (PDF) of an element x in Ad can be estimated through the following equation:
$$p(x) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{h}\, \phi\!\left(\frac{x - x_{i}}{h}\right)$$
where $\phi(x)$ is the square window function
$$\phi(x) = \begin{cases} 1 & |x| \le 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Figure 4 shows the estimated PDF of the elements in A1; the shapes of the PDFs in the other steps are similar. From the figure it can be concluded that most of the values in Ad concentrate around 0, and the distribution shows a long-tail effect. As a result, uniform quantization is inappropriate for compressing these data. In this paper, a K-means based data quantization algorithm with a probability density-aware initialization technique is proposed to cope with these difficulties.
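With the square window of Equation (5), the Parzen estimate of Equation (4) reduces to counting the samples that fall within h/2 of the query point. A minimal sketch:

```python
import numpy as np

def parzen_pdf(x, samples, h):
    """Parzen-window density estimate with a square window (Eqs. 4-5).

    p(x) = (1/T) * sum_i (1/h) * phi((x - x_i)/h), with phi = 1 iff |.| <= 0.5,
    i.e. the fraction of samples within h/2 of x, divided by h.
    """
    samples = np.asarray(samples, dtype=float)
    inside = np.abs((x - samples) / h) <= 0.5
    return inside.sum() / (samples.size * h)
```

Evaluating this on a grid over [Vmin, Vmax] produces curves like the one in Figure 4.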

3.1. K-Means Based Data Quantization with Probability Density-Aware Initialization

The basic idea of our proposed data compression algorithm is to quantize all the elements of Ad into Q predefined values so that each element can be represented with fewer bits. The optimal Q predefined values can be calculated through minimizing the following equation:
$$E = \sum_{k=1}^{Q} \sum_{i=1}^{T} \mathbf{1}\!\left(A_d(i) \in C_k\right) \left\| A_d(i) - q_k \right\|_{2}^{2}$$
where 1(∙) is the characteristic function, T is the number of elements in Ad, Ad(i) is the ith element of Ad, C = {C1, C2, …, CQ} divides the data of Ad into Q disjoint clusters, and qk is the representative scalar of cluster k. The minimization problem of Equation (6) is NP-hard [40]; the K-means algorithm [41] can be applied to obtain an approximate solution. However, the performance of the K-means algorithm depends heavily on its initialization. As shown in Figure 4, the data distribution of Ad has a single peak. Traditional random initialization for the K-means algorithm cannot capture this characteristic, so the acquired result is far from optimal.
In this paper, a probability density-aware initialization method is proposed: the initial quantization step size (cluster size) is made inversely proportional to the probability density of the data, so that the data distribution of Ad is fully captured. Thus, we have
$$\int_{v_{k-1}}^{v_{k}} p(x)\,dx = \mathrm{const.} = \frac{1}{Q} \quad (k = 1 \cdots Q)$$
where vk−1 and vk are lower and upper bounds of the kth quantization step respectively. So the optimal initialization strategy can be estimated through the following equation:
$$\arg\min_{v_{1} \cdots v_{Q-1}} \sum_{k=1}^{Q} \left( \int_{v_{k-1}}^{v_{k}} p(x)\,dx - \frac{1}{Q} \right)^{2} \quad \text{s.t.}\; v_{0} = V_{\min},\; v_{Q} = V_{\max}$$
where Vmin and Vmax are the minimum and maximum values of Ad, as shown in Table 1. Unfortunately, an exact solution of the above equation is difficult to obtain. Since only a reasonable initialization for the K-means algorithm is needed, an approximate solution is proposed as follows.
All the elements in Ad are sorted in ascending order and stored in an array SA. The lowest and highest quantization step bounds can be calculated with the following equation:
v0 = SA(1), vQ = SA(T)
The remaining quantization step bounds are calculated with Equations (10) and (11) as follows:
$$idx = \left\lfloor \frac{kT}{Q} \right\rfloor$$
vk = SA(idx) (k = 1 … Q − 1)
where ⌊∙⌋ is the floor function, which rounds its argument to the nearest integer towards minus infinity.
With the above algorithm, the number of elements between consecutive quantization step bounds is nearly the same, so Equation (8) is approximately solved. Most importantly, all the quantization step bounds can be efficiently estimated. However, we do not choose the mid-value between the lower and upper bounds as the initial value for the K-means algorithm, since the data distribution inside each quantized region is not uniform. Instead, the initial value is set to the mean of all the elements that fall in the same quantized region, as follows. First, the set of all elements belonging to the kth quantization step is calculated:
$$U_{k} = \left\{ A_d(i) \,\middle|\, v_{k-1} < A_d(i) \le v_{k},\; 1 \le i \le T \right\}$$
Then, the initial value of the kth cluster center for the K-means algorithm can be calculated as follows:
$$\mu_{k} = \frac{1}{\left| U_{k} \right|} \sum_{j=1}^{\left| U_{k} \right|} U_{k}(j)$$
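The initialization of Equations (9)-(13) can be sketched in a few lines: equal-population bin bounds are taken from the sorted data, and each bin's initial centre is the mean of its members. The function name is ours:

```python
import numpy as np

def density_aware_init(Ad, Q):
    """Probability-density-aware K-means initialisation (sketch of Eqs. 9-13).

    Quantization step bounds are placed so that each of the Q bins holds
    roughly T/Q of the sorted elements (Eqs. 9-11); the initial centre of
    each bin is the mean of the elements inside it (Eqs. 12-13).
    """
    sa = np.sort(np.ravel(Ad))                        # SA, ascending order
    T = sa.size
    cuts = [0] + [(k * T) // Q for k in range(1, Q)] + [T]   # Eq. (10), floor
    return np.array([sa[cuts[k]:cuts[k + 1]].mean() for k in range(Q)])
```

Because each bin holds the same number of samples, dense regions of the PDF automatically receive narrower bins, matching Equation (7).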
With these initial values, the K-means algorithm [41] can be applied so that the optimal clusters’ centers qk (k = 1 ∙∙∙ Q) with regard to Equation (6) are estimated. Thus, the quantized value AQd(i) (i = 1 ∙∙∙ T) that corresponds to each element Ad(i) (i = 1 ∙∙∙ T) can be calculated with the following equations:
$$idx_{i} = \arg\min_{k \in \{1, 2, \ldots, Q\}} \left\| A_d(i) - q_{k} \right\|_{2}$$
$$AQ_{d}(i) = q_{idx_{i}}$$
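Given the final cluster centres qk, Equations (14)-(15) are a nearest-centre lookup. A sketch (the function name is ours):

```python
import numpy as np

def quantize(Ad, centers):
    """Replace each element of A_d by its nearest cluster centre (Eqs. 14-15).

    Returns the index matrix (what is actually stored, log2(Q) bits per
    element) and the reconstructed quantized matrix AQ_d.
    """
    flat = np.ravel(Ad)
    idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)   # Eq. (14)
    return idx.reshape(np.shape(Ad)), centers[idx].reshape(np.shape(Ad))  # Eq. (15)
```

Only the small table of centres and the index matrix need to be stored; the quantized projection matrix is reconstructed on load.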

3.2. Tightly-Coupled Training Algorithm

If the final learned Ad is quantized directly as described in the previous section, the quantization process introduces extra errors into the feature localization results. In order to reduce these errors, we modified the traditional SDM training algorithm described in Section 2 and propose the tightly-coupled training algorithm shown in Figure 5.
In this algorithm, the data quantization process is coupled with the training process (steps 4 and 5), so that the errors caused by quantization in one step are propagated into the next regression step. As a result, the projection matrix of the next step can partially correct the errors introduced by quantization, and the final results are improved.
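The loop structure can be sketched as follows. This is our reading of Figure 5, not the paper's code: the quantized matrix AQd, rather than the raw Ad, is used to advance the training shapes, so step d+1 is fitted against the residuals that remain after quantization. `extract`, `train_step`, and `quantize` are hypothetical helpers standing in for Sections 2 and 3.1:

```python
import numpy as np

def tightly_coupled_train(images, gt, init, D, extract, train_step, quantize):
    """Sketch of the tightly-coupled SDM training loop (Figure 5, our reading).

    Returns the list of (AQ_d, b_d) regressors and the final training shapes.
    """
    regressors, shapes = [], [np.array(s, dtype=float) for s in init]
    for _ in range(D):
        F = np.array([extract(I, s) for I, s in zip(images, shapes)])
        A, b = train_step(F, np.array(gt) - np.array(shapes))   # Eq. (2)
        AQ = quantize(A)                                        # Section 3.1
        # propagate the QUANTIZED model, so the next step sees its errors
        shapes = [s + AQ @ f + b for s, f in zip(shapes, F)]
        regressors.append((AQ, b))
    return regressors, shapes
```

If `quantize` were the identity, this would reduce to the standard SDM training of Section 2.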

3.3. Compressed Model Data Storage Arrangement

The final compressed model consists of the stacked version of all D steps of regressors rd. Each regressor rd consists of three parts: Q quantized values for the projection matrix; the quantized projection matrix AQd that corresponds to Ad; the bias term bd.
The quantized values are stored with single-precision floating point numbers. There are Q single-precision floating point numbers for each regression step.
The quantized projection matrix AQd has the same dimensions as Ad. Each element of AQd is an index into the table of quantized values described above. Since there are Q distinct quantized values, each element of AQd occupies only log2Q bits, usually far fewer than a 32-bit floating point number. Through AQd, the corresponding quantized value can be fetched by index and the approximate projection matrix reconstructed; in this way, data compression is achieved. Throughout this paper we chose Q = 64, which is justified in the experimental results section.
The bias term bd is stored directly as a floating point number since its size is relatively small, as stated before. Figure 6 illustrates the diagram of the compressed data storage arrangement.
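Storing 6-bit indices compactly can be illustrated with NumPy's bit routines. This is a sketch of one possible byte layout under our own conventions; the paper's exact arrangement is the one in Figure 6:

```python
import numpy as np

def pack_indices(idx, bits=6):
    """Pack small integer indices into a byte stream at `bits` bits each."""
    idx = np.asarray(idx, dtype=np.uint8)
    rows = np.unpackbits(idx[:, None], axis=1)[:, 8 - bits:]  # keep low bits
    return np.packbits(rows.ravel())                          # zero-padded tail

def unpack_indices(packed, n, bits=6):
    """Inverse of pack_indices for `n` stored indices."""
    rows = np.unpackbits(packed)[: n * bits].reshape(n, bits)
    pad = np.zeros((n, 8 - bits), dtype=np.uint8)
    return np.packbits(np.hstack([pad, rows]), axis=1).ravel()
```

At 6 bits per element instead of 32, the index matrix costs 6/32 of the original storage, which is where the roughly 18.75% figure of Section 5.1 comes from.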

4. Methodology

In this section, implementation details about the algorithm proposed in Section 3.1 are described. Algorithm 1 demonstrates the pseudo codes of the approximate algorithm for solving Equation (8).
Algorithm 1. Approximate algorithm for solving Equation (8).
Input: Projection matrix Ad at the dth step; number of elements T in Ad; number of quantization levels Q.
Output: quantization step bounds v0 ∙∙∙ vQ
  • Sort all the elements of Ad in ascending order and store them in an array SA with quick sort [42].
  • Calculate v0 and vQ with Equation (9).
  • For k = 1 to Q − 1,
  •  Calculate idx with Equation (10).
  •  Get vk with Equation (11).
  end for
Algorithm 2 shows the pseudo codes for the whole procedure of the proposed data quantization algorithm.
Algorithm 2. The proposed data quantization algorithm.
Input: Projection matrix Ad (d = 1 ∙∙∙ D) for all D steps; number of quantization levels Q.
Output: Quantized projection matrix AQd (d = 1 ∙∙∙ D) for all D steps.
For d = 1 to D,
  • Calculate the optimal quantization step bounds v0 ∙∙∙ vQ according to Ad with Algorithm 1.
  • Initialize cluster centers μk (k = 1 ∙∙∙ Q) for the K-means algorithm with Equations (12) and (13).
  • Minimize Equation (6) with the K-means algorithm [41] and get the optimal clusters’ centers qk (k = 1 ∙∙∙ Q).
  • For each element Ad(i) (i = 1 ∙∙∙ T), its corresponding quantized value AQd(i) (i = 1 ∙∙∙ T) can be calculated with Equations (14) and (15).
end for

5. Results

In this section, our method was compared against the standard SDM and a deep learning based method [22] on the 300 W dataset [12]. This publicly available, challenging dataset consists of 600 indoor and outdoor in-the-wild images. It covers a large variation of identity, expression, illumination conditions, pose, occlusion, and face size. Each image has ground truth locations in the 68-point configuration [43]. The open source implementation of standard SDM by Patrik Huber [44] and the authors' implementation [45] of [22] were used in this paper. Since the SDM based method does not use any 3D information, only the 2D face alignment network (2D-FAN) version of the deep learning based method [22] was tested, for the sake of fairness.

5.1. The Choice of Q

In this section, the face alignment accuracy was evaluated according to the average distance between the detected landmarks and the ground truth, normalized by the inter-ocular distance as proposed in [46]. The number of bits used for each element in AQd was varied and their corresponding normalized mean error loss against standard SDM algorithm [6] was calculated. Figure 7 shows the result.
From the figure it can be concluded that when each element in AQd used 6 or more bits, the loss was small and decreased smoothly; with fewer than 6 bits, the error loss increased rapidly. In this paper we chose 6 bits, balancing error loss against compression efficiency. This means Q = 26 = 64.
In this paper, HoG features were utilized and 68 feature points were detected. The number of regression steps was 6. The dimensions of AQd and Ad were both 136 × 27,200, as mentioned before. The dimension of bias vector bd was 136. For standard SDM, all the elements were represented by 32-bit single precision floating point numbers. Therefore, the total space needed for the training model was
(136 × 27,200 + 136) × 4 × 6 = 88,784,064 Bytes
With our proposed method, we needed to store 64 single precision floating point numbers for quantized values, 136 × 27,200 6-bit matrix AQd and 136-dimensional single precision floating point vector bd for each regression step. Therefore, our storage consumption for the whole training data was
(64 × 4 + 136 × 27,200 × 6/8 + 136 × 4) × 6 = 16,651,200 Bytes
Our model size was only about 18.75% of the standard one, i.e., the compression rate of our proposed method was about 5.3×. Furthermore, when entropy coding [25] was applied to our compressed data, about 10% more compression was usually obtained.
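The storage arithmetic above can be replayed directly:

```python
# Model sizes from Section 5.1, in bytes (Q = 64, D = 6, HOG features)
full = (136 * 27200 + 136) * 4 * 6            # 32-bit floats everywhere
compressed = (64 * 4                          # Q quantized values per step
              + 136 * 27200 * 6 // 8          # 6-bit index matrix AQ_d
              + 136 * 4) * 6                  # bias vector b_d
print(full, compressed, compressed / full)    # ratio is approximately 0.1875
```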

5.2. Qualitative Experimental Results

In this section our feature localization results were compared against the standard SDM with an uncompressed training model [6] and 2D-FAN [22]. Figure 8 shows the results on the 300 W dataset [12]. The left column shows the results with our compressed training model, the center column the results of the SDM algorithm with an uncompressed training model [6], and the right column the results of 2D-FAN [22]. Red lines in all images depict ground truth feature locations. From this figure it can be concluded that our compressed training model generates results very similar to those of its uncompressed counterpart; both fit the ground truth very well, even in occlusion scenarios. The deep learning based method [22] performs slightly better, especially in images with large head poses such as the first and last rows. However, the model size of 2D-FAN is about 182 MB [45], which is obviously unsuitable for mobile applications. The accuracy of our proposed method is sufficient for mobile applications such as virtual add-ons, as verified in Section 5.6.

5.3. Quantitative Experimental Results

As shown in Figure 2, the face features can be divided into five parts: face contour, eyebrows, eyes, nose, and mouth. The face contour contains points No. 1 to No. 17, the eyebrows points No. 18 to No. 27, the eyes points No. 37 to No. 48, the nose points No. 28 to No. 36, and the mouth points No. 49 to No. 68. The average normalized mean errors of the five parts on the 300 W test dataset were estimated, comparing our compressed training model against the uncompressed training model and the deep learning based method. The results are shown in Figure 9. From the figure it can be concluded that our compressed training model achieves feature localization very close to that of the uncompressed training model. The deep learning based method performs slightly better; however, the differences in normalized mean error between our proposed method and the deep learning based method are below 1% for all five parts, which is acceptable considering the high computational cost and memory usage of the deep learning based method.
Similar to the work in [22], a subset was chosen from the 300 W test dataset whose yaw angles are between 0 and 30 degrees. Experiments on this subset with the above three methods were conducted. The results are shown in Figure 10. From this figure, we can find that for moderate head poses, which are the typical scenarios for mobile applications, all three methods can achieve lower normalized errors and generate very similar results, especially for eyes and face contour parts. These two parts are very important for AR (Augmented Reality) based applications. The differences of the normalized mean errors between our proposed method and deep learning based method were less than 0.6% for most parts and even achieved 0.3% for the eyes regions. This proves the effectiveness of our proposed algorithm.
Figure 11 shows the cumulative error distribution curve of our proposed method and the uncompressed training model SDM [6]. It is obvious that the two curves are very close to each other. This again confirms the similar performance of both methods despite our training model being much smaller.

5.4. Ablation Study

5.4.1. Effect of the Tightly-Coupled Training Algorithm

The effect of the tightly-coupled training algorithm is analyzed in this section. We compared our proposed method with the variant without the tightly-coupled training algorithm, i.e., with the quantization results of Section 3.1 applied directly. The results are shown in Figure 12. From the figure it can be concluded that, without the tightly-coupled training algorithm, the normalized mean error for each feature point increased by more than 2.5%. Considering that the average normalized mean error was about 3.5%, this is a large increase in localization error, which proves that the coupling process successfully reduces the errors caused by the data quantization step.

5.4.2. Effect of Probability Density-Aware Initialization

The effectiveness of our proposed probability density-aware initialization technique (Section 3.1) is investigated in this section. Our method was compared with standard random initialization for the K-means algorithm, i.e., Q elements were randomly chosen from Ad and set as the initial cluster centers instead of the μk calculated according to Equation (13). We repeated this 100 times and calculated the averages and standard deviations of the normalized mean errors for the five parts of the face. The results are shown in Table 2. From the table it can be found that the normalized mean errors increased by more than 1.3% with random initialization. The reason might be that the data in Ad concentrate at specific points, as shown in Figure 4: if Q elements are selected randomly as initial cluster centers, with high probability they all fall near the mode of the PDF, so the quantization errors for the elements of Ad far from the mode are very high and cause large localization errors.

5.4.3. Effect of K-Means Clustering

Our proposed method was compared with the variant without the K-means algorithm of Section 3.1, i.e., μk calculated according to Equation (13) was used directly as the quantization center. Figure 13 shows the result: without K-means clustering, the normalized mean errors increased by about 0.8%. This proves that K-means clustering successfully minimizes the quantization error of Equation (6).

5.5. Parameter Sensitivity Analysis

There are two important parameters in our proposed method: the number of bits nb = log2Q used to encode each element of Ad, and the number of regression steps D. Table 3 shows the normalized mean error loss against the standard SDM for different choices of nb.
It can be concluded from the table that the normalized mean error loss against the standard SDM decreased almost linearly when the number of bits was not smaller than 6. The error loss difference between nb = 6 and nb = 16 was quite small and was hardly noticeable in video applications. The above data justified the choice of Q = 26 = 64 in all our experiments.
Experiments for the normalized mean error loss against the standard SDM with different choices of regression steps D were also conducted in this section. Table 4 shows the results.
This table verifies that the normalized mean error loss was not sensitive to the choice of the number of regression steps if D > 1. This justifies the robustness of our proposed algorithm. In reality, it is uncommon to choose very small D values unless the computing resources are extremely restricted since feature localization accuracy is not guaranteed even with the standard SDM. This table also reveals the fact that with our proposed method we are free to choose D according to the demand of the application because the accuracy loss is not sensitive to D.

5.6. User Study for AR Mobile Applications

An AR mobile application with SDM using our proposed compressed trained model was developed. Sample effects of this application are shown in Figure 14. This application adds some interesting virtual decorations to the face video in real-time.
Twenty short face videos of different people were recorded, each about 30 s long. AR effects for these face videos were generated with our developed mobile application. Each face video produced two output videos: one with standard SDM [6] and the other with our proposed compressed model. The two output videos of a given face video have the same AR effect, while different face videos have different AR effects.
Twenty people were recruited to score the results generated from the 20 face videos, on a scale from 1 to 5 points. Half of the participants were male and half female; their ages ranged from 19 to 40, and they were undergraduate students, graduate students, or teachers. None of them was involved in this research project.
In this experiment, the two output videos were shown simultaneously on the monitor and the test subject scored the visual effects. The results showed that 14 out of 20 people gave exactly the same score to the two methods in all the videos. The scores of the other six people are listed in Table 5.
From this table it can be concluded that the visual effects generated by our compressed model were very similar with the uncompressed counterpart. This also proves the effectiveness of our proposed algorithm.
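As a quick sanity check on Table 5, the average scores of the six non-identical raters can be compared directly. This is plain arithmetic on the published numbers, not part of the original study protocol:

```python
# Average scores of the six raters who did not give identical scores
# (values taken from Table 5): uncompressed SDM [6] vs. our compressed model.
uncompressed = [4.10, 3.80, 4.15, 3.95, 4.40, 3.85]
compressed = [4.05, 3.70, 4.10, 4.00, 4.45, 3.75]

# Per-rater difference; a small positive mean indicates the uncompressed
# model was rated only marginally higher on average.
diffs = [u - c for u, c in zip(uncompressed, compressed)]
mean_diff = sum(diffs) / len(diffs)  # about 0.033 points on a 5-point scale
```

The mean gap of roughly 0.03 points on a 5-point scale is consistent with the conclusion that the two models are visually indistinguishable for most raters.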

5.7. Computational Cost Analysis

The only computational overhead of our proposed method for online feature tracking is decompressing the model data file; the face alignment process itself is exactly the same as in [6]. Fortunately, decompression only needs to be performed once, before feature tracking starts. Decompression took about 20 ms on an iPhone 6, which is negligible compared with the loading time of the mobile application, which is on the order of several seconds.
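The decompression step is conceptually a table lookup. As an illustrative sketch (not the authors' exact file format from Figure 6), assume the compressed file stores the 2^nb K-means centroids followed by the packed nb-bit indices; reconstruction then maps each index back to its centroid:

```python
import numpy as np

def decompress_model(packed: np.ndarray, centroids: np.ndarray,
                     n_values: int, nb: int) -> np.ndarray:
    """Rebuild quantized model values from packed nb-bit indices."""
    # Unpack the byte stream into bits and keep exactly n_values groups.
    bits = np.unpackbits(packed)[: n_values * nb].reshape(n_values, nb)
    # Interpret each nb-bit group as an unsigned integer index (MSB first).
    weights = 1 << np.arange(nb - 1, -1, -1)
    indices = bits @ weights
    # Replace every index with its K-means centroid.
    return centroids[indices]
```

With nb = 6 bits per value instead of 32-bit floats, the index payload alone is 6/32 ≈ 18.75% of the original size (ignoring the small centroid-table overhead), consistent with the "less than 19%" figure reported in the abstract.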
The extra computational cost introduced into the training process, as described in Section 3.2, is listed in Table 6. Our proposed data compression algorithm was implemented in C++ with the Microsoft Visual Studio 2015 IDE on a 64-bit Windows 7 operating system. All the data in Table 6 were collected on a PC equipped with an Intel 3.4 GHz i7-4770 CPU and 8 GB of RAM.
From Table 6 it can be concluded that the computational overhead for each regression step in the tightly-coupled training process was about 12.3 + 256.7 + 9.4 = 278.4 ms. In this paper, a 6-step regressor was trained, so the total computational overhead for the tightly-coupled training process was 278.4 × 6 = 1670.4 ms. Since the whole training process took about 20 min, the extra computational cost is again negligible.

6. Discussion

This paper proposed an adaptive data compression method for the trained model of an SDM-based face alignment algorithm. An efficient probability density-aware K-means algorithm was proposed to quantize the model data. Furthermore, the quantization was tightly coupled into the training process so that the accuracy loss was minimized. Experimental results showed that our proposed method performed on par with the standard method while requiring less than 1/5 of the original storage space. Our method even achieved results comparable to state-of-the-art deep learning based methods on images with moderate head poses, while consuming an order of magnitude less storage space. The proposed method thus makes it practical to apply the SDM algorithm in mobile applications. In our method, all the data were quantized with the same number of bits. However, Figure 4 shows that many values are concentrated around 0; these values could be represented with fewer bits or even pruned. In the future, we plan to pursue this direction to further enhance the compression power.
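To illustrate the idea behind the probability density-aware initialization, the sketch below seeds the K-means centroids at equal-probability quantiles of the empirical distribution, so dense regions (such as the mass around 0 in Figure 4) receive proportionally more quantization levels than random initialization would give them. This is one plausible realization of the idea; the paper's exact initialization may differ in detail:

```python
import numpy as np

def density_aware_kmeans(values, n_clusters, n_iter=50):
    """1-D K-means quantization seeded from the empirical distribution.

    Illustrative sketch: initial centroids sit at the midpoints of
    equal-mass quantile bins, so dense regions of the data start with
    proportionally more centroids than uniform random seeding.
    """
    x = np.asarray(values, dtype=np.float64).ravel()
    # Midpoints of n_clusters equal-probability-mass quantile bins.
    probs = (np.arange(n_clusters) + 0.5) / n_clusters
    centroids = np.quantile(x, probs)
    for _ in range(n_iter):
        # Assign every value to its nearest centroid, then update means.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if members.size:
                centroids[k] = members.mean()
    labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    return centroids, labels
```

The returned centroids form the quantization codebook, and the labels are the nb-bit indices that get written to the compressed model file.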

Author Contributions

Y.S., Q.J. and W.Y. conceived and designed the experiments; Y.S., B.W. and Q.Z. analyzed the data; and Y.S. wrote the paper.

Funding

This research was partially funded by the National Natural Science Foundation of China under contract numbers 61501451, 51875380, 51375323 and 61563022; the Cooperative Innovation Fund-Prospective of Jiangsu Province under grant BY2016044-01; the Major Program of the Natural Science Foundation of Jiangxi Province, China, under grant 20152ACB20009; the high level talents of "Six Talent Peaks" in Jiangsu Province, China, under grant DZXX-046; and the Qing Lan Project of Jiangsu Province, China.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Vezzetti, E.; Marcolin, F. Geometrical descriptors for human face morphological analysis and recognition. Robot. Auton. Syst. 2012, 60, 928–939.
2. Basaran, E.; Gokmen, M.; Kamasak, M. An efficient multiscale scheme using local Zernike moments for face recognition. Appl. Sci. 2018, 8, 827.
3. Moos, S.; Marcolin, F.; Tornincasa, S.; Vezzetti, E.; Violante, M.G.; Fracastoro, G.; Speranza, D.; Padula, F. Cleft lip pathology diagnosis and foetal landmark extraction via 3D geometrical analysis. Int. J. Interact. Des. Manuf. 2017, 11, 1–18.
4. Naqvi, R.; Arsalan, M.; Batchuluun, G.; Yoon, H.S.; Park, K.R. Deep learning-based gaze detection system for automobile drivers using a NIR camera sensor. Sensors 2018, 18, 456.
5. Li, H.; Ding, H.; Huang, D.; Wang, Y.; Zhao, X.; Morvan, J.-M.; Chen, L. An efficient multimodal 2D + 3D feature-based approach to automatic facial expression recognition. Comput. Vis. Image Understand. 2015, 140, 83–92.
6. Xiong, X.; De la Torre, F. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
7. Jin, X.; Tan, X. Face alignment in-the-wild: A survey. Comput. Vis. Image Understand. 2017, 162, 1–22.
8. Cootes, T.; Taylor, C.; Cooper, D.; Graham, J. Active shape models—their training and application. Comput. Vis. Image Understand. 1995, 61, 38–59.
9. Tzimiropoulos, G.; Pantic, M. Gauss-Newton deformable part models for face alignment in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014.
10. Cootes, T.; Edwards, G.; Taylor, C. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 681–685.
11. Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154.
12. 300-W Dataset. Available online: https://ibug.doc.ic.ac.uk/resources/300-W/ (accessed on 19 August 2018).
13. Cristinacce, D.; Cootes, T. Feature detection and tracking with constrained local models. In Proceedings of the British Machine Vision Conference, Edinburgh, UK, 4–7 September 2006.
14. Cao, X.; Wei, Y.; Wen, F.; Sun, J. Face alignment by explicit shape regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
15. Burgos-Artizzu, X.P.; Perona, P.; Dollar, P. Robust face landmark estimation under occlusion. In Proceedings of the International Conference on Computer Vision Workshops, Sydney, Australia, 1–8 December 2013.
16. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014.
17. Xiong, X.; De la Torre, F. Global supervised descent method. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015.
18. Sun, Y.; Wang, X.; Tang, X. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
19. Trigeorgis, G.; Snape, P.; Nicolaou, M.A.; Antonakos, E.; Zafeiriou, S. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
20. Jourabloo, A.; Liu, X. Large-pose face alignment via CNN-based dense 3D model fitting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
21. Zhu, X.; Lei, Z.; Liu, X.; Shi, H.; Li, S.Z. Face alignment across large poses: A 3D solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
22. Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230000 3D facial landmarks). In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
23. Bulat, A.; Tzimiropoulos, G. Two-stage convolutional part heatmap regression for the 1st 3D face alignment in the wild (3DFAW) challenge. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
25. MacKay, D. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003; ISBN 0521642981.
26. Pan, Z.; Chen, L.; Sun, X. Low complexity HEVC encoder for visual sensor networks. Sensors 2015, 15, 30115–30125.
27. Iandola, F.; Han, S.; Moskewicz, M.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
28. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient convnets. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
29. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
30. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv 2016, arXiv:1602.02830.
31. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
32. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
33. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. arXiv 2018, arXiv:1801.04381.
34. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011.
35. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
36. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005.
37. Yan, J.; Lei, Z.; Yi, D.; Li, S.Z. Learn to combine multiple hypotheses for accurate face alignment. In Proceedings of the International Conference on Computer Vision Workshops (300-W Challenge), Sydney, Australia, 2–8 December 2013.
38. Facial Point Annotations. Available online: https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/ (accessed on 24 October 2018).
39. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; Wiley-Interscience Press: Hoboken, NJ, USA, 2000; ISBN 978-0-471-05669-0.
40. Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 2009, 75, 245–248.
41. Murphy, K.P. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012; pp. 352–354. ISBN 978-0-262-01802-9.
42. Sedgewick, R.; Wayne, K. Algorithms, 4th ed.; Addison-Wesley Professional Press: Boston, MA, USA, 2011; ISBN 978-0321573513.
43. Sagonas, C.; Antonakos, E.; Tzimiropoulos, G.; Zafeiriou, S.; Pantic, M. 300 faces in-the-wild challenge: Database and results. Image Vis. Comput. 2016, 47, 3–18.
44. C++11 Implementation of the Supervised Descent Optimization Method. Available online: https://github.com/patrikhuber/superviseddescent (accessed on 19 August 2018).
45. 2D-FAN. Available online: https://www.adrianbulat.com/face-alignment/ (accessed on 19 August 2018).
46. Sagonas, C.; Tzimiropoulos, G.; Zafeiriou, S.; Pantic, M. A semi-automatic methodology for facial landmark annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013.
Figure 1. Face alignment example with the supervised descent method (SDM) algorithm [6]. Firstly, the face region is automatically detected in the image [11]. Then, features are localized within the detected face region. (Image courtesy of [12]).
Figure 2. Definition of the 68 facial landmarks. (Image courtesy of [38]).
Figure 3. The flow chart of the training algorithm of SDM.
Figure 4. Typical example of the estimated probability density function (PDF) for the elements in A1.
Figure 5. The flow chart of the proposed tightly-coupled training algorithm.
Figure 6. The diagram of the compressed data storage arrangement.
Figure 7. Normalized mean error loss against the standard SDM.
Figure 8. Face alignment results for different methods on the 300-W dataset [12]. The left column shows the results with our compressed trained model. The center column shows the results of the SDM algorithm with the uncompressed trained model [6]. The right column shows the results of 2D-FAN [22].
Figure 9. Normalized mean errors of SDM with our compressed trained model compared with SDM with the uncompressed trained model [6] and the deep learning based method [22] on the 300-W test set [12].
Figure 10. Normalized mean errors of SDM with our compressed trained model compared with SDM with the uncompressed trained model [6] and the deep learning based method [22] on a subset of the 300-W test set [12] whose yaw angles are between 0 and 30 degrees.
Figure 11. Comparison of the cumulative error distribution curves between SDM with our compressed trained model and SDM with the uncompressed trained model [6] on the 300-W test set [12].
Figure 12. Normalized mean errors of our proposed method versus the method without the tightly-coupled training algorithm.
Figure 13. Normalized mean errors of our proposed method versus the method without K-means clustering.
Figure 14. Sample AR (Augmented Reality) effects of the mobile application using our proposed compressed model version of SDM.
Table 1. Data ranges for a typical HoG-based 6-step regressor.
| Step Index | Ad Minimum Value | Ad Maximum Value | bd Minimum Value | bd Maximum Value |
|---|---|---|---|---|
| 1 | −0.0072 | 0.0075 | −0.2597 | 0.1381 |
| 2 | −0.0081 | 0.0075 | −0.1774 | 0.1308 |
| 3 | −0.0050 | 0.0060 | −0.1093 | 0.0644 |
| 4 | −0.0040 | 0.0042 | −0.0680 | 0.0400 |
| 5 | −0.0032 | 0.0037 | −0.0295 | 0.0283 |
| 6 | −0.0024 | 0.0031 | −0.0180 | 0.0157 |
Table 2. Comparisons of normalized mean errors of our proposed method versus the method with random initialization for the K-means algorithm.
| Face Parts | Our Proposed Method | The Method with Random Initialization (100 Trials) |
|---|---|---|
| Face contour | 0.0427 | 0.0568 ± 0.0047 |
| Eyebrows | 0.0407 | 0.0559 ± 0.0038 |
| Eyes | 0.0287 | 0.0401 ± 0.0032 |
| Nose | 0.0272 | 0.0396 ± 0.0035 |
| Mouth | 0.0359 | 0.0512 ± 0.0052 |
Table 3. The normalized mean error loss against the standard SDM for different choices of nb.
| nb | The Normalized Mean Error Loss against the Standard SDM |
|---|---|
| 4 | 0.0241 |
| 5 | 0.0109 |
| 6 | 0.00274 |
| 7 | 0.00253 |
| 8 | 0.00235 |
| 9 | 0.00212 |
| 10 | 0.00193 |
| 11 | 0.00180 |
| 12 | 0.00167 |
| 13 | 0.00151 |
| 14 | 0.00134 |
| 15 | 0.00123 |
| 16 | 0.00121 |
Table 4. The normalized mean error loss against the standard SDM for different choices of D.
| D | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| The normalized mean error loss against the standard SDM | 0.0122 | 0.00441 | 0.00372 | 0.00325 | 0.00291 | 0.00274 | 0.00269 | 0.00263 |
Table 5. Comparisons of average scores for the two methods.
| Index | Average Score of Uncompressed Trained Model SDM [6] | Average Score of Our Compressed Model Method |
|---|---|---|
| 1 | 4.10 | 4.05 |
| 2 | 3.80 | 3.70 |
| 3 | 4.15 | 4.10 |
| 4 | 3.95 | 4.00 |
| 5 | 4.40 | 4.45 |
| 6 | 3.85 | 3.75 |
Table 6. Extra computational cost for each part of the algorithm in the tightly-coupled training process.
| Module of the Algorithm | Average Time (ms) |
|---|---|
| Probability density-aware initialization | 12.3 |
| K-means clustering | 256.7 |
| Data quantization | 9.4 |
