Article

Tightly-Coupled Data Compression for Efficient Face Alignment

1 College of Mechanical Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
2 Suzhou Key Laboratory of Precision and Efficient Machining Technology, Suzhou 215009, China
3 Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2018, 8(11), 2284; https://doi.org/10.3390/app8112284
Submission received: 12 September 2018 / Revised: 2 November 2018 / Accepted: 15 November 2018 / Published: 19 November 2018

Featured Application

The proposed method in this paper is suitable for resource restricted environments such as mobile face alignment applications.

Abstract

Face alignment is a key component of applications such as face and expression recognition and face-based AR (Augmented Reality). Among existing algorithms, cascaded-regression-based methods have become popular in recent years for their low computational cost and satisfactory performance in uncontrolled environments. However, the trained models of cascaded-regression-based methods are large, which makes them difficult to apply in resource-restricted scenarios such as mobile phone applications. In this paper, a data compression method for the trained model of the supervised descent method (SDM) is proposed. First, based on the distribution of the model data estimated with a non-parametric method, a K-means-based data quantization algorithm with probability-density-aware initialization is proposed to quantize the model data efficiently. Then, a tightly-coupled SDM training algorithm is proposed so that the training process reduces the errors caused by data quantization. Quantitative experimental results show that our proposed method compresses the trained model to less than 19% of its original size with very similar feature localization performance. The proposed method opens the gates to efficient mobile face alignment applications based on SDM.


1. Introduction

Face alignment is an important part of facial image analysis. It automatically localizes facial feature points such as eyes, nose, eyebrows, mouth, etc. from a face image. It plays an important role in popular applications such as face recognition [1,2], attribute computing [3,4], and expression recognition [5]. Face alignment technology is usually applied to get some anchor points for affine warping so that the face recognition procedure is robust against pose variations. In [3], facial landmarks were used to outline a fetus’s face. In [4], facial landmarks helped to localize and represent salient regions of the face. Figure 1 shows an example of a face alignment algorithm with the supervised descent method (SDM) [6].
According to a recent survey [7], face alignment methods can be divided into two categories: generative methods and discriminative methods.
Generative methods explicitly construct generative models for the shape and/or appearance of the face. Feature locations are derived from the best fit of the model to the test image. Cootes et al. proposed the well-known active shape models (ASM), which fit a model for every facial part separately [8]. In [9], the authors proposed the Gauss-Newton Deformable Part Model (GN-DPM), which constructs generative models for all facial parts simultaneously. The classical active appearance models (AAM) [10], which combine a shape model, an appearance model, and a motion model, also belong to this category. A drawback of AAM is that it is not robust against occlusions.
Different from generative methods, discriminative methods aim to estimate the mapping between facial appearance and feature locations directly. Constrained local models (CLM) learn an independent local detector for each feature point [13]; a shape model is then used to regularize these local detectors. In contrast to CLM, cascaded regression methods directly learn a vectorial regression function to calculate the face shape stage by stage. Explicit shape regression (ESR) [14], a two-level boosted regression framework, was one of the first algorithms in this category. Burgos-Artizzu et al. introduced occlusion information into the regression process [15] in order to improve robustness. Kazemi and Josephine proposed using regression trees instead of random ferns and achieved super-fast speed [16]. Besides the above-mentioned two-level boosted regression frameworks, Xiong and De la Torre presented a cascaded linear regression method with hand-crafted features [6]. The contribution of [6] is a provable supervised descent method (SDM). The authors also extended SDM to Global SDM in order to cope with the problem of conflicting gradient directions [17]. SDM is a popular face alignment method, especially for resource-restricted applications, since it achieves state-of-the-art results in real 2D scenarios while retaining real-time performance.
With the advent of the deep learning era, deep neural networks have been successfully applied in many computer vision tasks in recent years. Sun et al. were the first to use a deep convolutional network cascade for face alignment [18]. Reference [19] proposed a recurrent neural network approach. Recently, 3D face alignment has been achieved by fitting a 3D Morphable Model (3DMM) with convolutional neural networks (CNN) [20,21]. Reference [22] proposed a 3D face alignment network (3D-FAN) by stacking four hourglass networks. Reference [23] proposed a two-stage method built on a deep residual network [24]: heat-maps of 2D landmarks are first calculated using convolutional part heat-map regression, and these heat-maps, together with the original RGB image, are then used to regress the depth information with a very deep residual network. Although deep learning methods, especially 3D ones, perform better on images with large head poses than traditional methods do, they are not easily applied on mobile platforms, for the following reasons: (1) a deep learning model is usually on the order of 100 MB, which is too big for mobile applications; (2) the computational cost is still quite high, and real-time performance can hardly be achieved on mobile platforms; (3) deep learning models need huge amounts of training data, which are not easy to collect for the face alignment task; and (4) the training process of deep learning models is tricky without open source implementations from the authors.
As mentioned above, SDM achieves satisfactory results at a relatively low computational cost. However, the trained model for SDM can exceed 80 MB, which is still too large for a commercial mobile application. Unfortunately, traditional lossless compression technology such as entropy coding [25] cannot achieve a compression rate high enough for mobile applications. Lossy compression technology is widely used in video encoding. State-of-the-art methods such as HEVC (High Efficiency Video Coding) reach very high compression rates with good visual quality [26]. Unfortunately, this kind of technology depends heavily on block motion estimation between consecutive frames in the time domain, which is obviously unavailable in our work.
Recently, research on deep learning network compression has gradually emerged. Reference [27] proposed new kinds of convolutional operations to reduce parameters. Reference [28] compressed the network by pruning unimportant filters according to weight analysis. References [29,30] converted the weights to binary values in order to reduce the size of the model. Instead of binary values, Zhu et al. proposed a method that reduces the precision of the weights to ternary values [31], which avoids most of the accuracy degradation. Howard et al. proposed a depth-wise separable convolution architecture so that traditional 3D convolutions can be decomposed into 2D convolutions [32], which makes it suitable for mobile applications. Based on [32], shortcut connections were introduced in [33]; furthermore, linear bottlenecks were used instead of ReLU [34] in order to preserve the features. Most of the above methods target specific tasks or network structures, such as image classification and image segmentation. To the best of our knowledge, there is no compression architecture that can be applied directly to face alignment networks.
In this paper, a tightly-coupled data compression method for the trained model of the supervised descent method (SDM) is proposed; it reduces the model to less than 1/5 of its original size without obvious performance loss. This method opens the gates to mobile applications using SDM-based face alignment technology.
The remainder of the paper is organized as follows: Section 2 briefly describes the main procedure of the SDM algorithm [6] for the sake of completeness and clarity; Section 3 explains our proposed method in detail; Section 4 demonstrates some detailed algorithm implementations in the proposed method; Section 5 shows both qualitative and quantitative experimental results; Section 6 draws conclusions.

2. Basics of SDM

This section briefly introduces the main workflow of SDM. Please refer to [6] for details.
We assume that the face feature points are represented by N 2D landmarks s = [x1, y1, …, xN, yN]T. Usually, N = 68. Figure 2 shows the definition of the 68 face landmarks. Given a face image I and the initial 2D landmarks s0 estimated from the detected face region, our aim is to find a series of regressors:
R = r1 ∙∙∙ rD
where rd = {Ad, bd}(d = 1 ∙∙∙ D), Ad is the projection matrix that can also be called the descent direction and bd is the bias term. In SDM, D is usually chosen between 4 and 6, and the estimated 2D landmarks at dth step are calculated according to the following equation:
sd = sd−1 + Adf(I, sd−1) + bd
f(I, sd−1) are the shape-related features, which can be SIFT (Scale-Invariant Feature Transform) features [35] or HOG (Histograms of Oriented Gradients) features [36] for better performance [37]. These features are calculated at the landmarks sd−1 in image I. The final estimated facial landmarks are sD.
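In code, the cascade of Equation (1) amounts to a short loop. A minimal Python sketch follows; the names `extract` and `regressors` are ours, not the paper's, and `extract(s)` stands in for the feature extractor f(I, s):

```python
import numpy as np

def sdm_align(extract, regressors, s0):
    """Run the SDM cascade: s_d = s_{d-1} + A_d f(I, s_{d-1}) + b_d.

    `extract(s)` is a stand-in for f(I, s) (e.g. HOG at the current
    landmarks); `regressors` is the list of (A_d, b_d) pairs.
    """
    s = np.asarray(s0, dtype=float).copy()
    for A, b in regressors:
        s = s + A @ extract(s) + b
    return s
```

With D regressors, this is D feature extractions and D matrix-vector products, which is what makes SDM cheap at test time.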
Given M training images, SDM aims to minimize a series of the following equations and get rd sequentially:
$$\arg\min_{A_d, b_d} \sum_{i=1}^{M} \left\| \Delta s_{d-1}^{i} - A_d f\!\left(I^{i}, s_{d-1}^{i}\right) - b_d \right\|^{2}$$
where $\Delta s_{d}^{i}$ is the shape residual of the ith training image at the dth regression step:
$$\Delta s_{d}^{i} = s_{*}^{i} - s_{d}^{i}$$
$s_{*}^{i}$ are the ground truth landmark locations of the ith training image. Equation (2) is a standard linear least-squares problem and can be solved in closed form.
Figure 3 shows the flow chart of the SDM training algorithm.
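The closed-form solution of each step's least-squares problem can be sketched with NumPy's least-squares solver. The row-stacking convention and the helper name `train_step` are our own, not the paper's:

```python
import numpy as np

def train_step(F, dS):
    """Closed-form solution of one regression step (Eq. 2, sketch).

    F:  M x K matrix, row i = f(I^i, s_{d-1}^i)
    dS: M x 2N matrix, row i = shape residual of training image i
    Returns A_d (2N x K) and b_d (2N,) minimising ||dS - F A_d^T - b_d||^2.
    """
    Fa = np.hstack([F, np.ones((F.shape[0], 1))])  # absorb the bias term b_d
    W, *_ = np.linalg.lstsq(Fa, dS, rcond=None)    # (K+1) x 2N
    return W[:-1].T, W[-1]
```

Appending a column of ones lets a single `lstsq` call recover both the projection matrix and the bias term.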

3. Tightly-Coupled Data Compression Algorithm

As shown in Section 2, the key components of SDM are the regressors rd for all D steps, so the trained model stores all the information of rd. Since the features calculated at every landmark are concatenated into one large feature vector, the feature dimension can easily reach 27,200 when HoG features are used. The dimension of Ad is therefore 136 × 27,200, and bd is a 136-dimensional vector. Typically, each component of Ad and bd is represented by a single-precision floating point number. Table 1 shows the data ranges of Ad and bd for a typical HoG-based 6-step regressor.
From the table it can be concluded that the data ranges vary across regression steps, so in this paper the data are compressed separately for each step. Furthermore, the data range of Ad is very different from that of bd, and bd contains only 136 floating point numbers per step (about 3 KB for all 6 steps), which is much smaller than Ad. Therefore, only the data in Ad are compressed.
Since the data distribution of Ad cannot be described by a parametric model, a non-parametric method is applied to estimate it. The number of elements in Ad is large, so for the sake of computational efficiency the Parzen window method [39] is used here. Assuming that the number of elements in Ad is T and the window size is h, the probability density function (PDF) of an element x in Ad can be estimated through the following equation:
$$p(x) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{h}\, \phi\!\left(\frac{x - x_{i}}{h}\right)$$
where $\phi(x)$ is the square window function
$$\phi(x) = \begin{cases} 1 & |x| \le 0.5 \\ 0 & \text{otherwise} \end{cases}$$
Figure 4 shows the estimated PDF of the elements in A1; the shapes of the PDFs in the other steps are similar. From the figure it can be concluded that most of the values in Ad concentrate around 0, and the distribution shows a long-tail effect. As a result, uniform quantization is inappropriate for compressing these data. In this paper, a K-means based data quantization algorithm with a probability density-aware initialization technique is proposed to cope with these difficulties.
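With the square window of Equation (5), the Parzen estimate of Equation (4) reduces to counting the samples that fall within h/2 of the query point. A minimal sketch:

```python
import numpy as np

def parzen_pdf(x, samples, h):
    """Parzen-window density estimate with a square window (Eqs. 4-5).

    p(x) = (1/T) * sum_i (1/h) * phi((x - x_i)/h), with phi = 1 iff |.| <= 0.5,
    i.e. the fraction of samples within h/2 of x, divided by h.
    """
    samples = np.asarray(samples, dtype=float)
    inside = np.abs((x - samples) / h) <= 0.5
    return inside.sum() / (samples.size * h)
```

Evaluating this on a grid over [Vmin, Vmax] produces curves like the one in Figure 4.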

3.1. K-Means Based Data Quantization with Probability Density-Aware Initialization

The basic idea of our proposed data compression algorithm is to quantize all the elements of Ad into Q predefined values so that each element can be represented with fewer bits. The optimal Q predefined values can be calculated through minimizing the following equation:
$$E = \sum_{k=1}^{Q} \sum_{i=1}^{T} \mathbf{1}\!\left(A_d(i) \in C_k\right) \left\| A_d(i) - q_k \right\|_{2}^{2}$$
where 1(∙) is the characteristic function, T is the number of elements in Ad, Ad(i) is the ith element of Ad, C = {C1, C2, …, CQ} divides the data of Ad into Q disjoint clusters, and qk is the representative scalar of cluster k. The minimization problem of Equation (6) is NP-hard [40]; the K-means algorithm [41] can be applied to obtain an approximate solution. However, the performance of the K-means algorithm depends heavily on its initialization. As shown in Figure 4, the data distribution of Ad has a single peak. Traditional random initialization for the K-means algorithm cannot capture this characteristic, so the acquired result is far from optimal.
In this paper, a probability density-aware initialization method is proposed: the initial quantization step size (cluster size) is made inversely proportional to the probability density of the data, so that the data distribution of Ad is fully captured. Thus, we have
$$\int_{v_{k-1}}^{v_{k}} p(x)\,dx = \mathrm{const.} = \frac{1}{Q} \quad (k = 1 \cdots Q)$$
where vk−1 and vk are lower and upper bounds of the kth quantization step respectively. So the optimal initialization strategy can be estimated through the following equation:
$$\arg\min_{v_{1} \cdots v_{Q-1}} \sum_{k=1}^{Q} \left( \int_{v_{k-1}}^{v_{k}} p(x)\,dx - \frac{1}{Q} \right)^{2} \quad \text{s.t.}\; v_{0} = V_{\min},\; v_{Q} = V_{\max}$$
where Vmin and Vmax are the minimum and maximum values of Ad, as shown in Table 1. Unfortunately, an exact solution of the above equation is difficult to obtain. Since only a reasonable initialization for the K-means algorithm is needed, an approximate solution is proposed as follows.
All the elements in Ad are sorted in ascending order and stored in an array SA. The lowest and highest quantization step bounds can be calculated with the following equation:
v0 = SA(1), vQ = SA(T)
The remaining quantization step bounds are calculated with Equations (10) and (11) as follows:
$$idx = \left\lfloor \frac{kT}{Q} \right\rfloor$$
vk = SA(idx) (k = 1 … Q − 1)
where ⌊∙⌋ is the floor function, which rounds its argument to the nearest integer towards minus infinity.
With the above algorithm, the number of elements between consecutive quantization step bounds is nearly the same, so Equation (8) is approximately solved. Most importantly, all the quantization step bounds can be efficiently estimated. However, we do not choose the mid-value between the lower and upper bounds as the initial value for the K-means algorithm, since the data distribution inside each quantized region is not uniform. Instead, the initial value is set to the mean of all the elements that fall in the same quantized region, as follows. First, the set of all elements belonging to the kth quantization step is calculated:
$$U_{k} = \left\{ A_d(i) \,\middle|\, v_{k-1} < A_d(i) \le v_{k},\; 1 \le i \le T \right\}$$
Then, the initial value of the kth cluster center for the K-means algorithm can be calculated as follows:
$$\mu_{k} = \frac{1}{\left| U_{k} \right|} \sum_{j=1}^{\left| U_{k} \right|} U_{k}(j)$$
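The initialization of Equations (9)-(13) can be sketched in a few lines: equal-population bin bounds are taken from the sorted data, and each bin's initial centre is the mean of its members. The function name is ours:

```python
import numpy as np

def density_aware_init(Ad, Q):
    """Probability-density-aware K-means initialisation (sketch of Eqs. 9-13).

    Quantization step bounds are placed so that each of the Q bins holds
    roughly T/Q of the sorted elements (Eqs. 9-11); the initial centre of
    each bin is the mean of the elements inside it (Eqs. 12-13).
    """
    sa = np.sort(np.ravel(Ad))                        # SA, ascending order
    T = sa.size
    cuts = [0] + [(k * T) // Q for k in range(1, Q)] + [T]   # Eq. (10), floor
    return np.array([sa[cuts[k]:cuts[k + 1]].mean() for k in range(Q)])
```

Because each bin holds the same number of samples, dense regions of the PDF automatically receive narrower bins, matching Equation (7).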
With these initial values, the K-means algorithm [41] can be applied so that the optimal clusters’ centers qk (k = 1 ∙∙∙ Q) with regard to Equation (6) are estimated. Thus, the quantized value AQd(i) (i = 1 ∙∙∙ T) that corresponds to each element Ad(i) (i = 1 ∙∙∙ T) can be calculated with the following equations:
$$idx_{i} = \arg\min_{k \in \{1, 2, \ldots, Q\}} \left\| A_d(i) - q_{k} \right\|_{2}$$
$$AQ_{d}(i) = q_{idx_{i}}$$
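Given the final cluster centres qk, Equations (14)-(15) are a nearest-centre lookup. A sketch (the function name is ours):

```python
import numpy as np

def quantize(Ad, centers):
    """Replace each element of A_d by its nearest cluster centre (Eqs. 14-15).

    Returns the index matrix (what is actually stored, log2(Q) bits per
    element) and the reconstructed quantized matrix AQ_d.
    """
    flat = np.ravel(Ad)
    idx = np.abs(flat[:, None] - centers[None, :]).argmin(axis=1)   # Eq. (14)
    return idx.reshape(np.shape(Ad)), centers[idx].reshape(np.shape(Ad))  # Eq. (15)
```

Only the small table of centres and the index matrix need to be stored; the quantized projection matrix is reconstructed on load.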

3.2. Tightly-Coupled Training Algorithm

If the final learned Ad is quantized directly as described in the previous section, the quantization process introduces extra errors into the feature localization results. In order to reduce these errors, we modified the traditional SDM training algorithm described in Section 2 and propose the tightly-coupled training algorithm shown in Figure 5.
In this algorithm, the data quantization process is coupled with the training process (steps 4 and 5), so that the errors caused by quantization in one step are propagated into the next regression step. As a result, the projection matrix of the next step can partially correct the errors introduced by quantization, and the final results are improved.
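The loop structure can be sketched as follows. This is our reading of Figure 5, not the paper's code: the quantized matrix AQd, rather than the raw Ad, is used to advance the training shapes, so step d+1 is fitted against the residuals that remain after quantization. `extract`, `train_step`, and `quantize` are hypothetical helpers standing in for Sections 2 and 3.1:

```python
import numpy as np

def tightly_coupled_train(images, gt, init, D, extract, train_step, quantize):
    """Sketch of the tightly-coupled SDM training loop (Figure 5, our reading).

    Returns the list of (AQ_d, b_d) regressors and the final training shapes.
    """
    regressors, shapes = [], [np.array(s, dtype=float) for s in init]
    for _ in range(D):
        F = np.array([extract(I, s) for I, s in zip(images, shapes)])
        A, b = train_step(F, np.array(gt) - np.array(shapes))   # Eq. (2)
        AQ = quantize(A)                                        # Section 3.1
        # propagate the QUANTIZED model, so the next step sees its errors
        shapes = [s + AQ @ f + b for s, f in zip(shapes, F)]
        regressors.append((AQ, b))
    return regressors, shapes
```

If `quantize` were the identity, this would reduce to the standard SDM training of Section 2.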

3.3. Compressed Model Data Storage Arrangement

The final compressed model consists of the stacked version of all D steps of regressors rd. Each regressor rd consists of three parts: Q quantized values for the projection matrix; the quantized projection matrix AQd that corresponds to Ad; the bias term bd.
The quantized values are stored with single-precision floating point numbers. There are Q single-precision floating point numbers for each regression step.
The quantized projection matrix AQd has the same dimensions as Ad. Each element of AQd is an index into the table of quantized values described above. Since there are Q distinct quantized values, each element of AQd occupies only log2Q bits, usually far fewer than a 32-bit floating point number. Through AQd, the corresponding quantized value can be fetched by index and the approximate projection matrix reconstructed; in this way, data compression is achieved. Throughout this paper we chose Q = 64, which is justified in the experimental results section.
The bias term bd is stored directly as a floating point number since its size is relatively small, as stated before. Figure 6 illustrates the diagram of the compressed data storage arrangement.
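Storing 6-bit indices compactly can be illustrated with NumPy's bit routines. This is a sketch of one possible byte layout under our own conventions; the paper's exact arrangement is the one in Figure 6:

```python
import numpy as np

def pack_indices(idx, bits=6):
    """Pack small integer indices into a byte stream at `bits` bits each."""
    idx = np.asarray(idx, dtype=np.uint8)
    rows = np.unpackbits(idx[:, None], axis=1)[:, 8 - bits:]  # keep low bits
    return np.packbits(rows.ravel())                          # zero-padded tail

def unpack_indices(packed, n, bits=6):
    """Inverse of pack_indices for `n` stored indices."""
    rows = np.unpackbits(packed)[: n * bits].reshape(n, bits)
    pad = np.zeros((n, 8 - bits), dtype=np.uint8)
    return np.packbits(np.hstack([pad, rows]), axis=1).ravel()
```

At 6 bits per element instead of 32, the index matrix costs 6/32 of the original storage, which is where the roughly 18.75% figure of Section 5.1 comes from.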

4. Methodology

In this section, implementation details about the algorithm proposed in Section 3.1 are described. Algorithm 1 demonstrates the pseudo codes of the approximate algorithm for solving Equation (8).
Algorithm 1. Approximate algorithm for solving Equation (8).
Input: Projection matrix Ad at the dth step; number of elements T in Ad; number of quantization levels Q.
Output: quantization step bounds v0 ∙∙∙ vQ
  • Sort all the elements of Ad in ascending order and store them in an array SA with quick sort [42].
  • Calculate v0 and vQ with Equation (9).
  • For k = 1 to Q − 1,
  •  Calculate idx with Equation (10).
  •  Get vk with Equation (11).
  end for
Algorithm 2 shows the pseudo codes for the whole procedure of the proposed data quantization algorithm.
Algorithm 2. The proposed data quantization algorithm.
Input: Projection matrix Ad (d = 1 ∙∙∙ D) for all D steps; number of quantization levels Q.
Output: Quantized projection matrix AQd (d = 1 ∙∙∙ D) for all D steps.
For d = 1 to D,
  • Calculate the optimal quantization step bounds v0 ∙∙∙ vQ according to Ad with Algorithm 1.
  • Initialize cluster centers μk (k = 1 ∙∙∙ Q) for the K-means algorithm with Equations (12) and (13).
  • Minimize Equation (6) with the K-means algorithm [41] and get the optimal clusters’ centers qk (k = 1 ∙∙∙ Q).
  • For each element Ad(i) (i = 1 ∙∙∙ T), its corresponding quantized value AQd(i) (i = 1 ∙∙∙ T) can be calculated with Equations (14) and (15).
end for

5. Results

In this section, our method was compared against the standard SDM and a deep learning based method [22] on the 300 W dataset [12]. This publicly available, challenging dataset consists of 600 indoor and outdoor in-the-wild images. It covers a large variation of identity, expression, illumination conditions, pose, occlusion, and face size. Each image has ground truth locations in the 68-point configuration [43]. The open source implementation of standard SDM by Patrik Huber [44] and the authors' implementation [45] of [22] were used in this paper. Since the SDM based method does not use any 3D information, only the 2D face alignment network (2D-FAN) version of the deep learning based method [22] was tested, for the sake of fairness.

5.1. The Choice of Q

In this section, the face alignment accuracy was evaluated according to the average distance between the detected landmarks and the ground truth, normalized by the inter-ocular distance as proposed in [46]. The number of bits used for each element in AQd was varied and their corresponding normalized mean error loss against standard SDM algorithm [6] was calculated. Figure 7 shows the result.
From the figure it can be concluded that when each element in AQd used 6 or more bits, the loss was small and decreased smoothly; with fewer than 6 bits, the error loss increased rapidly. In this paper we chose 6 bits, balancing error loss against compression efficiency. This means Q = 26 = 64.
In this paper, HoG features were utilized and 68 feature points were detected. The number of regression steps was 6. The dimensions of AQd and Ad were both 136 × 27,200, as mentioned before. The dimension of bias vector bd was 136. For standard SDM, all the elements were represented by 32-bit single precision floating point numbers. Therefore, the total space needed for the training model was
(136 × 27,200 + 136) × 4 × 6 = 88,784,064 Bytes
With our proposed method, we needed to store 64 single precision floating point numbers for quantized values, 136 × 27,200 6-bit matrix AQd and 136-dimensional single precision floating point vector bd for each regression step. Therefore, our storage consumption for the whole training data was
(64 × 4 + 136 × 27,200 × 6/8 + 136 × 4) × 6 = 16,651,200 Bytes
Our model size was only about 18.75% of the standard one, i.e., the compression rate of our proposed method was about 5.3×. Furthermore, when entropy coding [25] was applied to our compressed data, about 10% more compression was usually obtained.
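The storage arithmetic above can be replayed directly:

```python
# Model sizes from Section 5.1, in bytes (Q = 64, D = 6, HOG features)
full = (136 * 27200 + 136) * 4 * 6            # 32-bit floats everywhere
compressed = (64 * 4                          # Q quantized values per step
              + 136 * 27200 * 6 // 8          # 6-bit index matrix AQ_d
              + 136 * 4) * 6                  # bias vector b_d
print(full, compressed, compressed / full)    # ratio is approximately 0.1875
```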

5.2. Qualitative Experimental Results

In this section our feature localization results were compared against the standard SDM with an uncompressed training model [6] and 2D-FAN [22]. Figure 8 shows the results on the 300 W dataset [12]. The left column shows the results with our compressed training model, the center column the results of the SDM algorithm with an uncompressed training model [6], and the right column the results of 2D-FAN [22]. Red lines in all images depict ground truth feature locations. From this figure it can be concluded that our compressed training model generates results very similar to those of its uncompressed counterpart; both fit the ground truth very well, even in occlusion scenarios. The deep learning based method [22] performs slightly better, especially in images with large head poses such as the first and last rows. However, the model size of 2D-FAN is about 182 MB [45], which is obviously unsuitable for mobile applications. The accuracy of our proposed method is sufficient for mobile applications such as virtual add-ons, as verified in Section 5.6.

5.3. Quantitative Experimental Results

As shown in Figure 2, the face features can be divided into five parts: face contour, eyebrows, eyes, nose, and mouth. The face contour contains points No. 1 to No. 17, the eyebrows points No. 18 to No. 27, the eyes points No. 37 to No. 48, the nose points No. 28 to No. 36, and the mouth points No. 49 to No. 68. The average normalized mean errors of the five parts on the 300 W test dataset were estimated, comparing our compressed training model against the uncompressed training model and the deep learning based method. The results are shown in Figure 9. From the figure it can be concluded that our compressed training model achieves feature localization very close to that of the uncompressed training model. The deep learning based method performs slightly better; however, the differences in normalized mean error between our proposed method and the deep learning based method are below 1% for all five parts, which is acceptable considering the high computational cost and memory usage of the deep learning based method.
Similar to the work in [22], a subset was chosen from the 300 W test dataset whose yaw angles are between 0 and 30 degrees. Experiments on this subset with the above three methods were conducted. The results are shown in Figure 10. From this figure, we can find that for moderate head poses, which are the typical scenarios for mobile applications, all three methods can achieve lower normalized errors and generate very similar results, especially for eyes and face contour parts. These two parts are very important for AR (Augmented Reality) based applications. The differences of the normalized mean errors between our proposed method and deep learning based method were less than 0.6% for most parts and even achieved 0.3% for the eyes regions. This proves the effectiveness of our proposed algorithm.
Figure 11 shows the cumulative error distribution curve of our proposed method and the uncompressed training model SDM [6]. It is obvious that the two curves are very close to each other. This again confirms the similar performance of both methods despite our training model being much smaller.

5.4. Ablation Study

5.4.1. Effect of the Tightly-Coupled Training Algorithm

The effect of the tightly-coupled training algorithm is analyzed in this section. We compared our proposed method with the variant without the tightly-coupled training algorithm, i.e., with the quantization results of Section 3.1 applied directly. The results are shown in Figure 12. From the figure it can be concluded that, without the tightly-coupled training algorithm, the normalized mean error for each feature point increased by more than 2.5%. Considering that the average normalized mean error was about 3.5%, this is a large increase in localization error, which proves that the coupling process successfully reduces the errors caused by the data quantization step.

5.4.2. Effect of Probability Density-Aware Initialization

The effectiveness of our proposed probability density-aware initialization technique (Section 3.1) is investigated in this section. Our method was compared with standard random initialization for the K-means algorithm, i.e., Q elements were randomly chosen from Ad and set as the initial cluster centers instead of the μk calculated according to Equation (13). We repeated this 100 times and calculated the averages and standard deviations of the normalized mean errors for the five parts of the face. The results are shown in Table 2. From the table it can be found that the normalized mean errors increased by more than 1.3% with random initialization. The reason might be that the data in Ad concentrate at specific points, as shown in Figure 4: if Q elements are selected randomly as initial cluster centers, with high probability they all fall near the mode of the PDF, so the quantization errors for the elements of Ad far from the mode are very high and cause large localization errors.

5.4.3. Effect of K-Means Clustering

Our proposed method was compared with the variant without the K-means algorithm of Section 3.1, i.e., μk calculated according to Equation (13) was used directly as the quantization center. Figure 13 shows the result: without K-means clustering, the normalized mean errors increased by about 0.8%. This proves that K-means clustering successfully minimizes the quantization error of Equation (6).

5.5. Parameter Sensitivity Analysis

There are two important parameters in our proposed method: the number of bits nb = log2Q used to encode each element of Ad, and the number of regression steps D. Table 3 shows the normalized mean error loss against the standard SDM for different choices of nb.
It can be concluded from the table that the normalized mean error loss against the standard SDM decreased almost linearly when the number of bits was not smaller than 6. The error loss difference between nb = 6 and nb = 16 was quite small and was hardly noticeable in video applications. The above data justified the choice of Q = 26 = 64 in all our experiments.
Experiments for the normalized mean error loss against the standard SDM with different choices of regression steps D were also conducted in this section. Table 4 shows the results.
This table verifies that the normalized mean error loss was not sensitive to the choice of the number of regression steps if D > 1. This justifies the robustness of our proposed algorithm. In reality, it is uncommon to choose very small D values unless the computing resources are extremely restricted since feature localization accuracy is not guaranteed even with the standard SDM. This table also reveals the fact that with our proposed method we are free to choose D according to the demand of the application because the accuracy loss is not sensitive to D.

5.6. User Study for AR Mobile Applications

An AR mobile application with SDM using our proposed compressed trained model was developed. Sample effects of this application are shown in Figure 14. This application adds some interesting virtual decorations to the face video in real-time.
Twenty short face videos of different people were recorded, each about 30 s long. AR effects for these face videos were generated with our developed mobile application. Each face video produced two output videos: one with standard SDM [6] and the other with our proposed compressed model. The two output videos of a given face video have the same AR effect, while different face videos have different AR effects.
Twenty people were recruited to score the results generated from the 20 face videos, on a scale from 1 to 5 points. Half of the participants were male and half female; their ages ranged from 19 to 40, and they were undergraduate students, graduate students, or teachers. None of them was involved in this research project.
In this experiment, the two output videos were shown simultaneously on the monitor and the test subject scored the visual effects. The results showed that 14 out of 20 people gave exactly the same score to the two methods in all the videos. The scores of the other six people are listed in Table 5.
From this table it can be concluded that the visual effects generated by our compressed model were very similar with the uncompressed counterpart. This also proves the effectiveness of our proposed algorithm.
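As a quick sanity check on Table 5, the average scores of the six non-identical raters can be compared directly. This is plain arithmetic on the published numbers, not part of the original study protocol:

```python
# Average scores of the six raters who did not give identical scores
# (values taken from Table 5): uncompressed SDM [6] vs. our compressed model.
uncompressed = [4.10, 3.80, 4.15, 3.95, 4.40, 3.85]
compressed = [4.05, 3.70, 4.10, 4.00, 4.45, 3.75]

# Per-rater difference; a small positive mean indicates the uncompressed
# model was rated only marginally higher on average.
diffs = [u - c for u, c in zip(uncompressed, compressed)]
mean_diff = sum(diffs) / len(diffs)  # about 0.033 points on a 5-point scale
```

The mean gap of roughly 0.03 points on a 5-point scale is consistent with the conclusion that the two models are visually indistinguishable for most raters.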

5.7. Computational Cost Analysis

The only computational overhead of our proposed method for online feature tracking is decompressing the model data file; the face alignment process itself is exactly the same as in [6]. Fortunately, decompression only needs to be performed once, before feature tracking starts. Decompression took about 20 ms on an iPhone 6, which is negligible compared with the loading time of the mobile application, which is on the order of several seconds.
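The decompression step is conceptually a table lookup. As an illustrative sketch (not the authors' exact file format from Figure 6), assume the compressed file stores the 2^nb K-means centroids followed by the packed nb-bit indices; reconstruction then maps each index back to its centroid:

```python
import numpy as np

def decompress_model(packed: np.ndarray, centroids: np.ndarray,
                     n_values: int, nb: int) -> np.ndarray:
    """Rebuild quantized model values from packed nb-bit indices."""
    # Unpack the byte stream into bits and keep exactly n_values groups.
    bits = np.unpackbits(packed)[: n_values * nb].reshape(n_values, nb)
    # Interpret each nb-bit group as an unsigned integer index (MSB first).
    weights = 1 << np.arange(nb - 1, -1, -1)
    indices = bits @ weights
    # Replace every index with its K-means centroid.
    return centroids[indices]
```

With nb = 6 bits per value instead of 32-bit floats, the index payload alone is 6/32 ≈ 18.75% of the original size (ignoring the small centroid-table overhead), consistent with the "less than 19%" figure reported in the abstract.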
The extra computational cost introduced into the training process, as described in Section 3.2, is listed in Table 6. Our proposed data compression algorithm was implemented in C++ with the Microsoft Visual Studio 2015 IDE on a 64-bit Windows 7 operating system. All the data in Table 6 were collected on a PC equipped with an Intel 3.4 GHz i7-4770 CPU and 8 GB of RAM.
From Table 6 it can be concluded that the computational overhead for each regression step in the tightly-coupled training process was about 12.3 + 256.7 + 9.4 = 278.4 ms. In this paper, a 6-step regressor was trained, so the total computational overhead for the tightly-coupled training process was 278.4 × 6 = 1670.4 ms. Since the whole training process took about 20 min, the extra computational cost is again negligible.

6. Discussion

This paper proposed an adaptive data compression method for the trained model of an SDM-based face alignment algorithm. An efficient probability density-aware K-means algorithm was proposed to quantize the model data. Furthermore, the quantization was tightly coupled into the training process so that the accuracy loss was minimized. Experimental results showed that our proposed method performed on par with the standard method while requiring less than 1/5 of the original storage space. Our method even achieved results comparable to state-of-the-art deep learning based methods on images with moderate head poses, while consuming an order of magnitude less storage space. The proposed method thus makes it practical to apply the SDM algorithm in mobile applications. In our method, all the data were quantized with the same number of bits. However, Figure 4 shows that many values are concentrated around 0; these values could be represented with fewer bits or even pruned. In the future, we plan to pursue this direction to further enhance the compression power.
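To illustrate the idea behind the probability density-aware initialization, the sketch below seeds the K-means centroids at equal-probability quantiles of the empirical distribution, so dense regions (such as the mass around 0 in Figure 4) receive proportionally more quantization levels than random initialization would give them. This is one plausible realization of the idea; the paper's exact initialization may differ in detail:

```python
import numpy as np

def density_aware_kmeans(values, n_clusters, n_iter=50):
    """1-D K-means quantization seeded from the empirical distribution.

    Illustrative sketch: initial centroids sit at the midpoints of
    equal-mass quantile bins, so dense regions of the data start with
    proportionally more centroids than uniform random seeding.
    """
    x = np.asarray(values, dtype=np.float64).ravel()
    # Midpoints of n_clusters equal-probability-mass quantile bins.
    probs = (np.arange(n_clusters) + 0.5) / n_clusters
    centroids = np.quantile(x, probs)
    for _ in range(n_iter):
        # Assign every value to its nearest centroid, then update means.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if members.size:
                centroids[k] = members.mean()
    labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
    return centroids, labels
```

The returned centroids form the quantization codebook, and the labels are the nb-bit indices that get written to the compressed model file.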

Author Contributions

Y.S., Q.J. and W.Y. conceived and designed the experiments; Y.S., B.W. and Q.Z. analyzed the data; and Y.S. wrote the paper.

Funding

This research was partially funded by the National Natural Science Foundation of China under contract numbers 61501451, 51875380, 51375323 and 61563022; the Cooperative Innovation Fund-Prospective of Jiangsu Province under grant BY2016044-01; the Major Program of the Natural Science Foundation of Jiangxi Province, China, under grant 20152ACB20009; the high level talents of "Six Talent Peaks" in Jiangsu Province, China, under grant DZXX-046; and the Qing Lan Project of Jiangsu Province, China.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Vezzetti, E.; Marcolin, F. Geometrical descriptors for human face morphological analysis and recognition. Robot. Auton. Syst. 2012, 60, 928–939.
2. Basaran, E.; Gokmen, M.; Kamasak, M. An efficient multiscale scheme using local Zernike moments for face recognition. Appl. Sci. 2018, 8, 827.
3. Moos, S.; Marcolin, F.; Tornincasa, S.; Vezzetti, E.; Violante, M.G.; Fracastoro, G.; Speranza, D.; Padula, F. Cleft lip pathology diagnosis and foetal landmark extraction via 3D geometrical analysis. Int. J. Interact. Des. Manuf. 2017, 11, 1–18.
4. Naqvi, R.; Arsalan, M.; Batchuluun, G.; Yoon, H.S.; Park, K.R. Deep learning-based gaze detection system for automobile drivers using a NIR camera sensor. Sensors 2018, 18, 456.
5. Li, H.; Ding, H.; Huang, D.; Wang, Y.; Zhao, X.; Morvan, J.-M.; Chen, L. An efficient multimodal 2D + 3D feature-based approach to automatic facial expression recognition. Comput. Vis. Image Understand. 2015, 140, 83–92.
6. Xiong, X.; De la Torre, F. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
7. Jin, X.; Tan, X. Face alignment in-the-wild: A survey. Comput. Vis. Image Understand. 2017, 162, 1–22.
8. Cootes, T.; Taylor, C.; Cooper, D.; Graham, J. Active shape models—their training and application. Comput. Vis. Image Understand. 1995, 61, 38–59.
9. Tzimiropoulos, G.; Pantic, M. Gauss-Newton deformable part models for face alignment in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014.
10. Cootes, T.; Edwards, G.; Taylor, C. Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 681–685.
11. Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154.
12. 300-W Dataset. Available online: https://ibug.doc.ic.ac.uk/resources/300-W/ (accessed on 19 August 2018).
13. Cristinacce, D.; Cootes, T. Feature detection and tracking with constrained local models. In Proceedings of the British Machine Vision Conference, Edinburgh, UK, 4–7 September 2006.
14. Cao, X.; Wei, Y.; Wen, F.; Sun, J. Face alignment by explicit shape regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
15. Burgos-Artizzu, X.P.; Perona, P.; Dollar, P. Robust face landmark estimation under occlusion. In Proceedings of the International Conference on Computer Vision Workshops, Sydney, Australia, 1–8 December 2013.
16. Kazemi, V.; Sullivan, J. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014.
17. Xiong, X.; De la Torre, F. Global supervised descent method. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015.
18. Sun, Y.; Wang, X.; Tang, X. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
19. Trigeorgis, G.; Snape, P.; Nicolaou, M.A.; Antonakos, E.; Zafeiriou, S. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
20. Jourabloo, A.; Liu, X. Large-pose face alignment via CNN-based dense 3D model fitting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
21. Zhu, X.; Lei, Z.; Liu, X.; Shi, H.; Li, S.Z. Face alignment across large poses: A 3D solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
22. Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230000 3D facial landmarks). In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
23. Bulat, A.; Tzimiropoulos, G. Two-stage convolutional part heatmap regression for the 1st 3D face alignment in the wild (3DFAW) challenge. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
25. MacKay, D. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003; ISBN 0521642981.
26. Pan, Z.; Chen, L.; Sun, X. Low complexity HEVC encoder for visual sensor networks. Sensors 2015, 15, 30115–30125.
27. Iandola, F.; Han, S.; Moskewicz, M.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
28. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning filters for efficient convnets. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
29. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
30. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv 2016, arXiv:1602.02830.
31. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
32. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
33. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. arXiv 2018, arXiv:1801.04381.
34. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011.
35. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
36. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005.
37. Yan, J.; Lei, Z.; Yi, D.; Li, S.Z. Learn to combine multiple hypotheses for accurate face alignment. In Proceedings of the International Conference on Computer Vision Workshops (300-W Challenge), Sydney, Australia, 2–8 December 2013.
38. Facial Point Annotations. Available online: https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/ (accessed on 24 October 2018).
39. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; Wiley-Interscience Press: Hoboken, NJ, USA, 2000; ISBN 978-0-471-05669-0.
40. Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 2009, 75, 245–248.
41. Murphy, K.P. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012; pp. 352–354. ISBN 978-0-262-01802-9.
42. Sedgewick, R.; Wayne, K. Algorithms, 4th ed.; Addison-Wesley Professional Press: Boston, MA, USA, 2011; ISBN 978-0321573513.
43. Sagonas, C.; Antonakos, E.; Tzimiropoulos, G.; Zafeiriou, S.; Pantic, M. 300 faces in-the-wild challenge: Database and results. Image Vis. Comput. 2016, 47, 3–18.
44. C++11 Implementation of the Supervised Descent Optimization Method. Available online: https://github.com/patrikhuber/superviseddescent (accessed on 19 August 2018).
45. 2D-FAN. Available online: https://www.adrianbulat.com/face-alignment/ (accessed on 19 August 2018).
46. Sagonas, C.; Tzimiropoulos, G.; Zafeiriou, S.; Pantic, M. A semi-automatic methodology for facial landmark annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013.
Figure 1. Face alignment example with the supervised descent method (SDM) algorithm [6]. Firstly, the face region is automatically detected in the image [11]. Then, features are localized within the detected face region. (Image courtesy of [12]).
Figure 2. Definition of the 68 facial landmarks. (Image courtesy of [38]).
Figure 3. The flow chart of the training algorithm of SDM.
Figure 4. Typical example of the estimated probability density function (PDF) for the elements in A1.
Figure 5. The flow chart of the proposed tightly-coupled training algorithm.
Figure 6. The diagram of the compressed data storage arrangement.
Figure 7. Normalized mean error loss against the standard SDM.
Figure 8. Face alignment results for different methods on the 300-W dataset [12]. The left column shows the results with our compressed trained model. The center column shows the results of the SDM algorithm with the uncompressed trained model [6]. The right column shows the results of 2D-FAN [22].
Figure 9. Normalized mean errors of SDM with our compressed trained model compared with SDM with the uncompressed trained model [6] and the deep learning based method [22] on the 300-W test set [12].
Figure 10. Normalized mean errors of SDM with our compressed trained model compared with SDM with the uncompressed trained model [6] and the deep learning based method [22] on a subset of the 300-W test set [12] whose yaw angles are between 0 and 30 degrees.
Figure 11. Comparison of the cumulative error distribution curves between SDM with our compressed trained model and SDM with the uncompressed trained model [6] on the 300-W test set [12].
Figure 12. Normalized mean errors of our proposed method versus the method without the tightly-coupled training algorithm.
Figure 13. Normalized mean errors of our proposed method versus the method without K-means clustering.
Figure 14. Sample AR (Augmented Reality) effects of the mobile application using our proposed compressed model version of SDM.
Table 1. Data ranges for a typical HoG-based 6-step regressor.
| Step Index | Ad Minimum Value | Ad Maximum Value | bd Minimum Value | bd Maximum Value |
|---|---|---|---|---|
| 1 | −0.0072 | 0.0075 | −0.2597 | 0.1381 |
| 2 | −0.0081 | 0.0075 | −0.1774 | 0.1308 |
| 3 | −0.0050 | 0.0060 | −0.1093 | 0.0644 |
| 4 | −0.0040 | 0.0042 | −0.0680 | 0.0400 |
| 5 | −0.0032 | 0.0037 | −0.0295 | 0.0283 |
| 6 | −0.0024 | 0.0031 | −0.0180 | 0.0157 |
Table 2. Comparisons of normalized mean errors of our proposed method versus the method with random initialization for the K-means algorithm.
| Face Parts | Our Proposed Method | The Method with Random Initialization (100 Trials) |
|---|---|---|
| Face contour | 0.0427 | 0.0568 ± 0.0047 |
| Eyebrows | 0.0407 | 0.0559 ± 0.0038 |
| Eyes | 0.0287 | 0.0401 ± 0.0032 |
| Nose | 0.0272 | 0.0396 ± 0.0035 |
| Mouth | 0.0359 | 0.0512 ± 0.0052 |
Table 3. The normalized mean error loss against the standard SDM for different choices of nb.
| nb | The Normalized Mean Error Loss against the Standard SDM |
|---|---|
| 4 | 0.0241 |
| 5 | 0.0109 |
| 6 | 0.00274 |
| 7 | 0.00253 |
| 8 | 0.00235 |
| 9 | 0.00212 |
| 10 | 0.00193 |
| 11 | 0.00180 |
| 12 | 0.00167 |
| 13 | 0.00151 |
| 14 | 0.00134 |
| 15 | 0.00123 |
| 16 | 0.00121 |
Table 4. The normalized mean error loss against the standard SDM for different choices of D.
| D | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| The normalized mean error loss against the standard SDM | 0.0122 | 0.00441 | 0.00372 | 0.00325 | 0.00291 | 0.00274 | 0.00269 | 0.00263 |
Table 5. Comparisons of average scores for the two methods.
| Index | Average Score of Uncompressed Trained Model SDM [6] | Average Score of Our Compressed Model Method |
|---|---|---|
| 1 | 4.10 | 4.05 |
| 2 | 3.80 | 3.70 |
| 3 | 4.15 | 4.10 |
| 4 | 3.95 | 4.00 |
| 5 | 4.40 | 4.45 |
| 6 | 3.85 | 3.75 |
Table 6. Extra computational cost for each part of the algorithm in the tightly-coupled training process.
| Module of the Algorithm | Average Time (ms) |
|---|---|
| Probability density-aware initialization | 12.3 |
| K-means clustering | 256.7 |
| Data quantization | 9.4 |
