Article

LCA-GAN: Low-Complexity Attention-Generative Adversarial Network for Age Estimation with Mask-Occluded Facial Images

Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(8), 1926; https://doi.org/10.3390/math11081926
Submission received: 20 February 2023 / Revised: 9 April 2023 / Accepted: 17 April 2023 / Published: 19 April 2023

Abstract

Facial-image-based age estimation is being increasingly used in various fields. Examples include statistical marketing analysis based on age-specific product preferences, medical applications such as beauty products and telemedicine, and age-based suspect tracking in intelligent surveillance camera systems. Masks are increasingly worn for hygiene, personal privacy concerns, and fashion. In particular, the acquisition of mask-occluded facial images has become more frequent due to the COVID-19 pandemic. These images cause a loss of important features and information for age estimation, which reduces the accuracy of age estimation. Existing de-occlusion studies have investigated masquerade masks that do not completely occlude the eyes, nose, and mouth; however, no studies have investigated the de-occlusion of masks that completely occlude the nose and mouth and its use for age estimation, which is the goal of this study. Accordingly, this study proposes a novel low-complexity attention-generative adversarial network (LCA-GAN) for facial age estimation that combines an attention architecture and conditional generative adversarial network (conditional GAN) to de-occlude mask-occluded human facial images. The open databases MORPH and PAL were used to conduct experiments. According to the results, the mean absolute error (MAE) of age estimation with the de-occluded facial images reconstructed using the proposed LCA-GAN is 6.64 and 6.12 years on MORPH and PAL, respectively. Thus, the proposed method yielded higher age estimation accuracy than when using occluded images or images reconstructed using the state-of-the-art method.

1. Introduction

In general, age, gender, expression, and race can be derived from facial appearances [1]. Age estimation is being increasingly used in diverse fields, such as statistical marketing analysis based on age-specific product preferences, medical fields such as the beauty industry and telemedicine, and age-based suspect tracking in intelligent surveillance camera systems [2]. Despite continuous efforts in research, including design of age estimation algorithms and models, data collection, system performance tests, and valid evaluation protocols, improving the accuracy of age estimation remains a challenge [3]. Age estimation is challenging because the human face is influenced by internal factors (size, wrinkles, shape, texture, race, etc.) and external factors (health, dietary habits, culture, environment, etc.), and these change over time through complex processes [4]. However, there are general and common features that can explain human facial aging [5]. Age estimation is divided into feature representation, extraction, and age learning stages. Previous studies used various handcrafted feature-based methods, but these require accurate expert prior knowledge, and there is no way to verify the accuracy of that prior knowledge [6]. Unlike conventional methods, a convolutional neural network (CNN) can extract clear and robust facial features and learn the age on its own [7]. Generally, a CNN for age estimation consists of a convolutional layer and a multi-layer perceptron (MLP). The convolutional layer extracts and represents age information features from the facial images, and the MLP estimates the age with the represented features. The loss function then calculates the distance between the estimated age and the age label, and backpropagation is performed. The whole process is performed automatically, and unlike conventional methods that rely on prior knowledge, it can utilize information that cannot be extracted with prior knowledge. It provides better age estimation performance than traditional methods, and numerous researchers are actively working to improve its results and accuracy.
Despite the advances in age estimation research, there are several problems when age estimation is applied practically. Facial images obtained in real unrestricted environments frequently have problems that degrade image quality related to resolution, illumination, noise, and occlusion. In particular, mask-occluded facial images have recently become frequent following the COVID-19 pandemic. Masks are worn more frequently because of hygiene, personal privacy concerns, and fashion. Mask-occluded facial images do not contain important features and information for age estimation, which reduces the accuracy of age estimation. Existing de-occlusion research has investigated masquerade masks that do not completely occlude the eyes, nose, and mouth [8,9,10,11], but no studies have investigated the de-occlusion of masks that completely occlude the nose and mouth, which is the goal of this study. To this end, this study proposes a novel low-complexity attention-generative adversarial network (LCA-GAN) for facial age estimation that combines an attention architecture and conditional generative adversarial network (conditional GAN) to de-occlude mask-occluded human facial images. Our main innovation lies in facial image de-occlusion. The present study differs from previous studies in four ways:
  • This is the first study of its kind on age estimation that considers the de-occlusion of facial images where the nose and mouth are completely occluded by a mask;
  • We propose a novel LCA-GAN for mask de-occlusion. LCA-GAN contains low-complexity attention blocks (LCABs) that reduce computation and complexity by combining down and upsampling with the attention module. LCAB comprises low-complexity channel attention (LCCA) and low-complexity spatial attention (LCSA), and it uses attention to assign weights based on the importance of features in channel and spatial dimensions;
  • To reconstruct the facial feature information lost by mask occlusion as much as possible in de-occlusion, edge loss and content loss in LCA-GAN were used;
  • The trained LCA-GAN, the CNN for age estimation, and the mask-occluded facial images generated for our experiments were published [12], enabling a fair performance comparison by other researchers.
The structure of this paper is as follows. Section 2 analyzes existing age estimation studies that have used facial images and de-occlusion methods. Section 3 explains the overall experimental method and LCA-GAN, the de-occlusion network proposed in this paper. Section 4 presents a comparison of the performance and age estimation results using de-occluded images between existing de-occlusion methods and LCA-GAN based on the MORPH and PAL databases. Finally, Section 5 concludes the paper.

2. Related Works

Facial images contain biological information with diverse attributes, such as race, gender, age, environment, and lifestyle. In [13], the distributions of these attributes and their averages were analyzed, and a method was presented to evaluate the bias of appropriate algorithms and databases. It has influenced various studies using human facial images [14,15,16]. However, it is difficult to investigate handcrafted feature-based age estimation, which requires consideration of diverse factors. Consequently, most age estimation studies have used CNNs since the emergence of deep learning. As shown in Table 1, age estimation methods are generally classified into five categories [6]. Multi-class classification yields high age estimation performance when using limited resources and data with a single model. Hybrid methods, which combine multiple methods, supplement the shortcomings of combined models and yield high age estimation performance for large quantities of data in environments with many available computing resources. To measure the age estimation accuracy, researchers have used mean absolute error (MAE) [17], exact accuracy [18], 1-off [19], and normal score (ϵ-error) [20], among which MAE is the most commonly used.
As listed in Table 1, age estimation studies have used images obtained in restricted environments, such as the MORPH [39] and FG-NET [40] databases, and those from unrestricted environments such as IMDB-WIKI [41], Adience [42], LAP2015 [43], LAP2016 [44], and CACD [45]. MORPH is a database of human facial mugshot images with resolutions ranging from 640 × 480 to 1024 × 768 pixels. It contains various attributes, such as gender, age, and race, and was acquired under restricted conditions of image resolution, illumination, and pose. FG-NET is a database of human facial images with gender and age information. The database collected facial images that satisfy specific conditions, such as image resolution and pose, from pictures of people with confirmed ages. In this case, the facial images that satisfy the conditions are similar to facial images acquired in a restricted environment. On the other hand, the unrestricted-environment databases were collected from the internet, magazines, films, etc. These unrestricted databases contain facial images with natural poses, various image resolutions and illumination changes, and occlusions with various objects. Age estimation research using images from restricted environments has yielded relatively low age estimation accuracy for occluded images. Despite the difficulty of age estimation due to occlusion, previous age estimation studies have not considered the de-occlusion of facial images occluded by a mask that completely covers the nose and mouth. In addition, all datasets in Table 1 have a very small number of masked face images. Therefore, for our experiments, we generated a large number of masked face images from the MORPH and PAL databases. Table 2 presents a comparison of studies that do and do not consider face occlusion with the proposed method.
A study [30] proposed recurrent age estimation (RAE), which combines inception-v4 [48] and long short-term memory networks (LSTM) [49]. It extracts features from facial images using inception-v4 and learns individual aging patterns with LSTM. To solve the problem of training overfitting, researchers have proposed label distribution learning (LDL), which uses the ambiguity between the label age and the predicted age. A study [38] proposed mixed attention-ShuffleNet-v2 (MA-SFV2), which combines mixed attention and ShuffleNet-v2 [50]. Classification, regression, and distribution methods were simultaneously applied to learning by transforming the output layer of the base model. In [10], CNN2ELM, an ensemble model combining CNN and extreme learning machine (ELM), was proposed for learning age. It achieved decent results in ChaLearn 2016 [51], a human age estimation competition. In [8], a deep expectation of apparent age (DEX) system was proposed, which uses the softmax expected value refinement of the VGG-16-based network [21]. More specifically, they defined the regression problem of age learning as a classification problem and performed age estimation by multiplying the age label and the class probability distribution, which is the last softmax output of VGG-16. A study [11] proposed the divide-and-rule architecture and AgeNet, a model based on the method of GoogleNet [52]. AgeNet serves as the feature extractor, and the divide-and-rule strategy treats age learning as an ordinal regression problem. As such, research on age estimation includes studies using restricted environment images [30,39,47,48], studies using unrestricted environment images [9,11], and studies using images of both environments [9,21,38]. This age estimation research includes many studies that excluded occlusion in restricted environments or ignored occlusion in unrestricted environments, so de-occlusion was not considered.
However, face occlusion frequently occurs in the real world and is challenging to solve through camera hardware. Most existing de-occlusion methods reconstruct synthesized images because of the lack of databases with non-occluded and occluded image pairs [53,54,55,56]. In [53], a two-stage occlusion-aware GAN was proposed, trained with 44 images occluded by sunglasses, hats, scarves, and phones with a random shape, location, and size in a face database, but it did not use facial mask images. This method removes occluded areas with the existing Pix2pix-based [57] GAN architecture and de-occludes the image using only information from non-occluded areas. However, this is an unrealistic occlusion condition, and while de-occlusion is successful for most objects, it fails for glasses and sunglasses. Previous research proposed a method to solve the face recognition problem associated with block occlusion of facial images by two robust feature-based representations, which were designed to fit the errors to a distribution described by a tailored loss function and the reduced rank structure of the errors relative to the image size [58]. The work in [59] proposed MRGAN, a GAN-based two-level network that removed the areas occluded by medical masks and reconstructed the removed areas. Stage one detected masks, and stage two performed de-occlusion. This method is based on a complex network and is computationally intensive. This experiment achieved good de-occlusion results but did not reconstruct color and detail information well in the occluded area. The work in [60] proposed a two-stage GAN-based method for de-occlusion of small objects in facial images, such as microphones. Steps one and two were trained similarly to a conventional GAN, but the de-occluded image from step one was used as an input to step two to generate a robust de-occluded image. The de-occlusion results showed that the texture and detail were de-occluded well, but the outline of the occluded object remained.
Moreover, a previous study [54] presented a two-stage method; in stage one, the occluded area is detected with an encoder–decoder structure and converted to a mask image as a pre-processing method, and in stage two, the occluded area is transformed through two conditional GAN [61] architectures. For supervised learning, face images were collected from the CelebA database, and the collected images were synthesized with the collected mask images occluding the eyes using Photoshop CC 2080 [54]. This method requires additional binary mask images to be trained, and the de-occlusion results fail on complex and detailed mask images. Another study [55] proposed Swap-R&R, which compensates for the lack of paired databases. This method shows very robust de-occlusion performance with facial images occluded by glasses, sunglasses, makeup, and headsets, but it requires an additional 3D face reconstruction network in training and testing, and the network computation is very large [55]. In [56], identity-preserved, de-occluded facial images were reconstructed by a CNN supervised with identity labels. By using an additional channel for occlusion detection, a mask for occlusion is computed as a pre-processing method and combined with the reconstructed face. This experiment collected frontal face images between −45 and 45 degrees from the CASIA-WebFace database. The collected face images are used to synthesize multiple objects for supervised learning. There are more than 100 templates for occluding objects such as masks, glasses, and hands [56], and among them, objects such as glasses and masks require accurate position information. Objects that require such location information are synthesized directly using an image program, while other objects are synthesized randomly. However, they only deal with grayscale facial images and produce results including artifacts. Though de-occlusion research has been conducted on masquerade masks that do not completely occlude the eyes, nose, and mouth [54], no studies have investigated de-occlusion-based age estimation of masks that completely occlude the nose and mouth, which is the goal of this study. To solve the aforementioned problems, this study proposes LCA-GAN-based facial mask de-occlusion and an age estimation method using this technique.

3. Proposed Methods

3.1. Overview of Suggested Method

Figure 1 illustrates the entire procedure of robust age estimation for mask-occluded facial images proposed in this study.
Stages one and two entail pre-processing. First, the positions of the face and eyes are detected, and based on the detected positions, we execute the compensation of in-plane rotation, and the face region of interest (ROI) is re-defined. Following pre-processing, the mask-occluded image is de-occluded using the proposed LCA-GAN. The CNN-based age estimation method is then adopted to predict the age of the person in the de-occluded image.

3.2. Pre-Processing

As shown in Figure 2, the input images are processed. For pre-processing, this study used the dlib facial feature tracker [62], which extracted features with histogram of oriented gradients (HOG) and trained a linear classifier on the extracted information. It does not use any parameters or thresholds except for the upsample_num_time option, which is used for iterating the detection while scaling the input image, and we used the default value of one in our experiment. It was used on original face images of different sizes to detect face landmarks and the face box region. First, the dlib facial feature tracker for facial feature points [62] is used to locate the positions of the eyes, as shown in Figure 2b. Using Equation (1) based on the positions of both detected eyes, in-plane rotation compensation is performed, as shown in Figure 2c. In-plane rotation compensation is performed based on the center position between both eyes (green dot between the eyebrows in Figure 2b).
$\theta = \tan^{-1}\left(\frac{R_y - L_y}{R_x - L_x}\right)$  (1)
$(R_x, R_y)$ and $(L_x, L_y)$ represent the x- and y-axis positions of the centers of the detected right and left eyes, respectively. In the in-plane rotation compensated image, the face region is found using the dlib facial feature tracker, as shown in Figure 2c, and this face region becomes the face ROI, as shown in Figure 2d. To use the face ROI as an input for LCA-GAN, it is resized to 256 × 256 × 3 by bilinear interpolation.
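To make this alignment step concrete, the following is a minimal sketch of the in-plane rotation compensation of Equation (1) and the ROI resizing using OpenCV; the eye centers and face box are assumed to come from the dlib facial feature tracker, and the helper shown here is an illustration rather than our exact implementation.

```python
import math
import cv2

def align_and_crop(image, right_eye, left_eye, face_box, out_size=256):
    """Rotate the image so both eyes lie on a horizontal line (Equation (1)),
    then crop the face ROI and resize it with bilinear interpolation.
    right_eye/left_eye are (x, y) eye centers; face_box is (x, y, w, h)."""
    (r_x, r_y), (l_x, l_y) = right_eye, left_eye
    # Equation (1): theta = arctan((R_y - L_y) / (R_x - L_x))
    theta = math.degrees(math.atan2(r_y - l_y, r_x - l_x))
    # Rotate about the midpoint between the eyes (green dot in Figure 2b)
    center = ((r_x + l_x) / 2.0, (r_y + l_y) / 2.0)
    rot = cv2.getRotationMatrix2D(center, theta, 1.0)
    h, w = image.shape[:2]
    aligned = cv2.warpAffine(image, rot, (w, h))
    # The face ROI would normally be re-detected on the aligned image with dlib;
    # here the supplied box is simply cropped for illustration.
    x, y, bw, bh = face_box
    roi = aligned[y:y + bh, x:x + bw]
    return cv2.resize(roi, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```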

3.3. De-Occlusion of Masked Facial Image by LCA-GAN

General image quality problems that distort image information include low resolution, blur, low illumination, noise, etc. However, regarding the occlusion problem, the occluding object reduces image information. Hence, unlike image quality problems where information can be directly extracted from the distorted area, for the occlusion problem, it is more difficult to directly extract information from the occluded area. Consequently, it is difficult to learn the mapping from occluded to de-occluded images using a CNN. To solve this problem, this study proposes LCA-GAN based on adversarial learning. Figure 3 shows the architecture of LCA-GAN. The generator is based on U-net [63]; by using LCAB, which combines up and downsampling with channel and spatial attention, it reduces complexity and computation. For the discriminator, a patch discriminator is used to output with a size of 30 × 30 × 1. It obtains probabilities for each of the 30 × 30 patches and determines whether each of the 900 local patch areas is a real or fake image. Finally, it averages all 900 probabilities to finally determine if the entire global image is real or fake.

3.3.1. Generator

Figure 3a shows the generator for de-occlusion in this study, which uses the U-net [63] architecture comprising an encoder–decoder and skip connection. In the encoder–decoder used for U-net's continuous down and upsampling, the encoder extracts features, and the decoder learns the mapping for image patches corresponding to the extracted features. Additionally, it concatenates high-stage encoder block features with low-stage decoder block features to compensate for lost high-level information and to balance high- and low-level information. However, this architecture is inefficient for the occlusion problem, where it is difficult to directly extract information from the occluded area. To solve this problem, this study proposes LCAB, which combines an attention mechanism [64] with down and upsampling. LCAB is composed of LCCA and LCSA; attention is used to assign weights according to the importance of features in the channel and spatial dimensions. The next subsection describes the LCAB architecture in detail. When de-occluding the occluded area, the generator uses edge loss to create a detailed and sharp image as well as content loss to maintain the information of the non-occluded area and the texture in the target image. In this de-occlusion process, the $L_1$ loss function was used for identity loss to maintain the information of the original image. Table 3 presents the overall architecture of the generator. As shown in Table 3, an image of 256 × 256 × 3 including the mask-occluded area is input to the generator of LCA-GAN, and a feature map of 256 × 256 × 64 is obtained via Convolution layer 1 and Spatial Attention. Then, by passing through LCAB 1~5 (including LCCA and LCSA), the feature map of 8 × 8 × 512 is obtained as the final output of the encoder. This feature map again passes through the decoder of LCAB 6~10 (including LCCA, Concatenation, and LCSA, except for LCAB 10, which includes only LCCA and LCSA), and the upsampled feature map of 256 × 256 × 64 is obtained. Then, this feature map passes through Convolution layer 2 and a Tanh activation layer, and the final generated (mask-de-occluded) image of 256 × 256 × 3 is obtained as the output of the decoder.
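The flow in Table 3 can be summarized with the following Keras-style sketch; `lcab_down` and `lcab_up` stand in for the LCAB encoder and decoder blocks (a sketch of their internals is given in Section 3.3.2), and the filter counts follow the feature-map sizes quoted above rather than a released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(lcab_down, lcab_up, img_shape=(256, 256, 3)):
    """U-net-style generator sketch: Convolution layer 1, five LCAB encoder
    blocks, five LCAB decoder blocks (LCAB 6~9 with skip concatenation,
    LCAB 10 without), and Convolution layer 2 with a tanh output."""
    inp = layers.Input(shape=img_shape)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)   # 256 x 256 x 64
    skips = []
    for filters in (64, 128, 256, 512, 512):        # LCAB 1~5: 256 -> 128 -> 64 -> 32 -> 16 -> 8
        skips.append(x)                             # feature kept for the matching decoder stage
        x = lcab_down(x, filters)                   # encoder output ends at 8 x 8 x 512
    decoder_skips = list(reversed(skips[1:])) + [None]   # LCAB 10 has no concatenation (Table 3)
    for filters, skip in zip((512, 256, 128, 64, 64), decoder_skips):  # LCAB 6~10
        x = lcab_up(x, filters)                     # upsample by 2
        if skip is not None:
            x = layers.Concatenate()([x, skip])
    out = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)    # 256 x 256 x 3 output
    return tf.keras.Model(inp, out)
```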

3.3.2. The Structure of LCAB

The attention mechanism abstracts the importance between the modalities of the represented features to incorporate high-level information. Representative attention methods used in images include spatial attention and channel attention [64]. Channel and spatial attention determine which features are important in the channel and spatial dimensions, respectively, and assign corresponding weights. In this study, channel and spatial attention were used to detect and de-occlude mask-occluded areas. Moreover, we proposed LCAB, which combines down and upsampling with the attention process to reduce the complexity and computation that increases due to attention and continuous processes after down or upsampling. LCAB comprises LCCA and LCSA, the structures of which are illustrated in Figure 4 and Figure 5.
Channel attention has two convolutional layers, average pooling, three multi-layered perceptrons (MLPs), a sigmoid layer, multiplication with convolutional layer features, and addition with input features passed through skip-connection. The importance of channel dimension features is arranged using the two-stage convolution layer to re-represent the input-represented features. The features are then compressed by global average pooling in the spatial space, and MLP is used to calculate the importance of modalities between the features in the channel space. Here, the first and third layers of MLP are equal to the channel dimension size of the input features, and the second layer is 1/4 the channel dimension size of the input features. Using sigmoid activation, attention is constructed from the importance of these arranged features. Following the convolutional layer, it is multiplied with the features and then added to the input features with high-level information. This study proposes LCCA, which combines down and upsampling with channel attention. In the two-stage convolutional layer, the first stage uses a 4 × 4 size filter to scale the spatial space by 2 in upsampling and by 1/2 in downsampling. The second stage uses a 3 × 3 size filter to maintain the size of the spatial space. Moreover, the deformation of high-level information is minimized using bilinear interpolation, and the skip-connection of the input features is passed.
As shown in Figure 5, spatial attention is composed of two convolutional layers, global average pooling, a convolutional layer, a sigmoid layer, multiplication with convolutional layer features, and addition with input features passed through skip-connection. The importance of spatial dimension features is arranged using the two-stage convolution layer to re-represent the input-represented features. The features are then compressed by global average pooling in the channel space, and the importance of modalities between the features in the spatial space is calculated through the convolutional layer. Using sigmoid activation, attention is created from the importance of these arranged features. Following the convolutional layer, it is multiplied with the features and then added to the input features with high-level information. This study proposes LCSA, which combines down and upsampling with spatial attention. The two-stage convolutional layer maintains the spatial space using a 3 × 3 size filter but adjusts the channel dimension to the desired size. After global average pooling, the convolutional layers use a 7 × 7 size filter to calculate the importance of modalities between features in the spatial space and then represent them as probability values using the sigmoid activation layer. Additionally, the size of the channel space of the input features is adjusted using bilinear interpolation, the deformation of high-level information is minimized, and the information is passed. Through LCCA and LCSA, channel and spatial attention preserve the original goal of representing the importance of the represented features from the perspectives of “what” and “where”, combining the down and upsampling processes. Moreover, they reduce computation and complexity caused by the use of continuous down and upsampling and attention mechanisms, and they are effective for the de-occlusion process.
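As an illustration of how the two attention paths can be combined with resampling, the following Keras-style sketch follows the description above; the 4 × 4 and 3 × 3 convolutions, the 1/4-width MLP layer, and the 7 × 7 spatial convolution are taken from the text, while the 1 × 1 convolutions used to match channel widths on the skip paths are an assumption, so this is a reconstruction rather than the released LCAB code. The `lcab_down` and `lcab_up` helpers match the names assumed in the generator sketch of Section 3.3.1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def lcca(x, filters, down=True):
    """Low-complexity channel attention: two convolutions (the first resamples the
    spatial size by a factor of 2), global average pooling, a three-layer MLP whose
    middle layer is 1/4 of the channel width, sigmoid gating, and a bilinearly
    resampled skip connection added back to the gated features."""
    conv = layers.Conv2D(filters, 4, strides=2 if down else 1, padding="same")(x)
    if not down:
        conv = layers.UpSampling2D(2, interpolation="bilinear")(conv)
    conv = layers.Conv2D(filters, 3, padding="same")(conv)
    w = layers.GlobalAveragePooling2D()(conv)
    w = layers.Dense(filters, activation="relu")(w)
    w = layers.Dense(filters // 4, activation="relu")(w)
    w = layers.Dense(filters, activation="sigmoid")(w)
    gated = layers.Multiply()([conv, layers.Reshape((1, 1, filters))(w)])
    skip = layers.Resizing(conv.shape[1], conv.shape[2], interpolation="bilinear")(x)
    skip = layers.Conv2D(filters, 1, padding="same")(skip)   # match the channel width (assumed)
    return layers.Add()([gated, skip])

def lcsa(x, filters):
    """Low-complexity spatial attention: two 3x3 convolutions adjusting the channel
    width, channel-wise average pooling, a 7x7 convolution with sigmoid gating, and
    a channel-adjusted skip connection."""
    conv = layers.Conv2D(filters, 3, padding="same")(x)
    conv = layers.Conv2D(filters, 3, padding="same")(conv)
    pooled = tf.reduce_mean(conv, axis=-1, keepdims=True)     # compress the channel space
    w = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(pooled)
    gated = layers.Multiply()([conv, w])
    skip = layers.Conv2D(filters, 1, padding="same")(x)       # adjust channels of the skip path
    return layers.Add()([gated, skip])

def lcab_down(x, filters):   # encoder block: LCCA (downsampling) followed by LCSA
    return lcsa(lcca(x, filters, down=True), filters)

def lcab_up(x, filters):     # decoder block: LCCA (upsampling) followed by LCSA
    return lcsa(lcca(x, filters, down=False), filters)
```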

3.3.3. Discriminator

LCA-GAN uses a patch discriminator. It receives the fixed input image paired with either the target image or the generated output image and concatenates them. Convolution is then performed, and each grid cell of the extracted features has a receptive field determined by the computational structure. In this study, each 1 × 1 × 1 grid cell of the final output, which has a size of 30 × 30 × 1, has a receptive field of 70 × 70; the individual grid cells judge local areas, and the global image is judged with the average of all grid cells. In this process, identity loss and content loss are used to maintain the continuity of information in the input image. Table 4 lists the detailed structure of the discriminator.
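A minimal sketch of a PatchGAN-style discriminator consistent with the description above is given below; the layer widths are assumptions, but the output is a 30 × 30 × 1 grid of patch probabilities whose cells each cover a 70 × 70 receptive field.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patch_discriminator(img_shape=(256, 256, 3)):
    """The conditioning (mask-occluded) image and the target or generated image are
    concatenated and reduced to a 30x30x1 grid of real/fake probabilities; averaging
    the 900 grid cells gives the global real/fake decision."""
    cond = layers.Input(shape=img_shape)      # mask-occluded input image
    img = layers.Input(shape=img_shape)       # target image or generated output image
    x = layers.Concatenate()([cond, img])                            # 256 x 256 x 6
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)                                 # -> 128, 64, 32
    x = layers.ZeroPadding2D()(x)                                    # 34 x 34
    x = layers.Conv2D(512, 4, strides=1, padding="valid")(x)         # 31 x 31
    x = layers.LeakyReLU(0.2)(x)
    x = layers.ZeroPadding2D()(x)                                    # 33 x 33
    patch = layers.Conv2D(1, 4, strides=1, padding="valid",
                          activation="sigmoid")(x)                   # 30 x 30 x 1 patch probabilities
    return tf.keras.Model([cond, img], patch)
```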
The LCA-GAN proposed in this study performs learning using pairs of images comprising the mask-occluded image (input image) and the original un-occluded image (target image). The conditional GAN [57] learns the mapping using the loss function in Equation (2), which receives an input image $I_{In}$ and generates an output image $I_{Out}$ that is similar to the target image $I_{Target}$.
$L_{GAN}(G, D) = \mathbb{E}_{I_{In}, I_{Target}}\left[\log D(I_{In}, I_{Target})\right] + \mathbb{E}_{I_{In}}\left[\log\left(1 - D(I_{In}, G(I_{In}))\right)\right]$  (2)
This adversarial learning method generates smooth pixel-wise images [65]. In this study, the generator used content loss to maintain the texture of the original face when de-occluding the mask-occluded area. For this purpose, the $L_2$ loss function of Equation (3) is applied to VGG-16 (pre-trained with ImageNet) to measure the dissimilarity of $I_{Out}$ and $I_{Target}$.
$L_{cont} = \mathbb{E}_{I_{Out}, I_{Target}}\left[\left\| I_{Target} - I_{Out} \right\|_2\right]$  (3)
The proposed LCA-GAN preserves the area other than the mask-occluded area in the mask-occluded image and performs de-occlusion. The discriminator concatenates pairs of $I_{In}$ and $I_{Out}$ or $I_{In}$ and $I_{Target}$, which enables learning to reinforce high-frequency information in the mask-occluded area of $I_{Target}$. This is accomplished by applying the edge loss function in Equation (4) to generate a detailed de-occluded image. $\Delta$ denotes the Laplacian operation, and $\varepsilon$ is $10^{-3}$.
$L_{Edge} = \mathbb{E}_{I_{Out}, I_{Target}}\left[\sqrt{\left\| \Delta I_{Target} - \Delta I_{Out} \right\|^2 + \varepsilon^2}\right]$  (4)
Rather than learning the data distribution of $I_{Target}$, adversarial learning sometimes tends strongly toward merely making the discriminator judge the generated image as real. To prevent this and preserve the identity of the image, we added identity loss, which uses the $L_1$ loss function, as shown in Equation (5).
$L_{Iden} = \mathbb{E}_{I_{Out}, I_{Target}}\left[\left\| I_{Target} - I_{Out} \right\|_1\right]$  (5)
Our final loss function is given in Equation (6). Using the training data, 1.5, 2, and 2 were determined as the optimal values of $\lambda_1$, $\lambda_2$, and $\lambda_3$, respectively, to obtain the best age estimation accuracy.
$L_{LCA} = \arg\min_{G}\max_{D} L_{GAN}(G, D) + \lambda_1 L_{cont} + \lambda_2 L_{Edge} + \lambda_3 L_{Iden}$  (6)
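To make Equations (2)–(6) concrete, the following sketch computes the generator-side objective with λ1 = 1.5 and λ2 = λ3 = 2 as stated above; the `vgg_features` extractor for the content loss and the discrete Laplacian kernel for the edge loss are implementation assumptions.

```python
import tensorflow as tf

LAMBDA_CONT, LAMBDA_EDGE, LAMBDA_IDEN = 1.5, 2.0, 2.0
EPS = 1e-3

# 3x3 discrete Laplacian applied per channel for the edge loss of Equation (4)
_lap = tf.constant([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
_lap_kernel = tf.tile(tf.reshape(_lap, (3, 3, 1, 1)), (1, 1, 3, 1))

def laplacian(img):
    return tf.nn.depthwise_conv2d(img, _lap_kernel, strides=[1, 1, 1, 1], padding="SAME")

def generator_loss(disc_fake, out_img, target_img, vgg_features):
    """Adversarial + content + edge + identity terms of Equation (6), generator side."""
    adv = tf.reduce_mean(                      # generator's part of the GAN loss in Equation (2)
        tf.keras.losses.binary_crossentropy(tf.ones_like(disc_fake), disc_fake))
    cont = tf.reduce_mean(tf.square(vgg_features(target_img) - vgg_features(out_img)))  # Eq. (3)
    edge = tf.reduce_mean(tf.sqrt(             # Equation (4), Charbonnier-style edge penalty
        tf.square(laplacian(target_img) - laplacian(out_img)) + EPS ** 2))
    iden = tf.reduce_mean(tf.abs(target_img - out_img))          # Equation (5), L1 identity loss
    return adv + LAMBDA_CONT * cont + LAMBDA_EDGE * edge + LAMBDA_IDEN * iden
```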

3.4. Age Estimator

The DEX model [8] was used to predict the age of de-occluded facial images with LCA-GAN, which exhibited good results in the Looking at People (LAP) 2015 [66] competition and previous research results on age estimation accuracy [67]. DEX is an age estimation model based on VGG-16 [68], an existing classification network. For age estimation, VGG16 pre-trained on ImageNet was additionally pre-trained using the IMDB and WIKI databases. Since human aging typically involves sequential changes over time, the similarity between adjacent classes in DEX is high, so the probability of the trained model is considered to have a normal distribution. Therefore, rather than estimating the class label showing the highest probability score as age, the age was predicted as the product of the class label and probability value, as shown in Equation (7).
$\mathrm{Estimated\ age}(I) = \sum_{i=1}^{n} l_i\, p_i$  (7)
where $I$ denotes the input facial image, $n$ indicates the number of classes, $l_i$ denotes the $i$th class label, and $p_i$ corresponds to the $i$th output probability value. The detailed age estimation methods of DEX are illustrated in Figure 6.
DEX [8] is a VGG-16-based network with a convolution layer, which consists of a convolution filter, batch normalization, a ReLU activation function, two MLP layers with 4096 nodes, and an output MLP layer equal to the size of the age class labels. The convolutional layers extract features with age information and represent the features for age learning. The MLP layers learn age from the represented features. The last MLP layer outputs a probability value through a softmax activation function. Finally, to improve the age estimation performance, DEX applies the age estimation method described in Equation (7). In this experiment, we used the $L_2$ loss function in DEX to emphasize the relation of close age classes, thus making the age estimation method using Equation (7) more robust.
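The expected-value refinement of Equation (7) reduces to a dot product between the age labels and the softmax output, as the following sketch with a hypothetical output distribution illustrates.

```python
import numpy as np

def expected_age(softmax_probs, class_labels):
    """Equation (7): estimated age = sum_i l_i * p_i over the softmax output,
    instead of taking the argmax class label."""
    return float(np.dot(class_labels, softmax_probs))

# Example with a hypothetical softmax output over age labels 0..100
labels = np.arange(0, 101)
probs = np.zeros(101)
probs[[29, 30, 31]] = [0.2, 0.5, 0.3]
print(expected_age(probs, labels))   # 30.1, slightly refined from the argmax label 30
```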

4. Experimental Results

4.1. Data and Environment for Experiments

As shown in Figure 7, this study used MORPH [39] and PAL [69] as the databases to de-occlude mask-occluded facial images. Given the lack of open databases of mask-occluded facial images obtained in real environments that include age information, we generated mask-occluded facial images, as shown in Figure 8, by overlaying directly acquired background-free mask images onto facial images from MORPH and PAL, which are existing human facial databases. Figure 8a shows the original facial image, and Figure 8b shows the mask image with no background. Subsequently, in the facial image, the dlib facial feature tracker [62] explained in Section 3.2 detects the eye area, as shown in Figure 8c. Using the center position of the eyes, the method described in Section 3.2 is used to perform in-plane rotation, as shown in Figure 8d; in this image, the dlib tracker for facial feature points finds the position of the face landmark corresponding to the annotated point of the mask image. Based on this position information, the annotated mask image is geometrically transformed and warped, as shown in Figure 8e. As shown in Figure 8f, the transformed mask image is then overlaid on the aligned facial image, and the face ROI is detected again using the dlib tracker. The ROI is re-defined with the detected face ROI, and the final face ROI image of the mask-occluded image is created, as shown in Figure 8g.
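A minimal sketch of the mask overlay step described above is given below; the homography-based warp and the alpha blending are assumptions standing in for the annotated-point correspondence and compositing actually used.

```python
import cv2
import numpy as np

def overlay_mask(face_bgr, mask_bgra, mask_pts, face_landmark_pts):
    """Warp an annotated, background-free mask image (with an alpha channel) onto
    the corresponding facial landmarks and alpha-blend it over the aligned face.
    mask_pts and face_landmark_pts are matching Nx2 point sets (assumed)."""
    h, w = face_bgr.shape[:2]
    H, _ = cv2.findHomography(np.float32(mask_pts), np.float32(face_landmark_pts))
    warped = cv2.warpPerspective(mask_bgra, H, (w, h))
    alpha = warped[:, :, 3:4].astype(np.float32) / 255.0      # mask transparency
    blended = warped[:, :, :3].astype(np.float32) * alpha \
              + face_bgr.astype(np.float32) * (1.0 - alpha)
    return blended.astype(np.uint8)
```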
The experiments were performed with two-fold cross validation, and in each fold, we used 5% of the training images as a validation set. To locate the face ROI, we used Python (version 3.5.2) [70] and OpenCV (version 4.2.0) [71]. The specification of the desktop computer for our experiments is as follows: 3.5 GHz CPU (Intel® Core™ i7-3770K), 24 GB RAM, Windows with TensorFlow (version 2.2.0) [72], and an Nvidia graphics processing unit (GPU) card (Nvidia GeForce GTX 1070 [73]).

4.2. Training of LCA-GAN for Masked Image De-Occlusion and CNN for Age Estimation

The LCA-GAN proposed in this study performs learning using the mask-occluded facial image as the input image and the original facial image without a mask as the target image. During training, through online augmentation, the input images were resized to 286 × 286 × 3 and then randomly cropped to 256 × 256 × 3. The adaptive moment estimation (Adam) optimizer [74] was used during training, with a learning rate of 0.0002, beta_1 of 0.5, and beta_2 of 0.999. Training was conducted for 100 epochs; Figure 9 shows the training and validation loss graphs of the generator and discriminator of LCA-GAN. It is evident that the generator and discriminator converged, indicating that the training data were sufficiently learned. For validation loss, the results of the generator and discriminator converged, indicating that LCA-GAN was not overfitted to the training data. In the case of a GAN, mode collapse usually occurs when the generator maps different inputs from the training set to the same or very similar outputs. The discriminator and generator should learn together and interact with each other, but when one becomes too well trained (learning imbalance), mode collapse occurs [75]. As shown in Figure 9, the discriminator becomes too well trained compared to the generator in our experiment, which is usually the case for a conventional GAN [75,76,77]. Therefore, although the mouth areas are different in the input images, those in the generated output images are somewhat similar, which represents a small level of mode collapse. However, other areas of the face in the generated output images were different according to our LCA-GAN, which confirms that overfitting and mode collapse were not severe in our LCA-GAN.
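As a concrete illustration of the optimizer and augmentation settings described at the start of this subsection, a minimal sketch is given below; the paired-crop trick is an assumption about how the input and target images are kept aligned.

```python
import tensorflow as tf

def augment(masked_img, target_img):
    """Online augmentation: resize to 286x286 and take a random 256x256 crop
    applied identically to the masked input and its un-occluded target."""
    pair = tf.concat([masked_img, target_img], axis=-1)   # stack so the crop stays aligned
    pair = tf.image.resize(pair, (286, 286))
    pair = tf.image.random_crop(pair, size=(256, 256, 6))
    return pair[..., :3], pair[..., 3:]

# Adam with learning rate 0.0002, beta_1 = 0.5, beta_2 = 0.999, trained for 100 epochs
gen_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
disc_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
EPOCHS = 100   # the DEX age estimator is later trained for 200 epochs with the same optimizer
```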
The slow convergence of the validation loss of the generator in Figure 9 was due to the insufficient number of images in the MORPH database. In addition, considering the age labels ranging from 16 to 77 years old, the various ethnicities, and the unbalanced gender ratio in the MORPH database, the distribution of the data is very complex and unbalanced. These factors create a condition in which the validation loss, computed on only 5% of the training data, can converge late. In this case, the difficulty of learning generally increases, and overfitting, underfitting, and divergence of losses are likely to occur. We described our training and validation sets according to the distribution of our experimental database in Table 5. As shown in Table 5, there exists a class imbalance in some age ranges, races, and genders. To solve this problem, we increased the training and validation data by data augmentation that included translation, cropping, and horizontal flipping for classes whose numbers of images were much smaller than those of other classes (e.g., the images of ages from 56~65 and 66~77). From that, we could make the images of each class distributed equally by gender, age, and race for training and validation. In our experiment, the convergence of the validation loss was slightly delayed, as shown in Figure 9, but the result was well learned without overfitting the training data.
The reason why the validation loss of the discriminator increased after 10 epochs is as follows. The Adam optimizer used in this experiment has various advantages, but it has the disadvantages of poor conditioning problems and slow initial learning speed that depends on the size of the database, the value of the hyperparameter, and the loss function used [74]. In Figure 9, the slightly slower convergence speed of the generator loss is likely due to the initially slow learning speed of the Adam optimizer, while the relatively fast convergence of the discriminator loss is due to the fact that it is relatively easier to learn than the generator [76]. However, the brief increase in discriminator loss is a result of the poor conditioning problem mentioned above. The first and second moments used by the Adam optimizer are the mean of the sample mean and sample square of the input data, respectively. In this experiment, mean squared error (MSE) is used as the content loss function, which causes a poor conditioning problem in the second moment during back-propagation. Several studies have proposed methods to solve this problem, and in our paper, we applied L_2 regularization [78], used the largest possible batch size in the experimental environment [79], used a learning rate of 0.0002, which is smaller than the 0.001 usually used for the Adam optimizer [80], and applied a weight decay every 10 epochs [78]. Therefore, the discriminator loss in Figure 9 increases slightly after 10 epochs, decreases again after 50 epochs, and gradually converges, which can be seen as a good response to the problem of the Adam optimizer in our experiment.
Subsequently, the images de-occluded using LCA-GAN were learned by DEX [8], an age estimation CNN model. The same random cropping outlined above was applied through online augmentation, and learning was conducted for 200 epochs using the Adam optimizer, with a learning rate of 0.0002, beta_1 of 0.5, and beta_2 of 0.999 [74]. Figure 10 illustrates the training loss and accuracy graphs of DEX as well as the validation loss and accuracy graphs of DEX. The convergence of the training loss and accuracy graphs demonstrates that the DEX age estimator was sufficiently trained on the de-occluded training data generated by LCA-GAN. Moreover, the convergence of the validation loss and accuracy graphs demonstrates that the DEX age estimator was not overfitted to the de-occluded training data generated by LCA-GAN.

4.3. Testing with MORPH Database

4.3.1. Comparisons of the Quality of Images Generated by Proposed Method and State-of-the-Art Methods

To compare the performance of the de-occlusion model for mask-occluded facial images in this experiment with other models, the structural similarity index measure (SSIM) [81] and peak signal-to-noise ratio (PSNR) [82] were used to measure the similarity between the original image and the generated de-occluded image. The mean squared error (MSE) is defined in Equation (8), SSIM is expressed in Equation (9), and PSNR is expressed in Equation (10). Larger values of both SSIM and PSNR indicate better performance of the de-occlusion model.
$\mathrm{MSE} = \frac{1}{WH}\sum_{i=0}^{W-1}\sum_{j=0}^{H-1}\left(I_o(i,j) - I_d(i,j)\right)^2$  (8)
$\mathrm{SSIM} = \frac{\left(2\mu_d\mu_o + C_1\right)\left(2\sigma_{do} + C_2\right)}{\left(\mu_d^2 + \mu_o^2 + C_1\right)\left(\sigma_d^2 + \sigma_o^2 + C_2\right)}$  (9)
$\mathrm{PSNR} = 10\log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right)$  (10)
$I_o$ represents the original image, and $I_d$ represents the mask de-occluded image. Moreover, $W$ and $H$ denote the width and height of the image, respectively. $\mu_o$ and $\sigma_o$ indicate the mean and standard deviation of the pixel values of the original image, respectively. $\mu_d$ and $\sigma_d$ indicate the mean and standard deviation of the pixel values of the mask de-occluded image, respectively, and $\sigma_{do}$ denotes the covariance of the two images. $C_1$ and $C_2$ correspond to positive constant offsets.
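For reference, Equations (8)–(10) can be evaluated directly as in the NumPy sketch below; the constants C1 and C2 follow the common choices (0.01 × 255)² and (0.03 × 255)², which is an assumption, and SSIM is computed globally rather than over local windows.

```python
import numpy as np

def psnr(original, deoccluded):
    """Equations (8) and (10): PSNR in dB between two 8-bit images."""
    mse = np.mean((original.astype(np.float64) - deoccluded.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def ssim_global(original, deoccluded, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Equation (9) evaluated over the whole image (a simplified, global form)."""
    o = original.astype(np.float64)
    d = deoccluded.astype(np.float64)
    mu_o, mu_d = o.mean(), d.mean()
    var_o, var_d = o.var(), d.var()
    cov = ((o - mu_o) * (d - mu_d)).mean()
    return ((2 * mu_d * mu_o + c1) * (2 * cov + c2)) / \
           ((mu_d ** 2 + mu_o ** 2 + c1) * (var_d + var_o + c2))
```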
As presented in Table 6, Pix2pix [57] and MPRNet [83] yielded the best performance for SSIM and PSNR according to the de-occlusion results, while the proposed LCA-GAN exhibited the fourth and third highest performance for SSIM and PSNR, respectively. Nevertheless, SSIM and PSNR are values that represent the image quality after de-occlusion; the primary goal of this study is to improve age estimation accuracy (not image quality) through de-occlusion. As presented in Section 4.3.2 and Section 4.4.2, the proposed LCA-GAN yielded the highest age estimation accuracy.

4.3.2. Comparisons of Age Estimation Accuracy

Ablation Studies

As shown in Equation (11), we used the mean absolute error (MAE), the most frequently applied metric [85,86], to evaluate age estimation accuracy. A lower MAE value indicates better age estimation performance.
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| p_i - y_i \right|$  (11)
In the above equation, $n$ denotes the number of images, $p_i$ denotes the predicted age, and $y_i$ denotes the ground-truth age. For the first ablation study, we compared the age estimation accuracy of the de-occluded images depending on the application of LCCA and LCSA, which form LCAB, and of the edge and content losses in the proposed LCA-GAN. According to the results in Table 7, the proposed LCA-GAN yields the highest age estimation performance on the generated de-occluded images when using LCCA, LCSA, and the edge and content losses together.
We performed an experiment to conduct an additional ablation study, as shown in Table 8. We compared the age estimation accuracy when applying various backbone models as the LCA-GAN generator. The last row in Table 8 is the method that subtracts the original image and masked image, concatenates the input image with the image where only the occluded area remains, and uses this input image in Pix2pix learning. Evidently, the best age estimation performance was achieved when using the Pix2pix backbone generator in LCA-GAN. In addition, we performed additional comparisons with the accuracy of the state-of-the-art age estimation method in Table 8. As shown in Table 8, the MAEs of the state-of-the-art age estimation method with the original non-occluded and mask-occluded face images were 5.80 years (baseline 1) and 10.45 years (baseline 2), respectively. Although the MAE by our LCA-GAN is 6.64 years, it is much lower than that with mask-occluded face images without our LCA-GAN (baseline 2), which confirms the effectiveness of our proposed LCA-GAN.
Figure 11 shows examples of the generated images according to the ablation study in Table 8. In Table 8, the age estimation performance of baseline 1 using non-occluded facial images and baseline 2 using mask-occluded facial images (occluding the nose and mouth) were 5.80 and 10.45 years, respectively, which is a difference of 4.65 years, indicating that the nose and mouth carry important information for age estimation. In addition, the areas in the facial image that have important information for age estimation are shown, with significant activation in the nose and mouth. Figure 11a,b show the masked images and original images, and Figure 11c–f display the de-occluded images processed in the order in Table 8. According to the results in Figure 11, the proposed LCA-GAN using only Pix2pix yielded the best de-occlusion performance. The de-occlusion networks including Pix2pix and CycleGAN used in Table 8 and Figure 11, except for U-net, are adversarial networks using U-net as a generator. These adversarial learning-based de-occlusion methods generate robust and realistic de-occluded facial images by capturing complex patterns, but age information is somewhat lost due to unnecessary deformation in the non-occluded eye areas. On the other hand, CNN-based U-net is trained with a pixel-wise loss function and has less deformation in the non-occluded eye area, so it preserves age information well. However, learning the mapping from input image to target image in the mask-occluded area is difficult and generates blurred images. Consequently, adversarial learning-based methods are weak at preserving information in the non-occluded eye areas and are strong at de-occlusion, while U-net is strong at preserving information in the non-occluded eye areas and is weak at de-occlusion. The areas with significant age information are shown in the human face image. The areas with high activation are the eyes, nose, and mouth, where the eyes are a non-occluded area and the nose and mouth are occluded areas. As a result, U-net preserved the age information in the non-occluded eye area well, but the non-occluded areas were smaller than the whole face area, and the consequent de-occlusion performance was lower than the other methods because it could not restore the age information.

Comparisons of our LCA-GAN with Existing Methods

This subsection compares the proposed method with state-of-the-art methods. MPRNet is a three-stage model with an iterative structure that receives input images through the multi-scale approach [83]. MPRNet* in Table 9 is a two-stage model that reduces one stage using the input image with the smallest size in MPRNet. According to the experimental results listed in Table 9, LCA-GAN, the de-occlusion network proposed in this study, yielded the best age estimation performance.
Figure 12 shows examples of mask de-occluded images obtained using the proposed LCA-GAN and state-of-the-art methods. Figure 12a displays masked facial images created by the method described in Section 4.1, and Figure 12b shows the original facial images. De-occluded images are shown for (c) the proposed LCA-GAN, (d) AFD-StackGAN, (e) CFR-GAN, (f) MPRNet, (g) CycleGAN, and (h) Pix2pix. As shown in Figure 12, the mask de-occluded image generated by LCA-GAN is the nearest to the original image.

4.4. Testing with PAL Database

4.4.1. Comparisons of the Quality of Images Generated by Proposed Method and the State-of-the-Art Methods

We performed additional experiments using the open database PAL to confirm the generality of the proposed LCA-GAN performance. As presented in Table 10, CFR-GAN [55] and AFD-StackGAN [54] exhibited the best performance for SSIM and PSNR, respectively, while the proposed LCA-GAN yielded the third and fourth highest performance for SSIM and PSNR, respectively. However, SSIM and PSNR are values that represent image quality according to de-occlusion, but the primary goal of this study is to improve age estimation accuracy (not image quality) through de-occlusion. According to a comparison of age estimation accuracy in Table 11, the proposed LCA-GAN yielded the highest accuracy.

4.4.2. Comparisons of Age Estimation Accuracy by Our LCA-GAN and the Existing Methods

For the next experiment, we de-occluded images using LCA-GAN and compared the age estimation accuracy using DEX. MPRNet* in Table 11 is a two-stage model that reduces one stage by using the input image with the smallest size in MPRNet. As shown in Table 11, LCA-GAN, the de-occlusion network proposed in this study, yielded the best age estimation performance. In addition, we performed additional experiments using different age estimation methods after the same use of LCA-GAN. As shown in Table 12, DEX showed the best accuracy among all the different age estimation methods.
Figure 13 illustrates examples of mask-de-occluded images obtained by the proposed LCA-GAN and the state-of-the-art methods. As evidenced in Figure 13, the mask de-occluded image generated by LCA-GAN is the closest to the original image.

4.5. Processing Speed

In this subsection, we measured and compared the processing times of the proposed LCA-GAN and state-of-the-art methods in the desktop environment described in Section 4.1 and a Jetson TX2 board [89], as shown in Figure 14. Table 13 lists the measured processing times. LCA-GAN yielded a faster processing speed in the desktop environment and embedded environment than all state-of-the-art methods, except Pix2pix [57]. Furthermore, the processing speed did not greatly differ from Pix2pix [57]. This indicates that the proposed LCA-GAN can be operated even in an embedded system environment with limited computing resources. Table 14 compares the number of parameters, giga floating point operations per second (GFLOPs), and memory usage between the proposed LCA-GAN and state-of-the-art methods. LCA-GAN exhibited the smallest number of parameters, second lowest GFLOPs, and third lowest memory usage compared to the state-of-the-art methods. However, as indicated in Table 9 and Table 11, the proposed LCA-GAN yielded the best age estimation performance compared to the previous methods.

4.6. Discussion

In this subsection, we present the extraction and analysis of the attention map of the attention module used in the LCA-GAN de-occlusion process (Figure 15) and the gradient class activation map (Grad-CAM) [90] of DEX used for age estimation (Figure 16). Figure 15 displays the (a) original facial images and (b) the mask-occluded facial images. Figure 15c–e show the attention maps of LCAB 1, LCAB 3, and LCAB 5 of the Table 3 encoder, respectively, and Figure 15f–h show the attention maps of LCAB 6, LCAB 8, and LCAB 10 of the Table 3 decoder, respectively. The attention maps show that in the proposed LCA-GAN, as the encoder de-occlusion progresses, attention is activated from the entire face area to the mask area. Subsequently, as the decoder de-occlusion progresses, high activation is shown in the detailed areas of major facial elements, such as the eyes, nose, mouth, and chin.
Subsequently, we examined the Grad-CAM images of DEX, the age estimation network used in this experiment. Figure 16 shows (a) the original images, (b) the mask-occluded images, and (c) the de-occluded images. In Figure 16d–g, the Grad-CAM images of DEX's 4th, 8th, and 11th convolution layers and the last max pooling layer are overlapped with the mask de-occluded images. As illustrated in Figure 16, as DEX learns to estimate the age from mask de-occluded facial images, Grad-CAM, which initially shows high activation for high-frequency information over large areas of the image, increasingly activates elements such as the eyes, nose, mouth, and the surrounding textures as the layers deepen.
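The Grad-CAM maps discussed here follow the standard formulation [90]; a minimal sketch for a Keras classifier such as DEX is shown below, assuming access to the name of the convolutional layer to be visualized.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Standard Grad-CAM: weight the chosen convolution layer's feature maps by the
    gradient of the target class score and average over channels."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])     # add a batch dimension
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))         # most probable age class
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # global-average-pooled gradients
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights[0], axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()     # H x W activation map in [0, 1]
```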
Figure 17 shows examples of de-occluded images that were incorrectly generated by the proposed LCA-GAN. These incorrectly generated de-occluded images can be attributed to several problems: first, the difference in convergence speed between the generator and discriminator during adversarial learning; second, the use of a single generator; and finally, the class imbalance according to gender and race in the learning images as well as non-uniform lighting in the test images.

5. Conclusions

Mask-occluded images that occur in real environments cause a loss of information required for age estimation, thereby degrading age estimation performance. This study proposed a novel de-occlusion network LCA-GAN. Through experiments using MORPH and PAL, open databases of human facial images, the proposed network achieved higher age estimation performance than existing state-of-the-art de-occlusion networks. Furthermore, the proposed LCA-GAN contains 57,118,684 parameters, which is fewer than existing methods. This indicates that it can be operated even in an embedded system with limited computing resources. Moreover, from the attention maps in LCA-GAN and Grad-CAM images of DEX for images de-occluded with LCA-GAN, LCA-GAN and DEX effectively extracted features for de-occlusion and age estimation, respectively. However, as shown in Figure 17, LCA-GAN occasionally incorrectly generated de-occluded images.
To solve this, it is necessary to research solutions for several problems: the difference in convergence speed between the generator and discriminator during adversarial learning, the use of a single generator, the class imbalance according to gender and race in the learning images, and non-uniform lighting in the test images. Moreover, we will investigate solutions for cases where mask occlusion simultaneously occurs with other factors, such as low light and image blurring. Furthermore, we will research a shallower model to achieve faster processing speeds in an embedded platform.

Author Contributions

Methodology, S.H.N.; supervision, K.R.P.; validation, Y.H.K., J.C. and C.P.; writing—original draft, S.H.N.; writing—review and editing, K.R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (MSIT) through the Basic Science Research Program (NRF-2021R1F1A1045587), in part by the NRF funded by the MSIT through the Basic Science Research Program (NRF-2022R1F1A1064291), in part by the MSIT, Korea, under the Information Technology Research Center (ITRC) support program (IITP-2023-2020-0-01789) supervised by the IITP (Institute for Information and Communications Technology Planning and Evaluation), and in part by the National Supercomputing Center with supercomputing resources including technical support (TS-2023-RE-0025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gallagher, A.C.; Chen, T. Estimating age, gender, and identity using first name priors. In Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
  2. Angulu, R.; Tapamo, J.R.; Adewumi, A.O. Age estimation via face images: A survey. EURASIP J. Image Video Process. 2018, 2018, 42.
  3. Wang, X.; Guo, R.; Kambhamettu, C. Deeply-learned feature for age estimation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 534–541.
  4. Farkas, J.P.; Pessa, J.E.; Hubbard, B.; Rohrich, R.J. The Science and Theory behind Facial Aging. Plast. Reconstr. Surg. Glob. Open 2013, 1, e8–e15.
  5. Albert, A.M.; Ricanek, K., Jr.; Patterson, E. A review of the literature on the aging adult skull and face: Implications for forensic science research and applications. Forensic Sci. Int. 2007, 172, 1–9.
  6. Olatunbosun, A.-A.; Serestina, V. Deep learning approach for facial age classification: A survey of the state-of-the-art. Artif. Intell. Rev. 2020, 54, 179–213.
  7. Antipov, G.; Baccouche, M.; Berrani, S.-A.; Dugelay, J.-L. Apparent age estimation from face images combining general and children-specialized deep learning models. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 801–809.
  8. Rothe, R.; Timofte, R.; Van Gool, L. Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks. Int. J. Comput. Vis. 2018, 126, 144–157.
  9. Agbo-Ajala, O.; Viriri, S. Face-based age and gender classification using deep learning model. In Proceedings of the 2019 International Workshops, Sydney, NSW, Australia, 18–22 November 2019; pp. 125–137.
  10. Duan, M.; Li, K.; Li, K. An ensemble CNN2ELM for age estimation. IEEE Trans. Inf. Forensic Secur. 2018, 13, 758–772.
  11. Liao, H.; Yan, Y.; Dai, W.; Fan, P. Age Estimation of Face Images Based on CNN and Divide-and-Rule Strategy. Math. Probl. Eng. 2018, 2018, 1712686.
  12. LCA-GAN with Algorithm. (Model and Algorithm to Be Uploaded on Github). Available online: https://github.com/nsh6473/LCA-GAN/ (accessed on 1 February 2023).
  13. Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91.
  14. Hiba, S.; Keller, Y. Hierarchical attention-based age estimation and Bias estimation. arXiv 2021, arXiv:2103.09882.
  15. Nimhed, C. Estimation of Height, Weight, Sex and Age from Magnetic Resonance Images Using 3D Convolutional Neural Networks. Master’s Thesis, Linköping University, Linköping, Sweden, 2022; pp. 1–60.
  16. Yaman, D.; Eyiokur, F.I.; Ekenel, H.K. Multimodal soft biometrics: Combining ear and face biometrics for age and gender classification. Multimedia Tools Appl. 2021, 81, 22695–22713.
  17. Onifade, O.F.W.; Akinyemi, J.D. A GW ranking approach for facial age estimation. Egypt. Comput. Sci. J. 2014, 38, 63–74.
  18. Levi, G.; Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 11–12 June 2015; pp. 34–42.
  19. Chen, J.-C.; Kumar, A.; Ranjan, R.; Patel, V.M.; Alavi, A.; Chellappa, R. A cascaded convolutional neural network for age estimation of unconstrained faces. In Proceedings of the IEEE 8th International Conference on Biometrics Theory, Applications and Systems, Niagara Falls, NY, USA, 6–9 September 2016; pp. 1–8.
  20. Zhu, Y.; Li, Y.; Mu, G.; Guo, G. A study on apparent age estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 11–12 December 2015; pp. 267–273.
  21. Rothe, R.; Timofte, R.; Gool, L.V. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 11–12 December 2015; pp. 252–257.
  22. Agustsson, E.; Timofte, R.; Escalera, S.; Baro, X.; Guyon, I.; Rothe, R. Apparent and real age estimation in still images with deep residual regressors on appa-real database. In Proceedings of the 12th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 30 May–3 June 2017; pp. 87–94.
  23. Anand, A.; Labati, R.D.; Genovese, A.; Munoz, E.; Piuri, V.; Scotti, F. Age estimation based on face images and pre-trained convolutional neural networks. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence, Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–7.
  24. Aydogdu, M.F.; Demirci, M.F. Age classification using an optimized CNN architecture. In Proceedings of the International Conference on Compute and Data Analysis, Lakeland, FL, USA, 19–23 May 2017; pp. 233–239.
  25. Zhang, K.; Gao, C.; Guo, L.; Sun, M.; Yuan, X.; Han, T.X.; Zhao, Z.; Li, B. Age Group and Gender Estimation in the Wild With Deep RoR Architecture. IEEE Access 2017, 5, 22492–22503.
  26. Ranjan, R.; Zhou, S.; Chen, J.C.; Kumar, A.; Alavi, A.; Patel, V.M.; Chellappa, R. Unconstrained age estimation with deep convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop, Santiago, Chile, 7–13 December 2015; pp. 351–359.
  27. Niu, Z.; Zhou, M.; Wang, L.; Gao, X.; Hua, G. Ordinal regression with multiple output CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4920–4928.
  28. Li, W.; Lu, J.; Feng, J.; Xu, C.; Zhou, J.; Tian, Q. Bridgenet: A continuity-aware probabilistic network for age estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1145–1154.
  29. Gao, B.B.; Zhou, H.Y.; Wu, J.; Geng, X. Age estimation using expectation of label distribution learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 712–718.
  30. Zhang, K.; Liu, N.; Yuan, X.; Guo, X.; Gao, C.; Zhao, Z.; Ma, Z. Fine-Grained Age Estimation in the Wild With Attention LSTM Networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3140–3152. [Google Scholar] [CrossRef]
  31. Chen, S.; Zhang, C.; Dong, M.; Le, J.; Rao, M. Using ranking-CNN for age estimation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 742–751. [Google Scholar]
  32. Liu, W.; Chen, L.; Chen, Y. Age Classification Using Convolutional Neural Networks with the Multi-class Focal Loss. IOP Conf. Series: Mater. Sci. Eng. 2018, 428, 012043. [Google Scholar] [CrossRef]
  33. Liu, H.; Lu, J.; Feng, J.; Zhou, J. Ordinal Deep Learning for Facial Age Estimation. IEEE Trans. Circuits Syst. Video Technol. 2017, 29, 486–501. [Google Scholar] [CrossRef]
  34. Gurpinar, F.; Kaya, H.; Dibeklioglu, H.; Salah, A.A. Kernel ELM and CNN based facial age estimation. In Proceedings of the 2016 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 785–791. [Google Scholar]
  35. Liu, K.-H.; Yan, S.; Kuo, C.-C.J. Age Estimation via Grouping and Decision Fusion. IEEE Trans. Inf. Forensics Secur. 2015, 10, 2408–2423. [Google Scholar] [CrossRef]
  36. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
  37. Duan, M.; Li, K.; Yang, C.; Li, K. A hybrid deep learning CNNELM for age and gender classification. Neurocomputing 2018, 275, 448–461. [Google Scholar] [CrossRef]
  38. Liu, X.; Zou, Y.; Kuang, H.; Ma, X. Face Image Age Estimation Based on Data Augmentation and Lightweight Convolutional Neural Network. Symmetry 2020, 12, 146. [Google Scholar] [CrossRef]
  39. MORPH Database. Available online: https://ebill.uncw.edu/C20231_ustores/web/store_main.jsp?STOREID=4 (accessed on 17 May 2022).
  40. FGNET Database. Available online: https://yanweifu.github.io/FG_NET_data/index.html (accessed on 17 May 2022).
  41. IMDB Database. Available online: https://www.imdb.com/interfaces/ (accessed on 17 May 2022).
  42. Adience Database. Available online: https://talhassner.github.io/home/projects/Adience/Adience-data.html/ (accessed on 17 May 2022).
  43. LAP 2015 Database. Available online: https://chalearnlap.cvc.uab.cat/dataset/18/description/ (accessed on 17 May 2022).
  44. LAP 2016 Database. Available online: https://chalearnlap.cvc.uab.cat/dataset/19/description/ (accessed on 17 May 2022).
  45. CACD Database. Available online: https://bcsiriuschen.github.io/CARC/ (accessed on 17 May 2022).
  46. Guo, G.; Mu, G. Human age estimation: What is the influence across race and gender. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 71–78. [Google Scholar]
  47. Chen, K.; Gong, S.; Xiang, T.; Change Loy, C. Cumulative attribute space for age and crowd density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2467–2474. [Google Scholar]
  48. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv 2016, arXiv:1602.07261v2. [Google Scholar] [CrossRef]
  49. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  50. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar] [CrossRef]
  51. Looking at People CVPR Challenge—Track1: Age Estimation. Available online: https://chalearnlap.cvc.uab.cat/challenge/13/description/ (accessed on 17 May 2022).
  52. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  53. Dong, J.; Zhang, L.; Zhang, H.; Liu, W. Occlusion-aware GAN for face de-occlusion in the wild. In Proceedings of the IEEE International Conference on Multimedia and Expo, London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
  54. Jabbar, A.; Li, X.; Assam, M.; Khan, J.A.; Obayya, M.; Alkhonaini, M.A.; Al-Wesabi, F.N.; Assad, M. AFD-StackGAN: Automatic mask generation network for face de-occlusion using StackGAN. Sensors 2022, 22, 1747. [Google Scholar] [CrossRef]
  55. Ju, Y.-J.; Lee, G.-H.; Hong, J.-H.; Lee, S.-W. Complete Face Recovery GAN: Unsupervised joint face rotation and de-occlusion from a single-view image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3711–3721. [Google Scholar]
  56. Zhao, F.; Feng, J.; Zhao, J.; Yang, W.; Yan, S. Robust LSTM-Autoencoders for Face De-Occlusion in the Wild. IEEE Trans. Image Process. 2017, 27, 778–790. [Google Scholar] [CrossRef] [PubMed]
  57. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar]
  58. Iliadis, M.; Wang, H.; Molina, R.; Katsaggelos, A.K. Robust and Low-Rank Representation for Fast Face Identification With Occlusions. IEEE Trans. Image Process. 2017, 26, 2203–2218. [Google Scholar] [CrossRef] [PubMed]
  59. Din, N.U.; Javed, K.; Bae, S.; Yi, J. A Novel GAN-Based Network for Unmasking of Masked Face. IEEE Access 2020, 8, 44276–44287. [Google Scholar] [CrossRef]
  60. Khan, M.K.J.; Din, N.U.; Bae, S.; Yi, J. Interactive Removal of Microphone Object in Facial Images. Electronics 2019, 8, 1115. [Google Scholar] [CrossRef]
  61. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  62. Dlib C++ Library. Available online: http://dlib.net/ (accessed on 17 May 2022).
  63. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  64. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  65. Jiang, J.; Wang, C.; Liu, X.; Ma, J. Deep Learning-based Face Super-resolution: A Survey. ACM Comput. Surv. 2021, 55, 1–36. [Google Scholar] [CrossRef]
  66. Looking at People ICCV Challenge—Track1: Age Estimation. Available online: https://chalearnlap.cvc.uab.cat/challenge/12/description/ (accessed on 17 May 2022).
  67. Nam, S.H.; Kim, Y.H.; Choi, J.; Hong, S.B.; Owais, M.; Park, K.R. LAE-GAN-Based Face Image Restoration for Low-Light Age Estimation. Mathematics 2021, 9, 2329. [Google Scholar] [CrossRef]
  68. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14. [Google Scholar]
  69. PAL database. Available online: http://agingmind.utdallas.edu/download-stimuli/face-database/ (accessed on 17 May 2022).
  70. Python. Available online: https://www.python.org/ (accessed on 1 October 2019).
  71. OpenCV. Available online: http://opencv.org (accessed on 1 October 2022).
  72. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467v2. [Google Scholar]
  73. NVIDIA GeForce GTX 1070. Available online: https://www.nvidia.com/en-in/geforce/products/10series/geforce-gtx-1070/ (accessed on 21 April 2022).
  74. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar]
  75. Goodfellow, I. NIPS 2016 Tutorial: Generative adversarial networks. arXiv 2017, arXiv:1701.00160v4. [Google Scholar]
  76. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X.; Chen, X. Improved techniques for training GANs. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2234–2242. [Google Scholar]
  77. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017; pp. 1–17. [Google Scholar]
  78. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in ADAM. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–4. [Google Scholar]
  79. Smith, S.L.; Kindermans, P.-J.; Ying, C.; Le, Q.V. Don't decay the learning rate, increase the batch size. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–11. [Google Scholar]
  80. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. arXiv 2019, arXiv:1904.09237v1. [Google Scholar]
  81. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  82. Antkowiak, J.; Baina, T.J. Final Report from the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment. ITU-T Standards Contribution COM, 2000. Available online: https://www.vqeg.org/publications-and-software/ (accessed on 17 May 2022).
  83. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14821–14831. [Google Scholar]
  84. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  85. Sharma, N.; Sharma, R.; Jindal, N. Face-Based Age and Gender Estimation Using Improved Convolutional Neural Network Approach. Wirel. Pers. Commun. 2022, 124, 3035–3054. [Google Scholar] [CrossRef]
  86. Zhang, B.; Bao, Y. Age Estimation of Faces in Videos Using Head Pose Estimation and Convolutional Neural Networks. Sensors 2022, 22, 4171. [Google Scholar] [CrossRef] [PubMed]
  87. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  88. Liu, X.; Li, S.; Kan, M.; Zhang, J.; Wu, S.; Liu, W.; Han, H.; Shan, S.; Chen, X. Agenet: Deeply learned regressor and classifier for robust apparent age estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 16–24. [Google Scholar]
  89. Jetson TX2 Module. Available online: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems-dev-kits-modules/ (accessed on 15 September 2022).
  90. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
Figure 1. The entire procedure of our LCA-GAN-based de-occlusion and age estimation.
Figure 2. Exemplary process of face ROI definition. (a) Example image of original MORPH database. (b) Detected face and eyes region using dlib facial feature tracker. (c) In-plane rotated facial image with detected face region. (d) Re-defined face ROI image.
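The ROI definition in Figure 2 can be sketched in a few lines of Python with dlib [62] and OpenCV [71]. The snippet below is an illustrative sketch only, assuming the public 68-point shape predictor file shape_predictor_68_face_landmarks.dat; it is not the authors' exact preprocessing code, and the 256 × 256 output size simply matches the generator input in Table 3.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed landmark model file

def define_face_roi(image_bgr):
    """Detect the face, compensate in-plane rotation using the eye centers, and re-crop the ROI."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    pts = np.array([[p.x, p.y] for p in predictor(gray, faces[0]).parts()])
    left_eye, right_eye = pts[36:42].mean(axis=0), pts[42:48].mean(axis=0)
    # In-plane rotation angle from the line connecting the two eye centers
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0]))
    center = (float(pts[:, 0].mean()), float(pts[:, 1].mean()))
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image_bgr, M, (image_bgr.shape[1], image_bgr.shape[0]))
    # Re-detect the face on the rotated image and crop the re-defined ROI
    faces = detector(cv2.cvtColor(rotated, cv2.COLOR_BGR2GRAY), 1)
    if not faces:
        return None
    f = faces[0]
    roi = rotated[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
    return cv2.resize(roi, (256, 256))
```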
Figure 3. LCA-GAN architecture. (a) Generator. (b) Discriminator.
Figure 4. Structure of LCCA (AvgPool and MLP denote average pooling and multi-layer perceptron, respectively).
Figure 5. Structure of LCSA.
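Figures 4 and 5 can be summarized in code. The following Keras sketch illustrates a low-complexity channel attention module (average pooling followed by an MLP, as in Figure 4) and a low-complexity spatial attention module in the spirit of CBAM [64]; the reduction ratio, kernel size, and the use of a single average-pooled map are illustrative assumptions rather than values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def lcca(x, reduction=8):
    """Channel attention: global average pooling + shared MLP producing a per-channel sigmoid gate."""
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)                    # AvgPool in Figure 4
    w = layers.Dense(c // reduction, activation="relu")(w)    # MLP; reduction ratio is an assumption
    w = layers.Dense(c, activation="sigmoid")(w)
    w = layers.Reshape((1, 1, c))(w)
    return layers.Multiply()([x, w])

def lcsa(x, kernel_size=7):
    """Spatial attention: channel-averaged map -> convolution -> per-pixel sigmoid gate."""
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    w = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(avg)
    return layers.Multiply()([x, w])

# Toy usage on a dummy feature map
inp = layers.Input((64, 64, 128))
model = tf.keras.Model(inp, lcsa(lcca(inp)))
model.summary()
```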
Figure 6. The overall procedure of age estimation with DEX.
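Figure 6 corresponds to the age estimation stage, in which DEX [8,21] outputs a probability for each discrete age class and the final prediction is the expected value over those classes. A minimal numpy illustration with dummy network outputs:

```python
import numpy as np

ages = np.arange(0, 101)                          # discrete age classes 0..100, as used in DEX
logits = np.random.randn(ages.size)               # dummy network output for a single face
probs = np.exp(logits) / np.exp(logits).sum()     # softmax over age classes
estimated_age = float((ages * probs).sum())       # expected value E[y] = sum_i y_i * p(y_i)
print(f"estimated age: {estimated_age:.2f} years")
```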
Figure 7. Examples of the (a) MORPH and (b) PAL databases.
Figure 8. Procedure of adding a mask to facial images. (a) Original facial image, (b) original mask image, (c) eyes detected using the dlib tracker, (d) in-plane rotation-compensated image with detected jawlines, (e) mask image warped using a geometric transformation, (f) mask-occluded image and detected face ROI, and (g) re-defined face ROI image of the mask-occluded image.
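The mask overlay in Figure 8 amounts to warping a mask image onto face landmarks and alpha-blending it. The sketch below is a simplified OpenCV illustration; the choice of three control points and the use of an affine (rather than a more general) transformation are assumptions made for brevity.

```python
import cv2
import numpy as np

def overlay_mask(face_bgr, mask_bgra, dst_pts):
    """Warp a mask image (with alpha channel) onto a face and alpha-blend it.

    dst_pts: three (x, y) points on the face (e.g., nose bridge and the two jaw corners);
    the exact choice of control points is an illustrative assumption."""
    h, w = face_bgr.shape[:2]
    mh, mw = mask_bgra.shape[:2]
    src_pts = np.float32([[mw / 2, 0], [0, mh - 1], [mw - 1, mh - 1]])   # top-center and bottom corners of the mask
    M = cv2.getAffineTransform(src_pts, np.float32(dst_pts))
    warped = cv2.warpAffine(mask_bgra, M, (w, h))                        # warped mask, zeros elsewhere
    alpha = warped[:, :, 3:4].astype(np.float32) / 255.0
    blended = face_bgr.astype(np.float32) * (1.0 - alpha) + warped[:, :, :3].astype(np.float32) * alpha
    return blended.astype(np.uint8)
```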
Figure 9. Graphs of LCA-GAN losses. Blue and violet lines represent the generator training and validation loss graphs, respectively. Red and green lines represent the discriminator training and validation loss graphs, respectively.
Figure 10. Graphs of DEX loss and accuracy. Blue and red lines represent the training loss and accuracy graphs, respectively. Violet and green lines represent the validation loss and accuracy graphs, respectively.
Figure 11. Examples of de-occluded mask images. (a) Masked images and (b) original images. De-occluded images obtained using the (c) U-net structure, (d) CycleGAN structure, (e) Pix2pix structure (LCA-GAN), and (f) Pix2pix* (a method that subtracts the masked image from the original image so that only the occluded area remains, concatenates this area with the input image, and uses the concatenated input for Pix2pix training).
Figure 12. Examples of mask de-occluded images. (a) Masked images and (b) original images without a mask. De-occluded images are shown by using (c) the proposed LCA-GAN, (d) AFD-Stack GAN, (e) CFR-GAN, (f) MRPNet, (g) CycleGAN, and (h) Pix2pix.
Figure 13. Examples of mask-de-occluded images. (a) Masked images and (b) original images without a mask. De-occluded images are shown by (c) the proposed LCA-GAN, (d) AFD-Stack GAN, (e) CFR-GAN, (f) MRPNet, (g) CycleGAN, and (h) Pix2pix.
Figure 14. Jetson TX2 board.
Figure 15. LCA-GAN attention map. (a) Original image, (b) mask-occluded image, (c–e) attention maps from Table 3 encoder LCAB 1, LCAB 3, and LCAB 5, and (f–h) attention maps from Table 3 decoder LCAB 6, LCAB 8, and LCAB 10.
Figure 16. Grad-CAM images from DEX with LCA-GAN. (a) Original image, (b) mask-occluded image, (c) de-occluded image, and (d–g) overlapped Grad-CAM images extracted from the 4th, 8th, and 11th convolution layers and the last pooling layer of DEX, respectively.
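The visualizations in Figure 16 follow Grad-CAM [90]. The function below is a generic tf.GradientTape sketch for a Keras classifier; the layer name and class handling are placeholders rather than the exact DEX configuration used in the paper.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a Grad-CAM heat map (H x W, scaled to [0, 1]) for one preprocessed image batch of size 1."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image)
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)                  # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pooled gradients
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights[0], axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```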
Figure 17. Bad case images generated by LCA-GAN. (a) Original images and (b) de-occluded images by LCA-GAN.
Table 1. Classification and comparison of existing age estimation methods.

Categories | Method | Database | MAE | Exact | 1-Off | ϵ-Error
Classification of multi-class ages | DEX [21] | IMDB-WIKI + LAP2015 | 3.22 | N.A. | N.A. | 0.26
 | Residual DEX [22] | LAP2015 | 4.45 | N.A. | N.A. | N.A.
 | Dimensionality reduction + FFNNs [23] | WIKI + AmI-Face + Adience | 3.30 | N.A. | N.A. | N.A.
 | 4C2FC [24] | MORPH | N.A. | 46.39 | N.A. | N.A.
 | RoR [25] | IMDB-WIKI + Adience | N.A. | 67.3 | 97.51 | N.A.
 | DEX [8] | MORPH | 2.68 | N.A. | N.A. | N.A.
 | | FG-NET | 3.09 | N.A. | N.A. | N.A.
 | | CACD | 6.52 | N.A. | N.A. | N.A.
 | | IMDB-WIKI + LAP2015 | N.A. | N.A. | N.A. | 0.26
 | | Adience | N.A. | 64.0 | 96.6 | N.A.
 | 4C2FC + dropout [9] | Adience | N.A. | 84.8 | 89.7 | N.A.
Regression based on metrics | 3NNR [26] | Adience + MORPH + LAP2015 | N.A. | N.A. | N.A. | 0.37
 | OR-CNN [27] | AFAD | 3.34 | N.A. | N.A. | N.A.
 | | MORPH | 3.27 | N.A. | N.A. | N.A.
 | VGG + BridgeNet [28] | MORPH | 2.38 | N.A. | N.A. | N.A.
 | | FG-NET | 2.56 | N.A. | N.A. | N.A.
 | | LAP2015 | 2.98 | N.A. | N.A. | 0.26
Learning by the distribution of deep label | DLDL-v2 [29] | LAP2015 | 3.14 | N.A. | N.A. | 0.272
 | | LAP2016 | 3.45 | N.A. | N.A. | 0.267
 | | MORPH | 1.97 | N.A. | N.A. | N.A.
 | Inception v4 [30] | MORPH | 1.32 | N.A. | N.A. | N.A.
 | | FG-NET | 2.19 | N.A. | N.A. | N.A.
Ranking | Ranking-CNN [31] | MORPH | 2.96 | N.A. | N.A. | N.A.
 | ODFL + OHRank [32] | MORPH | 3.12 | N.A. | N.A. | N.A.
 | | FG-NET | 3.89 | N.A. | N.A. | N.A.
 | | LAP2016 | 4.12 | N.A. | N.A. | 0.34
 | | Adience | N.A. | 54.0 | 88.2 | N.A.
 | ODL [33] | MORPH | 2.92 | N.A. | N.A. | N.A.
 | | FG-NET | 3.71 | N.A. | N.A. | N.A.
 | | LAP2016 | 3.95 | N.A. | N.A. | 0.312
Hybrid methods | Kernel ELM + CNN [34] | LAP2016 | N.A. | N.A. | N.A. | 0.37
 | MRCNN [35] | MORPH | 3.48 | N.A. | N.A. | N.A.
 | GA-DFL [36] | MORPH | 3.25 | N.A. | N.A. | N.A.
 | | FG-NET | 3.93 | N.A. | N.A. | N.A.
 | | LAP2015 | 4.21 | N.A. | N.A. | 0.37
 | CNN + ELM [37] | MORPH | 3.44 | N.A. | N.A. | N.A.
 | | Adience | N.A. | 52.3 | N.A. | N.A.
 | RAGN [10] | IMDB-WIKI + MORPH | 2.61 | N.A. | N.A. | N.A.
 | | IMDB-WIKI + Adience | N.A. | 66.5 | N.A. | N.A.
 | | IMDB-WIKI + LAP2016 | N.A. | N.A. | N.A. | 0.37
 | AgeNet + divide and rule [11] | FG-NET | 4.02 | N.A. | N.A. | N.A.
 | | MORPH | 3.48 | N.A. | N.A. | N.A.
 | | IMDB-WIKI | 3.29 | N.A. | N.A. | N.A.
 | MA-ShuffleNet v2 [38] | MORPH | 2.68 | N.A. | N.A. | N.A.
 | | FG-NET | 3.81 | N.A. | N.A. | N.A.
Table 2. Comparison of the strengths and weaknesses of existing research and proposed methods in age estimation based on consideration of face occlusion.

Categories | Age Learning Technique | Method | Strength | Weakness
Age estimation without considering face occlusion | Handcrafted feature-based | Guo et al. [46] | Age estimation robust to a restricted environment | They did not consider face-occluded images for age estimation
 | | Chen et al. [47] | |
 | Deep feature-based | Inception v4 [30] | |
 | | MA-ShuffleNet v2 [38] | |
Age estimation considering face occlusion | | DEX [8] | Age estimation robust to occluded facial images | They trained simultaneously with occlusion and non-occlusion images, which made network convergence difficult
 | | AgeNet + divide and rule [11] | |
 | | RAGN [10] | |
 | | 4C2FC + dropout [9] | |
 | | Proposed method | | Additional procedures are required to train LCA-GAN
Table 3. Generator architecture in LCA-GAN.

Layer | Size of Feature | Concatenation
Input image | 256 × 256 × 3 | -
Encoder
  Convolution layer 1 | 256 × 256 × 64 | -
  Spatial attention | 256 × 256 × 64 | -
  LCAB 1 (LCCA → LCSA) | 128 × 128 × 64 → 128 × 128 × 128 | -
  LCAB 2 (LCCA → LCSA) | 64 × 64 × 128 → 64 × 64 × 256 | -
  LCAB 3 (LCCA → LCSA) | 32 × 32 × 256 → 32 × 32 × 512 | -
  LCAB 4 (LCCA → LCSA) | 16 × 16 × 512 → 16 × 16 × 512 | -
  LCAB 5 (LCCA → LCSA) | 8 × 8 × 512 → 8 × 8 × 512 | -
Decoder
  LCAB 6 (LCCA → Concatenation → LCSA) | 16 × 16 × 512 → 16 × 16 × 1024 → 16 × 16 × 512 | LCAB 4
  LCAB 7 (LCCA → Concatenation → LCSA) | 32 × 32 × 512 → 32 × 32 × 1024 → 32 × 32 × 512 | LCAB 3
  LCAB 8 (LCCA → Concatenation → LCSA) | 64 × 64 × 512 → 64 × 64 × 768 → 64 × 64 × 256 | LCAB 2
  LCAB 9 (LCCA → Concatenation → LCSA) | 128 × 128 × 256 → 128 × 128 × 384 → 128 × 128 × 128 | LCAB 1
  LCAB 10 (LCCA → LCSA) | 256 × 256 × 128 → 256 × 256 × 64 | -
  Convolution layer 2 + Tanh activation layer | 256 × 256 × 3 | -
Generated image | 256 × 256 × 3 | -
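A compact Keras sketch of the encoder–decoder skeleton in Table 3 is given below. Each LCAB is stood in for by a single strided (or transposed) convolution block; the real blocks additionally apply the LCCA and LCSA attention of Figures 4 and 5, and the kernel sizes, strides, and normalization here are assumptions rather than the published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def block(x, filters, down=True):
    """Stand-in for one LCAB: a strided (encoder) or transposed (decoder) convolution block."""
    conv = layers.Conv2D if down else layers.Conv2DTranspose
    x = conv(filters, 4, strides=2, padding="same")(x)        # kernel size and stride are assumptions
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x)

inp = layers.Input((256, 256, 3))
x = layers.Conv2D(64, 3, padding="same")(inp)                 # Convolution layer 1 in Table 3
e1 = block(x, 128)                                            # 128 x 128 (LCAB 1)
e2 = block(e1, 256)                                           # 64 x 64  (LCAB 2)
e3 = block(e2, 512)                                           # 32 x 32  (LCAB 3)
e4 = block(e3, 512)                                           # 16 x 16  (LCAB 4)
e5 = block(e4, 512)                                           # 8 x 8    (LCAB 5, bottleneck)
d6 = layers.Concatenate()([block(e5, 512, down=False), e4])   # skip connection from LCAB 4
d7 = layers.Concatenate()([block(d6, 512, down=False), e3])   # skip connection from LCAB 3
d8 = layers.Concatenate()([block(d7, 256, down=False), e2])   # skip connection from LCAB 2
d9 = layers.Concatenate()([block(d8, 128, down=False), e1])   # skip connection from LCAB 1
d10 = block(d9, 64, down=False)                               # 256 x 256 (LCAB 10)
out = layers.Conv2D(3, 3, padding="same", activation="tanh")(d10)  # Convolution layer 2 + Tanh
generator = tf.keras.Model(inp, out)
```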
Table 4. Discriminator architecture in LCA-GAN. CL and BN mean convolution layer and batch normalization, respectively.

Layer | Operations | Size of Feature
Input image | | 256 × 256 × 3
Target or de-occluded image | | 256 × 256 × 3
Concatenate | | 256 × 256 × 6
CL 1 | Convolution, BN, ReLU | 128 × 128 × 64
CL 2 | Convolution, BN, ReLU | 64 × 64 × 128
CL 3 | Convolution, BN, ReLU | 32 × 32 × 256
CL 4 | Zero padding | 34 × 34 × 256
 | Convolution, BN, Leaky ReLU | 31 × 31 × 512
CL 5 | Zero padding | 33 × 33 × 512
 | Convolution, Sigmoid | 30 × 30 × 1
 | Average pooling | 1 × 1 × 1
Output | | Real or fake
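Table 4 maps almost directly onto a Keras model. The sketch below reproduces the listed feature sizes; the 4 × 4 kernel follows from the 34 → 31 and 33 → 30 spatial reductions after zero padding, while the stride-2 "same" convolutions in CL 1 to CL 3 are an assumption consistent with the halved feature maps.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cl(x, filters, leaky=False, pad=False):
    """One CL block from Table 4: (optional zero padding) -> convolution -> BN -> activation."""
    if pad:                                   # CL 4: 32x32 -> 34x34 -> 31x31 (stride 1, valid)
        x = layers.ZeroPadding2D()(x)
        x = layers.Conv2D(filters, 4, strides=1, padding="valid")(x)
    else:                                     # CL 1-3: stride-2 convolutions that halve the feature map
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x) if leaky else layers.ReLU()(x)

inp = layers.Input((256, 256, 3))             # mask-occluded input image
tgt = layers.Input((256, 256, 3))             # target or de-occluded image
x = layers.Concatenate()([inp, tgt])          # 256 x 256 x 6
x = cl(x, 64)                                 # CL 1: 128 x 128 x 64
x = cl(x, 128)                                # CL 2: 64 x 64 x 128
x = cl(x, 256)                                # CL 3: 32 x 32 x 256
x = cl(x, 512, leaky=True, pad=True)          # CL 4: 31 x 31 x 512
x = layers.ZeroPadding2D()(x)                 # CL 5: 33 x 33 x 512
x = layers.Conv2D(1, 4, strides=1, padding="valid", activation="sigmoid")(x)  # 30 x 30 x 1 patch scores
out = layers.GlobalAveragePooling2D()(x)      # average pooling to a single real/fake score
discriminator = tf.keras.Model([inp, tgt], out)
```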
Table 5. Distribution of our experimental database for two-fold cross validation.

Set | Gender | Age | White | Black | Hispanic | Asian | Other | Total
Training | Male | 16~25 | 952 | 5834 | 401 | 44 | 4 | 7235
 | | 26~35 | 864 | 4184 | 231 | 13 | 5 | 5296
 | | 36~45 | 1090 | 4300 | 92 | 3 | 4 | 5490
 | | 46~55 | 579 | 1973 | 24 | 4 | 7 | 2588
 | | 56~65 | 90 | 268 | 2 | 0 | 0 | 360
 | | 66~77 | 7 | 16 | 0 | 0 | 0 | 23
 | | Total | 3582 | 16,574 | 750 | 63 | 20 | 20,990
 | Female | 16~25 | 283 | 710 | 20 | 5 | 0 | 1017
 | | 26~35 | 367 | 761 | 19 | 0 | 2 | 1149
 | | 36~45 | 395 | 818 | 6 | 0 | 5 | 1224
 | | 46~55 | 106 | 275 | 0 | 0 | 1 | 383
 | | 56~65 | 18 | 25 | 0 | 0 | 1 | 45
 | | 66~77 | 1 | 1 | 0 | 0 | 0 | 2
 | | Total | 1169 | 2591 | 46 | 6 | 9 | 3820
Validation | Male | 16~25 | 212 | 1296 | 89 | 10 | 1 | 1608
 | | 26~35 | 192 | 930 | 51 | 3 | 1 | 1177
 | | 36~45 | 242 | 956 | 20 | 1 | 1 | 1220
 | | 46~55 | 129 | 439 | 5 | 1 | 2 | 575
 | | 56~65 | 20 | 60 | 0 | 0 | 0 | 80
 | | 66~77 | 2 | 4 | 0 | 0 | 0 | 5
 | | Total | 796 | 3683 | 167 | 14 | 4 | 4665
 | Female | 16~25 | 63 | 158 | 4 | 1 | 0 | 226
 | | 26~35 | 82 | 169 | 4 | 0 | 1 | 255
 | | 36~45 | 88 | 182 | 1 | 0 | 1 | 272
 | | 46~55 | 24 | 61 | 0 | 0 | 0 | 85
 | | 56~65 | 4 | 6 | 0 | 0 | 0 | 10
 | | 66~77 | 0 | 0 | 0 | 0 | 0 | 1
 | | Total | 260 | 576 | 10 | 1 | 2 | 849
Total | Male | | 7961 | 36,832 | 1667 | 141 | 44 | 46,645
 | Female | | 2598 | 5757 | 102 | 13 | 19 | 8489
Table 6. Comparative SSIM and PSNR of original image and de-occluded images.

Method | SSIM | PSNR (dB)
LCA-GAN | 0.6962 | 19.0302
AFD-StackGAN [54] | 0.6769 | 16.3121
CFR-GAN [55] | 0.7107 | 18.3067
MPRNet [83] | 0.7031 | 19.6427
CycleGAN [84] | 0.5630 | 19.3321
Pix2pix [57] | 0.7225 | 18.8731
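The SSIM [81] and PSNR values in Table 6 can be reproduced with standard implementations; for instance, the following sketch uses TensorFlow's built-in image metrics and is not necessarily the exact evaluation code used here.

```python
import tensorflow as tf

def image_quality(original, restored, max_val=255.0):
    """SSIM and PSNR between an original and a de-occluded image (uint8 H x W x 3 arrays)."""
    a = tf.convert_to_tensor(original, tf.float32)
    b = tf.convert_to_tensor(restored, tf.float32)
    return float(tf.image.ssim(a, b, max_val=max_val)), float(tf.image.psnr(a, b, max_val=max_val))
```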
Table 7. Ablation study with or without LCSA, LCCA, and Edge and Content losses in our LCA-GAN (unit: years).

LCSA | LCCA | Edge loss + Content loss | MAE
× | × | × | 7.72
✓ | × | × | 7.11
× | ✓ | × | 7.09
× | × | ✓ | 7.83
✓ | ✓ | × | 6.82
✓ | ✓ | ✓ | 6.64
Table 8. Ablation study according to various backbone generators in LCA-GAN. Baselines 1 and 2 show the MAEs obtained by DEX with the original non-occluded and the mask-occluded face images, respectively. (Pix2pix* indicates the method that subtracts the masked image from the original image so that only the occluded area remains, concatenates this area with the input image, and uses the concatenated input for Pix2pix training) (unit: years).

Method | MAE
Baseline 1 | 5.80
Baseline 2 | 10.45
U-net | 7.70
Pix2pix (LCA-GAN) | 6.64
CycleGAN | 7.15
Pix2pix* | 6.91
Table 9. Comparative accuracies of age estimation by LCA-GAN and various de-occlusion methods (MPRNet* is a two-stage model that reduces one stage using the input image with the smallest size in MPRNet) (unit: years).

Method | MAE
LCA-GAN | 6.64
AFD-StackGAN [54] | 6.92
CFR-GAN [55] | 7.13
MPRNet [83] | 6.95
MPRNet* [83] | 7.83
Pix2pix [57] | 7.72
CycleGAN [84] | 8.18
Table 10. Comparative SSIM and PSNR of original image and de-occluded images.

Method | SSIM | PSNR (dB)
LCA-GAN | 0.7042 | 18.3302
AFD-StackGAN [54] | 0.6983 | 19.4423
CFR-GAN [55] | 0.7207 | 18.3043
MPRNet [83] | 0.7002 | 18.9742
Pix2pix [57] | 0.7134 | 19.4211
CycleGAN [84] | 0.6892 | 17.9443
Table 11. Comparative accuracies of age estimation by LCA-GAN and various de-occlusion methods (MPRNet* is a two-stage model that reduces one stage using the input image with the smallest size in MPRNet) (unit: years).

Method | MAE
LCA-GAN | 6.12
AFD-StackGAN [54] | 6.94
CFR-GAN [55] | 6.52
MPRNet [83] | 8.21
MPRNet* [83] | 8.70
Pix2pix [57] | 7.12
CycleGAN [84] | 9.02
Table 12. Comparative accuracies of different age estimation methods after the same use of LCA-GAN (unit: years).

Method | MAE
VGG-16 [68] | 6.20
ResNet-50 [87] | 7.22
ResNet-152 [87] | 6.32
DEX [8] | 6.12
AgeNet [11,88] | 6.19
Inception with Random Forest [20] | 6.42
Table 13. Comparative average processing time of one image by LCA-GAN and state-of-the-art methods (unit: ms).

Method | Desktop Computer | Jetson TX2 Board
LCA-GAN | 11.94 | 177.52
AFD-StackGAN [54] | 23.03 | 342.8
CFR-GAN [55] | 35.6 | 541.5
MPRNet [83] | 21.12 | 318.02
MPRNet* [83] | 14.04 | 212.24
Pix2pix [57] | 11.2 | 171.5
CycleGAN [84] | 23.2 | 353.1
Table 14. Comparative model complexities of LCA-GAN and state-of-the-art methods.

Method | Number of Parameters | GFLOPs | Memory Usage (GB)
LCA-GAN | 57,118,684 | 1.4668 | 0.5913
AFD-StackGAN [54] | 102,325,149 | 2.6764 | 0.5961
CFR-GAN [55] | 171,588,876 | 2.8730 | 1.0925
MPRNet [83] | 102,725,856 | 9.1092 | 4.5598
MPRNet* [83] | 68,114,344 | 6.0728 | 3.0399
Pix2pix [57] | 57,196,292 | 0.972 | 0.2062
CycleGAN [84] | 114,392,584 | 1.944 | 0.4124
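The parameter counts in Table 14 and the per-image processing times in Table 13 can be approximated for any Keras model with a few lines; GFLOPs and memory usage require a profiler and are omitted here. This is a measurement sketch, not the authors' benchmarking script.

```python
import time
import numpy as np

def profile(model, input_shape=(1, 256, 256, 3), runs=100):
    """Return (parameter count, average per-image inference time in ms) for a Keras model."""
    dummy = np.random.rand(*input_shape).astype(np.float32)
    model(dummy)                                            # warm-up call
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    elapsed_ms = (time.perf_counter() - start) * 1000.0 / runs
    return model.count_params(), elapsed_ms
```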
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
