1. Introduction
In recent years, attendance systems, which record the attendance of employees, have become increasingly convenient to operate. There are many types of attendance systems, such as time clocks, inductive clocking, and biometric authentication systems. Among them, biometric authentication records attendance through fingerprints [1], face recognition [2,3], etc. Biometric identification relies on the unique characteristics of each person and operates without the need to carry any form of key. This method not only prevents a time card from being used by someone else but also avoids the problem of a lost card. Compared with other biometric modalities, face recognition is the most suitable because it requires neither contact nor additional actions. Unlike fingerprint or palm recognition, it does not require reaching out toward a sensor, and it also reduces the hygiene problem of multiple people touching the same device.
The rapid development of artificial intelligence (AI) and machine learning has driven many products to apply biometric recognition to attendance systems, such as fingerprint recognition, palm recognition, voiceprint recognition, and retinal recognition. The differences in biological characteristics between individuals serve as the basis for identification, achieving the effect of access control.
In the AI era, convolutional neural networks (CNNs) have been successfully applied to computer vision tasks such as image classification [4], face recognition [5,6], and image segmentation [7,8]. The success of AI is mainly attributed to advances in deep architectures, GPU computing, and large training datasets, which in turn have led to breakthroughs in face recognition. At present, computers can already perform better than humans in these tasks [9,10].
However, face recognition is quite complicated because the human face is highly variable. Recognition performance depends on viewing angle, position, light source, hairstyle, expression, lens characteristics, etc., all of which can cause considerable differences in the input image.
Recently, due to the spread of COVID-19 around the world, masks have become indispensable items in our lives. Governments encourage people to wear masks covering the mouth and nose in public to prevent the spread of infection. However, masks pose a huge challenge to face recognition systems [11], because most facial features are blocked. Alzu’bi et al. [12] showed that masked face recognition (MFR) has become a hot research area with many relevant studies. To achieve a non-contact attendance system, masked face recognition is needed. CNN-based designs consistently achieve high recognition rates on various datasets; however, existing datasets do not include faces wearing masks, so an intuitive approach is to add simulated masks to existing datasets and train a neural network that maintains high accuracy even when the mouth and nose are covered by a mask.
In addition, requiring people to stop for a separate temperature measurement is very time-consuming and can put them at risk of infection. At present, many techniques apply thermal imaging cameras to perform rapid screening of body temperature. These devices are mostly installed at the entrances or exits of public areas such as hospitals and schools. To reduce the risk of infection while measuring body temperature, a remote measurement system combined with person identification is needed.
To achieve remote measurement, non-contact infrared temperature modules such as the MLX90614 and MLX90615 can be applied. These modules detect the surface temperature of an object; however, measurement accuracy depends on the cost and size of the infrared sensor, and the result is strongly influenced by the distance between the sensor and the person. Considering the whole system, a face must first be detected in front of the attendance system before face recognition is performed. There are mainly two ways to perform face detection: using depth information to check whether a person's face is in front of the camera, or using a single RGB camera with a face detection algorithm on the input image. Using depth information to activate face recognition makes the overall pipeline faster. However, anything that passes in front of the camera and produces depth will activate the system; as a consequence, accuracy is poor and the false detection rate is high. Using a face detection algorithm on RGB images reduces the false detection rate, but the overall process is slower.
In this paper, a face recognition and temperature measurement system is proposed. It is designed as an attendance system for access control that instantly recognizes faces from continuous image input and measures body temperature at the same time. The proposed masked face recognition and temperature measurement are implemented on an embedded system. The key contributions are as follows:
The proposed system specifically deals with the problem of low accuracy for people wearing masks. We simulate masks on the VGGFACE2 [13] dataset and train FaceNet [14] to perform masked face recognition. The masked face recognition model achieves high accuracy on the LFW [15] and MFR2 [16] datasets.
Face detection and recognition based on deep learning have been implemented on a Raspberry Pi 4 embedded system to deal with real unconstrained scenes. By integrating a thermal imaging camera, we can perform face recognition and temperature measurement at the same time to assist in managing health status.
We design a user interface (UI) to make the attendance system more convenient. There are three search methods to choose from, namely, search by date, search by month, and search by interval.
This paper is organized as follows. Section 2 discusses the related works. Section 3 presents the proposed algorithm for face detection and recognition using the CNN approach. Section 4 describes the proposed attendance system with temperature measurement. Section 5 presents the system results and discussions. Section 6 concludes this paper.
2. Related Works
In this section, we introduce research on face detection and face recognition. A face recognition system requires several steps, including image preprocessing, face detection, and facial feature vector extraction. Finally, the extracted vector is compared with the facial feature vectors in the database to perform face recognition.
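To make the final comparison step concrete, the sketch below matches an extracted feature vector against enrolled vectors by Euclidean distance. This is a minimal illustration, not the implementation from any of the surveyed systems; the database layout and the threshold value are assumptions that must be tuned per model.

```python
import numpy as np

def recognize(embedding, database, threshold=1.1):
    """Match a face embedding against enrolled feature vectors.

    `database` maps person IDs to stored embeddings (1-D numpy arrays);
    the distance threshold is an assumed value for illustration.
    """
    best_id, best_dist = None, float("inf")
    for person_id, ref in database.items():
        dist = np.linalg.norm(embedding - ref)  # Euclidean distance
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    # Reject the match if even the closest enrolled vector is too far away
    return best_id if best_dist < threshold else None
```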
2.1. Face Detection
Face detection is a key step before many face applications, such as face recognition, face verification, and face tracking. In the past few years, face detection methods have developed considerably. Research on face detection builds on object detection. Traditional object detection algorithms are commonly realized by combining a feature extractor, such as HOG, SIFT, or the Haar wavelet, with a classifier, such as an SVM, linear regression, or a decision tree [17,18]. A survey of face detection using hand-crafted features can be found in [19,20]. In recent years, convolutional neural networks have achieved great results in object detection and image classification, which has inspired face detection to achieve better results through CNNs. Li et al. [21] used a cascade structure and a convolutional neural network to improve the accuracy and speed of detection; at the time, it achieved the highest score on the Face Detection Data Set and Benchmark (FDDB) [22].
A series of developments based on the region-based convolutional neural network (R-CNN) [23] was proposed, which uses a two-stage method for the object detection task. R-CNN uses selective search to find 2000-3000 region proposals, resizes them to the same size, sends them into a CNN to extract features, and classifies them with an SVM. After classification, linear regression is applied to refine the bounding box. Fast R-CNN [24] improves on R-CNN: whereas R-CNN feeds each of the region proposals into the CNN individually, requiring the computation of many overlapping regions, Fast R-CNN runs the CNN only once and reuses the extracted features for all region proposals. Region of interest pooling (RoI pooling) maps each proposal onto the feature map, and each pooled region is then passed to a fully connected (FC) network for softmax classification and bounding box regression. R-CNN and Fast R-CNN both rely on selective search, which is very time-consuming. A further improved version, called Faster R-CNN, was proposed in [25]; it directly uses a neural network to find positive anchors and generate region proposals. Compared to selective search, this is a great advancement.
Although two-stage object detection algorithms achieve great performance, they suffer from long latency and low speed. Several researchers have therefore developed single-shot models, such as YOLO [26], which uses a CNN to predict multiple bounding boxes simultaneously and calculates an object probability for each box, and the single-shot multi-box detector (SSD) [27], which uses convolutional layers of different depths to predict targets of different sizes, with the lower, higher-resolution layers handling small targets; that is, smaller anchors are set on the lower feature maps and larger anchors on the higher feature maps. Improved versions of YOLO [26] were subsequently proposed: YOLOv3 [28] introduced a residual architecture and FPN [29] to improve detection accuracy for small targets; YOLOv4 [30] introduced CSPNet [31], SPP [32], and PAN [33] to improve computation speed while maintaining accuracy; and YOLO5Face [34] refined the network architecture to improve accuracy in face detection. The widely used SSD architecture and other similar architectures predict bounding boxes on multiple layers [35,36,37]. These works execute object detection at 30 frames per second (fps) or more on a GPU. Compared to Fast R-CNN, the method proposed in [27] does not generate region proposals but computes the features in a single pass. As a consequence, it raises the speed of object detection to 59 fps on an NVIDIA Titan X.
2.2. Face Recognition
There are two main directions in face recognition: holistic features and local features. Holistic approaches treat the whole face as a single feature for recognition. Local feature approaches first identify local features on the face, usually the eyes, nose, and mouth, and then combine the individual results to produce the final decision, just as one can recognize a person by the whole face or sometimes by certain individual facial features.
Similar to face detection, face recognition has been strongly influenced by the rapid development of CNNs in recent years. At present, most methods with high recognition rates use convolutional neural networks to extract features and then perform recognition. DeepFace [38], developed by Facebook, set a benchmark far above traditional methods and pioneered feature extraction with a CNN. Its most distinctive aspect is the use of 3D models to align faces so that the subsequent CNN can be most effective. In DeepID3 [10], two very deep neural network architectures are proposed for face representation. These two architectures are constructed from the stacked convolution and inception layers proposed in VGGNet and GoogLeNet, making them suitable for face recognition. During training, joint face identification-verification supervisory signals are added to the intermediate and final feature extraction layers. The ensemble of the two proposed architectures reaches an accuracy of 99.53% on the LFW dataset.
FaceNet was proposed by Google in 2015. Unlike previous face classification networks, which directly output classification results, FaceNet outputs a quantified feature embedding. The extracted facial features are then compared with the facial features stored in the database to produce the result. Because FaceNet only extracts features, its advantage over traditional face classification networks is that it does not need to be retrained when new people are added to the database, whereas a traditional face classification network requires fine-tuning of the model's fully connected layer before it can recognize a new target. FaceNet is trained with a triplet loss, a type of ranking loss used to train samples with low variability. Each triplet consists of an anchor, a positive sample, and a negative sample; the loss makes the distance between the anchor and the positive sample as small as possible and the distance between the anchor and the negative sample as large as possible. FaceNet achieves 99.63% accuracy on the LFW dataset.
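As a minimal sketch of the idea (not FaceNet's actual training implementation), the triplet loss for a single triplet of embeddings can be written as follows; the margin of 0.2 follows the value reported in the FaceNet paper, and the plain-numpy form is purely illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss for one (anchor, positive, negative) embedding triplet."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to different identity
    # Hinge: penalize unless d_pos is at least `margin` smaller than d_neg
    return max(d_pos - d_neg + margin, 0.0)
```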
Center loss was proposed in [39]. It learns a center for the deep features of each class while penalizing the distance between each deep feature and its corresponding class center. Used together, softmax loss and center loss increase inter-class dispersion and intra-class compactness at the same time. Contrastive loss and triplet loss are also used to improve the discriminative ability of deep features; however, on large-scale datasets they face the problem of sampling an enormous number of sample pairs and triplets. Center loss uses the same training data format as softmax loss and does not require complex resampling of the training data.
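A minimal numpy sketch of the center-loss penalty described above; the weight `lam` and the batch-mean normalization are illustrative assumptions (the original formulation also updates the class centers during training, which is omitted here).

```python
import numpy as np

def center_loss(features, labels, centers, lam=0.003):
    """Penalize the distance of each deep feature to its class center.

    features: (N, D) deep features; labels: (N,) integer class indices;
    centers: (C, D) class centers; lam is an assumed weighting factor.
    """
    diffs = features - centers[labels]  # (N, D) feature-to-center offsets
    return 0.5 * lam * np.sum(diffs ** 2) / len(features)
```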
In real unconstrained scenes, there are still many challenges in the application of face recognition systems [40]. Even though ArcFace [41] is powerful, it only achieves an accuracy of 63.22% on the Real-World Masked Face Recognition Dataset (RMFRD) [42]. This result is reported in [43], where ArcFace was not retrained on that dataset. Although CNN-based methods achieve high accuracy in face recognition, they still fail to recognize people well when a mask is worn. An intuitive approach is to train the model on a dataset of masked faces. However, masked face recognition suffers from a lack of data, which can easily cause the model to overfit. Simulating masks on face datasets such as VGGFACE2 is one way to deal with this problem. In [16], a tool for effectively masking faces was proposed. The tool can generate a large dataset of masked faces, which can be used to train a face recognition system to the target accuracy on masked faces.
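As a rough sketch of the mask-simulation idea (the tool in [16] warps a textured mask image onto detected facial landmarks; the flat polygon fill below is a simplified stand-in, and the landmark input is an assumption):

```python
import cv2
import numpy as np

def simulate_mask(image, lower_face_points):
    """Overlay a flat-colored mask polygon over the lower face.

    `lower_face_points` is assumed to be a list of (x, y) landmarks
    outlining the nose bridge, cheeks, and chin; real tools warp a
    textured mask image onto these landmarks instead of a flat fill.
    """
    out = image.copy()
    polygon = np.array(lower_face_points, dtype=np.int32)
    cv2.fillPoly(out, [polygon], (200, 200, 200))  # light gray "mask"
    return out
```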
4. Overall Attendance System
The overall attendance system is designed with the face recognition system discussed in the last section. This section discusses the temperature measurement module and the integrated system design including the user interface.
4.1. Temperature Measurement
Infrared is an electromagnetic wave with a wavelength between microwave and visible light, ranging from 760 nm to 1 mm. From the Stefan–Boltzmann law, we know that the total radiation emitted from the surface of an object is proportional to the fourth power of its absolute temperature, so the temperature of an object can be calculated by measuring its total radiated energy. We analyzed two temperature sensors to decide on the proper one to combine with the proposed attendance system.
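For reference, the Stefan–Boltzmann law states that the radiated power per unit surface area of an ideal black body is

```latex
j^{*} = \sigma T^{4}, \qquad \sigma \approx 5.67 \times 10^{-8}\ \mathrm{W\,m^{-2}\,K^{-4}},
```

where T is the absolute temperature in kelvin; real surfaces additionally carry an emissivity factor no greater than one.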
We first investigated the non-contact temperature sensor chips MLX90614 and MLX90615. These chips can detect the surface temperature of an object at a distance of about five centimeters with an error of about 0.2 degrees Celsius, and can be mounted at an appropriate position to measure the target surface. However, the MLX90614 measures the average temperature over the circular area covered by its infrared field of view. To obtain an accurate reading, this measurement spot must not be larger than the surface of the object under test; otherwise, the value will be distorted. For example, at longer distances the projection on the surface of the object (such as a human face) may include lower-temperature regions such as glasses, hair, eyebrows, and the nose, making the measured temperature lower than the real body temperature.
Figure 4 shows the impact of distance on the temperature measurement. When the measurement distance is within 5 cm, the result is close to the real body temperature. However, when the measurement distance exceeds 5 cm, the detected area may include low-temperature regions, which makes the result inaccurate.
Then, we considered another module, the FLIR Radiometric Lepton 2.5. Most thermal imaging cameras can measure at longer distances, although with lower accuracy; one example is the FLIR Radiometric Lepton thermal imaging module available for the Raspberry Pi. The raw data output by the Lepton 2.5 is 14 bits, and the value equals the temperature in kelvin multiplied by 100.
The comparison of the two thermal sensors is shown in Table 3. Although the accuracy of the MLX90614 is higher than that of a thermal imaging camera, its measurement distance must be very short. If the MLX90614 were used as the temperature measuring device in our system, each person would have to be prompted to move very close to the sensor, which is not user-friendly. Therefore, our system uses the FLIR Radiometric Lepton 2.5 for body temperature measurement.
With the FLIR Radiometric Lepton 2.5, the temperature measurement error can be up to 5 degrees Celsius. Thus, we need to fine-tune the results according to environmental conditions such as room temperature and humidity. The measurement result can be expressed as (2):

T(°C) = V_raw / 100 - 273.15 + T_offset, (2)

where V_raw is the raw value detected by the thermal imaging camera and T_offset is an offset fine-tuned according to the environment. The raw value is first divided by 100 and then reduced by 273.15 to convert it into degrees Celsius; finally, the result is fine-tuned with the environmental offset.
The proposed system converts the value from the thermal imaging camera into the Celsius temperature scale and then compensates the temperature according to actual measured body temperatures. The actual body temperatures of different people are measured in the same environment, and the required compensation value is calculated. With this compensation, the absolute temperature error is about one degree Celsius.
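A minimal sketch of the conversion and compensation in (2), assuming the raw 14-bit radiometric value has already been read from the sensor; the offset parameter is the empirically calibrated environmental compensation described above.

```python
def lepton_raw_to_celsius(raw_value, offset=0.0):
    """Convert a Lepton 2.5 radiometric value (kelvin * 100) to Celsius."""
    kelvin = raw_value / 100.0   # undo the x100 scaling
    celsius = kelvin - 273.15    # kelvin -> Celsius
    return celsius + offset      # empirical environmental compensation

# Example: a raw reading of 31010 -> 310.10 K -> 36.95 degrees Celsius
temperature = lepton_raw_to_celsius(31010, offset=0.0)
```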
4.2. User Interface
For this application, we designed a personnel management system using the face recognition network. The attendance system is shown in Figure 5. The system consists of a UI and a MySQL-based database management system, which allows real-time monitoring of personnel data and matching of IDs against the dataset. Matched IDs are sent to the database, where information such as time and temperature is stored, while unmatched IDs are placed on the visitor list. In addition, the UI includes a historical access mode, which retrieves the records of the relevant personnel from the database to confirm their attendance status.
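As a hedged sketch of the database write for a matched ID (the table schema, credentials, and connector shown here are assumptions for illustration, not the paper's actual implementation):

```python
import mysql.connector  # mysql-connector-python

def log_attendance(person_id, temperature_c):
    """Insert one attendance record; schema and credentials are hypothetical."""
    conn = mysql.connector.connect(
        host="localhost", user="attendance", password="secret",
        database="attendance_db",
    )
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO records (person_id, ts, temperature) "
        "VALUES (%s, NOW(), %s)",
        (person_id, temperature_c),
    )
    conn.commit()
    conn.close()
```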
There are three query methods to choose from, namely, search by date, search by month, and search by interval. Search by date outputs the times at which the person passed through the system on that day, while search by month/interval outputs the records of the month/interval for each person, including leave, sick leave, etc. Additionally, clicking the orange Excel export button exports an Excel report. Individual details can be displayed by clicking the name button in the data column in the blue box on the right.
Figure 6 shows the face recognition execution screen. The main screen is shown on the left, containing the images captured by the webcam, face bounding boxes, and labels. The upper right and middle right of the screen show the images and label data from the previous few detections, which preserve information that would otherwise be lost when a person moves quickly past the camera. The bottom right of the figure shows the currently detected label data.