Article

Research and Optimization of a Lightweight Refined Mask-Wearing Detection Algorithm Based on an Attention Mechanism

1 School of Science, Hubei University of Technology, Wuhan 430068, China
2 Detroit Green Technology Institute, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(8), 1911; https://doi.org/10.3390/electronics12081911
Submission received: 11 March 2023 / Revised: 14 April 2023 / Accepted: 17 April 2023 / Published: 18 April 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

To address the current problems of the incomplete classification of mask-wearing detection data, missed detections of small targets, and the insufficient feature extraction capabilities of lightweight networks dealing with complex faces, a lightweight method with an attention mechanism for detecting mask wearing is presented in this paper. This study incorporated an “incorrect_mask” category into the dataset to address the incomplete classification. Additionally, the YOLOv4-tiny model was enhanced with an additional prediction feature layer and feature fusion, expanding the detection scale range and improving the performance on small targets. A CBAM attention module was then introduced into the feature enhancement network, which re-screens the feature information of the region of interest to retain important features and improve the feature extraction capabilities. Finally, a focal loss function and an improved mosaic data enhancement strategy were used to enhance the target classification performance. The experimental results on the three-category task demonstrate that the lightweight model’s detection speed was not compromised while achieving a 2.08% increase in the average classification precision, which was only 0.69% lower than that of the YOLOv4 network. This approach therefore effectively improves the mask-wearing detection performance of the lightweight network.

1. Introduction

Since the emergence of the novel coronavirus, wearing a mask has become a norm to prevent the virus from spreading [1,2]. Recently, automatic detection has become the mainstream method for monitoring mask wearing, primarily due to rapid advances in artificial intelligence technology. Existing detection algorithms mainly fall into two categories: two-stage algorithms such as R-CNN [3] and Faster R-CNN [4], and one-stage algorithms such as SSD [5] and the YOLO [6,7,8,9,10] series. These algorithms make the automated monitoring of mask-wearing with artificial intelligence technology feasible.
In the mask dataset, the RMFD [11] dataset proposed by Wuhan University is a valuable resource for recognition tasks using mask data. However, it does not include a category for masks worn incorrectly in real-world environments. An alternative dataset, the IMFD [12], covers incorrectly worn masks but was synthesized by an algorithm and may not perform well in real-world detection tasks. Vrigkas et al. [13] proposed the publicly available annotated image database “FaceMask”, which performs well. However, the dataset only provides a binary classification of “Mask” or “No_Mask” categories.
Among mask-wearing detection algorithms, Niu et al. [14] incorporated an attention mechanism into the RetinaFace [15] face detection algorithm, improving mask-wearing detection performance. However, the algorithm’s detection speed is relatively low, making it difficult to meet real-time requirements. Wang et al. [16] improved the YOLOv3 algorithm by introducing the spatial pyramid pooling structure [17] for mask detection, achieving a small improvement in detection accuracy and speed. However, the datasets used had relatively uniform backgrounds, the algorithm struggles with complex mask detection tasks, and its detection speed is still unsatisfactory. Zhu et al. [18] integrated the spatial pyramid pooling structure and the path aggregation network [19] into the YOLOv4-tiny algorithm to increase the detection accuracy for mask detection tasks. However, after adding the two structures, the algorithm missed many small targets in complex mask detection scenarios and became slower. Kocacinar et al. [20] proposed a lightweight facial mask recognition system deployed on mobile devices for face mask detection and identity recognition, achieving 99.96% and 82.65% accuracy, respectively. However, this system was only evaluated on individual images and may not perform effectively in crowded public spaces. Wei et al. [21] proposed a Mask-YOLO algorithm based on YOLOv3, which introduces channel attention and a complete intersection over union (CIoU) loss to address the decreased detection accuracy caused by occlusion, density, and small-scale objects. However, the algorithm still suffers from missed and false detections of small targets, and the data samples used were relatively limited. Duan et al. [22] presented an RMPC algorithm for mask-wearing detection based on YOLOv7, which enhances the model’s accuracy by changing the stacking order of the max pooling and convolution layers to integrate feature information more fully. Nevertheless, this method increases the model’s parameter count and reduces the network’s detection speed. To effectively reduce the spread of COVID-19, Endris et al. [23] established a dataset containing improperly worn masks and trained and tested the YOLOX [24] algorithm on it, achieving good detection performance. However, most of the data in the dataset were artificially synthesized and did not reflect real-world mask-wearing conditions. Bo et al. [25] replaced the backbone feature extraction network of YOLOv3 with ShuffleNetv2 [26] and introduced the SKNet [27] attention mechanism in the feature enhancement network. This improved the detection speed at the cost of 1.01% in detection accuracy, yet the speedup was only 34 FPS. Zhang et al. [28] used MobileNetV2 [29] as the feature extraction network based on YOLOv2, which simplified the network model and improved the training speed. However, the division of the dataset was incomplete and the number of parameters remained excessive, resulting in suboptimal detection speed.
To effectively and efficiently detect people wearing masks in public places, suitable data and efficient algorithms are required. However, current mask datasets lack a category for incorrectly worn masks in real-world public environments. Among detection algorithms, complex target detection models such as YOLOv3 improve detection accuracy but are slow and unsuitable for deployment on mobile devices, while lightweight networks such as YOLOv4-tiny offer fast detection but may miss small targets and extract insufficient features in complex scenarios. To address these issues, this study presents an enhanced lightweight detection approach based on YOLOv4-tiny. First, a new dataset was constructed by dividing the mask data into three categories: not wearing a mask, wearing a mask incorrectly, and wearing a mask correctly. Second, to improve the detection capability of YOLOv4-tiny for small targets, a prediction feature layer was added through feature fusion, which expands the detection scale and increases the receptive field. Additionally, to improve the feature extraction ability of the model and increase its attention toward facial regions, we introduced a convolutional block attention module (CBAM) [30] in the enhanced feature extraction network. Finally, the focal loss [31] function and an improved mosaic data enhancement strategy were used to boost the target classification accuracy.

2. Related Principles

2.1. YOLOv4-Tiny Algorithm

YOLOv4-tiny [32] is a simplified version of YOLOv4 that removes the spatial pyramid pooling structure and path aggregation network, resulting in a network with one-tenth the number of parameters of YOLOv4. By employing CSPDarknet53-tiny as the backbone feature extraction network and replacing the Mish [33] activation function with the LeakyReLU [34] activation function, YOLOv4-tiny simplifies its network structure while enhancing the detection speed. YOLOv4-tiny utilizes the feature pyramid network (FPN) [35], which requires fewer parameters than YOLOv4’s PAN network. Figure 1 illustrates the network structure of YOLOv4-tiny.
The enhanced feature extraction network of YOLOv4-tiny is simple and significantly reduces the number of model parameters. However, we found that the network only outputs two prediction feature layers, with sizes of 256 × 26 × 26 and 512 × 13 × 13. Because the model has only two prediction feature layers, it cannot retain the semantic information of shallow feature layers, leading to a decrease in the detection resolution of objects. This affects the detection of small-size targets and may also cause inaccurate boundary localization for large targets. At the same time, the network has a small receptive field, meaning it can only perceive minor local information in an image and cannot obtain a broader global context. This reduces the network’s ability to recognize and classify objects. Figure 2 shows the FPN structure in YOLOv4-tiny.
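For concreteness, a minimal Keras-style sketch of this two-branch FPN head is given below. The layer names, channel widths, and anchor/class counts are illustrative assumptions rather than the exact YOLOv4-tiny graph:

```python
from tensorflow.keras import layers

def conv_leaky(x, filters, kernel):
    """Convolution followed by the LeakyReLU activation used in YOLOv4-tiny."""
    x = layers.Conv2D(filters, kernel, padding='same')(x)
    return layers.LeakyReLU(alpha=0.1)(x)

def tiny_fpn_head(feat_26, feat_13, num_anchors=3, num_classes=3):
    """Two-scale head sketch: the deep 13x13 map feeds one detector directly
    and is also up-sampled and concatenated with the 26x26 map for the second.
    Each detection conv predicts (4 box offsets + 1 objectness + classes)
    per anchor."""
    out_ch = num_anchors * (5 + num_classes)

    p5 = conv_leaky(feat_13, 512, 3)
    out_13 = layers.Conv2D(out_ch, 1)(p5)                    # coarse 13x13 output

    up = layers.UpSampling2D(2)(layers.Conv2D(128, 1)(p5))   # 13x13 -> 26x26
    p4 = conv_leaky(layers.Concatenate()([feat_26, up]), 256, 3)
    out_26 = layers.Conv2D(out_ch, 1)(p4)                    # finer 26x26 output
    return out_13, out_26
```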

2.2. CBAM Attention Module

The convolutional block attention module (CBAM) combines channel and spatial attention modules to process incoming feature layers effectively. The channel attention mechanism separately applies global average pooling and global maximum pooling operations. The resulting pooled features are then passed through a fully connected layer, summed, and normalized to obtain weights for each channel in the original input feature layer. Finally, each channel of the original input feature layer is multiplied by its corresponding weight. It can be considered a form of weight calculation that enhances the helpful feature channel responses while suppressing the useless channel responses by computing the attention weights. Therefore, this mechanism improves the feature extraction capability of the model.
The spatial attention mechanism computes the maximum and average values of each channel at each feature point of the input feature layer. These values are stacked and passed through a 1 × 1 convolutional layer for channel adjustment. The resulting weights are applied to the input feature layer by multiplication, producing a feature map with spatial weights. This can be considered a weighting of the spatial dimensions. The responses of useful local features can be amplified by multiplying the attention weights by the original feature maps while suppressing the useless background noise. Consequently, this mechanism significantly improves the detection accuracy of the model. Figure 3 displays the structure of this mechanism.
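As a concrete illustration, the following is a minimal Keras sketch of a CBAM block following the description above. The reduction ratio is an assumed hyperparameter, and the spatial-branch kernel size is a design choice: the original CBAM paper uses a 7 × 7 convolution, while the text above describes a 1 × 1 channel-adjustment convolution.

```python
from tensorflow.keras import layers, backend as K

def cbam_block(x, ratio=8):
    """Sketch of a CBAM module: channel attention followed by spatial attention."""
    channels = int(x.shape[-1])

    # Channel attention: a shared two-layer MLP over global average- and
    # max-pooled descriptors; the summed outputs are normalized by a sigmoid.
    shared_1 = layers.Dense(channels // ratio, activation='relu')
    shared_2 = layers.Dense(channels)
    avg = shared_2(shared_1(layers.GlobalAveragePooling2D()(x)))
    mx = shared_2(shared_1(layers.GlobalMaxPooling2D()(x)))
    ca = layers.Activation('sigmoid')(layers.Add()([avg, mx]))
    x = layers.Multiply()([x, layers.Reshape((1, 1, channels))(ca)])

    # Spatial attention: per-pixel mean and max across channels are stacked
    # and fused by a convolution into a single spatial weight map.
    avg_map = layers.Lambda(lambda t: K.mean(t, axis=-1, keepdims=True))(x)
    max_map = layers.Lambda(lambda t: K.max(t, axis=-1, keepdims=True))(x)
    sa = layers.Conv2D(1, 7, padding='same', activation='sigmoid')(
        layers.Concatenate()([avg_map, max_map]))
    return layers.Multiply()([x, sa])
```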

2.3. Data Enhancement

In the field of image recognition, common data augmentation techniques include flipping, scaling and color space manipulation. Mosaic data augmentation, on the other hand, involves randomly flipping, scaling and applying color space manipulation to four images before stitching them together to form a composite image. The advantage of this data enhancement is that it enhances the robustness and the generalization ability of the model. By generating large images containing diverse scenes and objects, the model must learn how to differentiate and detect objects in various regions and precisely classify and identify them during prediction, thus adapting better to complex real-world scenarios. Additionally, this can reduce the model’s training time and computational overhead, resulting in more efficient and effective detection results for practical applications. Figure 4 displays a picture after mosaic data enhancement.
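A naive NumPy sketch of the mosaic idea follows, assuming four same-format uint8 images; a real detector pipeline must also remap the bounding-box labels into the stitched canvas, which is omitted here:

```python
import numpy as np

def mosaic(images, out_size=416, rng=np.random):
    """Randomly flip four HxWx3 uint8 arrays, then paste them into the four
    quadrants around a random split point (crude stand-in for scaling)."""
    assert len(images) == 4
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = int(out_size * rng.uniform(0.3, 0.7))  # random vertical split
    cy = int(out_size * rng.uniform(0.3, 0.7))  # random horizontal split
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x0, y0, x1, y1) in zip(images, regions):
        if rng.rand() < 0.5:                    # random horizontal flip
            img = img[:, ::-1]
        h, w = y1 - y0, x1 - x0
        # Nearest-neighbour resize of the image into its quadrant.
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y0:y1, x0:x1] = img[ys][:, xs]
    return canvas
```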

3. YOLOv4-Tiny Mask Detection Algorithm Optimization

The YOLOv4-tiny algorithm has been extensively applied in object detection due to its lightweight characteristic, which enables it to operate efficiently on resource-constrained devices. However, as this algorithm sacrifices some accuracy, it faces challenges such as missing small targets and an insufficient feature extraction capability. To address these issues, this paper proposes a YOLOv4-tiny algorithm that has been improved from the perspectives of algorithm optimization and training strategy while ensuring its detection speed and accuracy. Specifically, we introduced an additional feature layer based on the two existing prediction feature layers in the YOLOv4-tiny algorithm, adopted a bottom-up feature fusion method, and integrated five CBAM attention mechanism modules into the feature enhancement network. Additionally, we replaced the original confidence loss function with a focal loss function and used an improved mosaic data augmentation strategy to train the model and improve its detection performance.

3.1. Improving the Small-Target Detection Capability

The YOLOv4-tiny model performs detection only on feature maps that have been down-sampled 16× and 32×. Each down-sampling operation halves the feature map’s size, reducing the resolution and losing feature information for small targets. As the number of down-sampling layers increases, the feature map shrinks and carries less and less information. This limits the information available for small targets, leading to missed detections, particularly in complex mask detection scenarios containing many easily missed small targets. We propose expanding the detection range for small targets by adding a prediction feature layer to the original network after the 8× down-sampling stage. To enhance the network’s small-target detection ability, we incorporated the FPN structure by merging predicted feature maps from three scales to enrich the feature information. The modified network is called the YOLOv4-tiny-3 model, and its structure is illustrated in Figure 5.
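Reusing conv_leaky from the earlier sketch, the three-scale head might look as follows; the extra 52 × 52 branch (8× down-sampling) is the addition described above, while the channel widths remain illustrative assumptions:

```python
def tiny_fpn3_head(feat_52, feat_26, feat_13, num_anchors=3, num_classes=3):
    """Three-scale head sketch: on top of the 13x13 and 26x26 branches, the
    fused 26x26 features are up-sampled once more and merged with the 52x52
    map (8x down-sampling), adding a finer detection scale for small faces."""
    out_ch = num_anchors * (5 + num_classes)

    p5 = conv_leaky(feat_13, 512, 3)
    out_13 = layers.Conv2D(out_ch, 1)(p5)

    up5 = layers.UpSampling2D(2)(layers.Conv2D(128, 1)(p5))
    p4 = conv_leaky(layers.Concatenate()([feat_26, up5]), 256, 3)
    out_26 = layers.Conv2D(out_ch, 1)(p4)

    up4 = layers.UpSampling2D(2)(layers.Conv2D(64, 1)(p4))   # new top-down step
    p3 = conv_leaky(layers.Concatenate()([feat_52, up4]), 128, 3)
    out_52 = layers.Conv2D(out_ch, 1)(p3)                    # new 52x52 output
    return out_13, out_26, out_52
```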

3.2. Introducing the Attention Mechanism

Despite the improved detection speed and reduced parameter count achieved by YOLOv4-tiny, its feature extraction capability is drastically diminished. Furthermore, adding a layer of large-scale predictive feature maps to the YOLOv4-tiny algorithm introduces shallower features whose semantic information is weaker, making it easier for the model to confuse targets with similar objects and to misidentify background regions, which affects the overall detection accuracy of the model. We propose introducing the CBAM attention module into YOLOv4-tiny’s feature enhancement network to address the reduced feature extraction capability. The CBAM module enhances the feature map by assigning attention weights in both the channel and spatial dimensions, thereby changing the weight distribution of the original features to emphasize effective ones. It helps the model to better understand the spatial and channel relationships of the target, leading to improved recognition while suppressing false detections in the background. Overall, this improves the detection accuracy and the robustness of the model. For this paper, we added five CBAM modules to the feature enhancement network of the YOLOv4-tiny-3 algorithm to iteratively change the weight distribution of the original feature map, enhance face region attention, and improve network detection accuracy. The network framework of the algorithm is displayed in Figure 6.
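The text states that five CBAM modules sit in the feature enhancement network but does not pin down their exact positions. The sketch below (reusing cbam_block, conv_leaky, and the layers import from the earlier sketches) shows one plausible, assumed layout: one module per backbone output and one on each fused map.

```python
def attended_fpn3_head(feat_52, feat_26, feat_13, num_anchors=3, num_classes=3):
    """Hypothetical placement of the five CBAM modules: one per backbone
    output (3) plus one on each fused feature map (2). The paper gives the
    count but not the exact positions."""
    out_ch = num_anchors * (5 + num_classes)

    p5 = conv_leaky(cbam_block(feat_13), 512, 3)              # CBAM 1
    out_13 = layers.Conv2D(out_ch, 1)(p5)

    up5 = layers.UpSampling2D(2)(layers.Conv2D(128, 1)(p5))
    p4 = layers.Concatenate()([cbam_block(feat_26), up5])     # CBAM 2
    p4 = conv_leaky(cbam_block(p4), 256, 3)                   # CBAM 3 (fused map)
    out_26 = layers.Conv2D(out_ch, 1)(p4)

    up4 = layers.UpSampling2D(2)(layers.Conv2D(64, 1)(p4))
    p3 = layers.Concatenate()([cbam_block(feat_52), up4])     # CBAM 4
    p3 = conv_leaky(cbam_block(p3), 128, 3)                   # CBAM 5 (fused map)
    out_52 = layers.Conv2D(out_ch, 1)(p3)
    return out_13, out_26, out_52
```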

3.3. Training Strategy Optimization

The mask dataset used in this paper was highly unbalanced, with a significantly higher number of targets not wearing masks than targets wearing masks incorrectly. This posed a risk of the model overfitting the “not wearing masks” category and underfitting the “not wearing masks correctly” category during training. Furthermore, the dataset contained targets of different complexities within the same category, which can bias a prediction model towards easy-to-classify samples and decrease the performance on hard-to-classify ones, thereby reducing classification ability. We replaced the original confidence loss function with a focal loss function to address this issue. This function is based on the cross-entropy loss with an additional category weight coefficient and modulating factor, improving the model’s ability to identify and generalize over minority categories, focus more on hard-to-classify samples, and reduce the influence of easy-to-classify ones during training.
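A minimal Keras-backend sketch of a binary focal loss for the confidence term follows, with the usual α (category weight) and γ (modulating factor) hyperparameters. The exact values used in this paper are not stated, so the defaults below follow the original focal loss paper:

```python
from tensorflow.keras import backend as K

def focal_loss(alpha=0.25, gamma=2.0):
    """Binary focal loss sketch: alpha balances positive vs. negative
    samples; (1 - pt)^gamma down-weights easy examples so training
    focuses on hard-to-classify ones."""
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        pt = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)   # prob of true class
        at = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)     # class weight
        return -K.mean(at * K.pow(1.0 - pt, gamma) * K.log(pt))
    return loss
```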
To further enrich the image background, improve model robustness and enhance training efficiency, we propose an improved mosaic data enhancement method in this paper. This method synthesizes the four original images into six images that are then sent to the model for training. Given that images generated by the mosaic method differ from the real distribution of natural images, we applied mosaic data enhancement with a 50% probability at each step and only during the first 70% of the epochs. This approach enriched the image background while preserving the natural image distribution and improving model robustness. Figure 7 displays a picture after improved mosaic data enhancement.
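The schedule itself is easy to state in code; this small sketch assumes a NumPy random generator and a per-step check:

```python
import numpy as np

def use_mosaic(rng, epoch, total_epochs, prob=0.5, cutoff=0.7):
    """Schedule described above: mosaic is applied with 50% probability per
    training step, and only during the first 70% of the epochs, so the
    final epochs see only the natural image distribution."""
    return epoch < cutoff * total_epochs and rng.rand() < prob

# Example usage:
#   rng = np.random.RandomState(0)
#   use_mosaic(rng, epoch=10, total_epochs=100)   # True roughly half the time
#   use_mosaic(rng, epoch=80, total_epochs=100)   # always False (past cutoff)
```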

4. Experimental Results and Analysis

4.1. Datasets

The lack of well-developed public datasets for mask detection in public space scenarios, particularly for incorrectly worn masks, poses a challenge for current research in this area. To address this issue, we collated and constructed a three-category dataset consisting of correctly worn masks (face_mask), faces without masks (face), and improperly worn masks (incorrect_mask). First, we selected 7393 images of targets without masks and 5094 images of targets with correctly worn masks from CMFD and RMFD. Additionally, we obtained 2000 artificially synthesized images of incorrectly worn masks from the IMFD dataset. To augment the real data of incorrectly worn masks in public places, we captured numerous photos containing incorrectly worn masks and, after screening and sorting, obtained 2625 high-quality, diverse and realistic images. All the above data were sorted and labeled to generate a complete mask dataset comprising 17,112 images. Images containing face targets included many small objects, and the photograph backgrounds were complex.

4.2. Experimental Environment

The experiments were conducted in a Keras environment; the experimental platform configuration is presented in Table 1.

4.3. Experimental Protocols

To test the algorithm’s effectiveness, we validated it experimentally via the following aspects: the P-R curve, the precision rate, the recall rate, the AP value, the detection speed, the actual scene effect, and an ablation experiment. A P-R curve provides an intuitive reflection of a classifier’s performance, while precision and recall rates measure accuracy and completeness. An AP value is a comprehensive evaluation index for precision and recall rates. The detection speed is critical to a classifier’s real-time performance, and the actual scene effect measures its practicality. An ablation experiment validated the improvements implemented in this paper. AP and mAP values are calculated as follows:
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}$$
We measured each category’s AP value by calculating the area under its precision-recall curve P(R). The mAP is the average AP over all classes, with N representing the number of categories; in this study, N = 3.
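For reference, a small NumPy sketch of this computation is given below, using the common monotone-envelope (“all-point”) interpolation of the P-R curve; the paper does not state which interpolation it uses, so this is an assumption:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the P-R curve, with the precision curve first
    replaced by its monotonically decreasing envelope (VOC-style).
    `recall` must be sorted ascending, matched elementwise to `precision`."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):          # monotone envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]           # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP values (N = 3 classes in this paper)."""
    return float(np.mean(ap_per_class))
```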

4.4. Analysis of the Experimental Results

4.4.1. Comparison of P-R Curves

We compared the P-R curves of our proposed algorithm, YOLOv4-tiny, YOLOv4-tiny-3 and the algorithm of [18] for face target detection under the same experimental conditions and methodology, as shown in Figure 8. The results indicate that the YOLOv4-tiny-3 algorithm outperforms YOLOv4-tiny and [18] on unmasked targets (face), as measured by the P-R curve and the area it encloses with the axes, which represents the AP value. It can be observed that after adding a predictive feature layer to expand the detection scale, the algorithm retains the feature information of small targets after down-sampling, which improves small target recognition and enhances network detection accuracy. Our proposed algorithm outperformed the YOLOv4-tiny-3 algorithm in terms of accuracy on face targets. By using an attention mechanism within the feature-enhanced network, our algorithm filters out non-critical target information and concentrates on the vital face region, improving the network’s ability to detect faces with increased accuracy. Our proposed algorithm’s performance on correctly worn mask targets (face_mask) and incorrectly worn mask targets (incorrect_mask) was in line with the other two algorithms, primarily because the YOLOv4-tiny algorithm already achieves high detection accuracy on these targets, leaving limited room for improvement.

4.4.2. Comparison of Precision, Recall and AP Values

The precision rate, recall rate and AP values of the algorithm in this paper, YOLOv4, YOLOv4-tiny, YOLOv4-tiny-3 and the algorithm of [18] for the face, face_mask and incorrect_mask targets are presented in Table 2. In contrast to the YOLOv4-tiny algorithm, the YOLOv4-tiny-3 algorithm demonstrated a precision rate on face targets that increased by 6.09%, from 86.21% to 92.30%, with a small increase in the precision rate on the other targets. Additionally, its AP value increased from 77.64% to 80.93%, an increase of 3.29%, and remained at the same level for the other targets. In contrast to the YOLOv4-tiny-3 algorithm, our proposed algorithm demonstrated a precision rate on face targets that increased by 0.88%, from 92.30% to 93.18%. The AP value increased from 80.93% to 83.01%, an increase of 2.08%, and there was a small increase in the AP values for the other targets. The algorithm in this paper improved the AP values for the three targets by 4.36%, 0.68%, and 1.02%, respectively, compared with [18]. Compared with YOLOv4, our proposed algorithm fell short by 2.04% and 0.32% in the AP values for the face and incorrect_mask targets, respectively, while achieving a relatively higher AP value for the face_mask target. This was because the face targets were more abundant in our dataset and their image backgrounds were more complex, so a deeper network structure such as YOLOv4 can improve detection performance under complex backgrounds; for relatively simple data, the advantage of YOLOv4 is insignificant. Therefore, adding a predictive feature layer on the basis of YOLOv4-tiny enhanced the network’s detection precision by solving the problem of many small targets being missed in complex scenes, and introducing the CBAM attention mechanism within the feature-enhanced network mitigated the lightweight network’s weak feature extraction ability and further improved the network’s detection accuracy.

4.4.3. Comparison of Detection Speed

Table 3 presents the mAP, the FPS and the number of model parameters for YOLOv4, YOLOv4-tiny, YOLOv4-tiny-3, the algorithm of [18] and our proposed algorithm. The results indicate that, compared to the YOLOv4-tiny algorithm, the mAP of the YOLOv4-tiny-3 algorithm increased by 1%, from 90.97% to 91.97%. The mAP of the algorithm in this paper increased by a further 1.08% over YOLOv4-tiny-3 and by 2.08% over YOLOv4-tiny, with only an 8.21 frame·s−1 decrease in detection speed relative to YOLOv4-tiny. The difference in mAP compared to the YOLOv4 algorithm was only 0.69%, while the number of parameters decreased by 90.27% and the detection speed increased from 11.29 frame·s−1 to 70.22 frame·s−1, an increase of 58.93 frame·s−1. This is because the YOLOv4 network has more layers and a more complex model, providing a stronger ability to extract features from complex face data, but its many convolutional layers lead to a significant decrease in detection speed. Compared with [18], the mAP value improved by 2.01% and the detection speed increased from 67.20 frame·s−1 to 70.22 frame·s−1. These results demonstrate that our proposed algorithm significantly enhances detection accuracy in mask-wearing detection scenarios using a lightweight network while maintaining a fast detection speed.

4.4.4. Comparison of Actual Scene Effects

Figure 9 illustrates a comparison of the detection outcomes of YOLOv4-tiny and our proposed algorithm for various scene images, demonstrating the effectiveness of our algorithm in real-world scenarios. The results indicate that the YOLOv4-tiny algorithm missed small targets in images (a) and (b). In contrast, this paper’s algorithm detected them in images (d) and (e) thanks to the additional large-scale predictive feature layer and FPN feature fusion. These enhancements enabled the model to retain small target feature information and obtain a broader range of contextual semantic details, resulting in improved detection capabilities.
Additionally, image (c) shows that the YOLOv4-tiny algorithm incorrectly detected an improperly masked target as correctly masked, while image (f) demonstrates that this paper’s algorithm detected all targets accurately. This improvement resulted from integrating an attention mechanism into the enhanced feature extraction network of YOLOv4-tiny, along with the utilization of focal loss and an improved mosaic data enhancement strategy. These enhancements effectively enhanced the model’s feature extraction ability.

4.4.5. Ablation Experiments

To validate the effectiveness of the improved strategy for the YOLOv4-tiny algorithm and training, we designed several sets of ablation experiments and their results are presented in Table 4. A “√” in the table indicates the use of the structural modification method or training strategy, while “×” indicates that the operation was not performed.
The results show that in the first group of experiments, adding one prediction feature layer to the YOLOv4-tiny algorithm resulted in an mAP value of 91.25%, indicating that the algorithm retained more small-target semantic information, and further feature fusion improved its detection ability for small targets. In the second group of experiments, the attention module was introduced in the enhanced feature extraction network of the YOLOv4-tiny algorithm, resulting in an mAP value of 91.57%, indicating an improved feature extraction ability.
Groups 3 and 4 added focal loss and the improved mosaic data enhancement strategy, respectively, on top of the combined structural improvements of groups 1 and 2. The mAP values improved by 0.72% and 0.85%, respectively, indicating that this paper’s training strategy was effective and alleviated the negative impact of the dataset imbalance. Group 5 incorporated all the improved strategies, resulting in an mAP of 93.05%.

5. Conclusions

The data classification for mask-wearing detection is incomplete, and lightweight networks have limitations in detecting small targets and extracting features. This paper introduces a novel approach for lightweight mask-wearing detection based on an attention mechanism. The mask dataset was subdivided, and a category for incorrectly worn masks was added alongside the data for unworn and correctly worn masks. Our proposed algorithm expanded the detection scale range of YOLOv4-tiny and fused deep semantic information with shallow feature information to improve the accuracy of small target detection. Additionally, to enhance feature extraction in lightweight networks, our approach incorporated five attention modules into the YOLOv4-tiny feature enhancement network for the more efficient screening of feature information in regions of interest. Training and testing were conducted using the three-category dataset, and the results showed that, compared with YOLOv4-tiny, this paper’s algorithm achieved a 6.97% improvement in precision for face targets, reached a high mAP of 93.05%, and maintained a detection speed of 70.22 FPS. Compared with the YOLOv4 algorithm, the mAP differed by only 0.69%, but the detection speed improved significantly. Unfortunately, there is still a gap in detection accuracy between our proposed algorithm and YOLOv4, and we could not achieve a significant improvement in detecting incorrectly and correctly masked targets due to the dataset’s imbalance. Moreover, we only tested the algorithm’s detection speed on a personal portable laptop; we did not deploy it on mobile edge devices or realistically simulate mask detection with mobile devices in public places. In future work, we will continue to expand our dataset and improve the network structure, aiming to enhance detection accuracy while ensuring detection speed by optimizing our model and incorporating more advanced techniques.

Author Contributions

Funding acquisition, Y.T.; methodology, X.S.; validation, X.S., F.M. and Z.W.; resources, X.S.; data curation, F.M.; writing—original draft preparation, X.S.; writing—review and editing, Z.W.; supervision, Y.T.; project administration, Y.T. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hubei Province, China (grant number 2021CFB584).

Data Availability Statement

The data are contained within the article.

Acknowledgments

The authors are very grateful to the editors and reviewers for their suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liang, M.; Gao, L.; Cheng, C.; Zhou, Q.; Uy, J.P.; Heiner, K.; Sun, C. Efficacy of face mask in preventing respiratory virus transmission: A systematic review and meta-analysis. Travel Med. Infect. Dis. 2020, 36, 101751. [Google Scholar] [CrossRef] [PubMed]
  2. Parente, C.; Montgomery, R.; Berry, L.; Mahida, N. Impact of universal mask wearing in reducing healthcare-associated respiratory virus infections in haematology patients. J. Hosp. Infect. 2022, 119, 192–193. [Google Scholar] [CrossRef] [PubMed]
  3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  8. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  9. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  10. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  11. Wang, Z.; Wang, G.; Huang, B.; Xiong, Z.; Hong, Q.; Wu, H.; Yi, P.; Jiang, K.; Wang, N.; Pei, Y. Masked face recognition dataset and application. arXiv 2020, arXiv:2003.09093. [Google Scholar]
  12. Cabani, A.; Hammoudi, K.; Benhabiles, H.; Melkemi, M. MaskedFace-Net–A dataset of correctly/incorrectly masked face images in the context of COVID-19. Smart Health 2021, 19, 100144. [Google Scholar] [CrossRef] [PubMed]
  13. Vrigkas, M.; Kourfalidou, E.A.; Plissiti, M.E.; Nikou, C. FaceMask: A New Image Dataset for the Automated Identification of People Wearing Masks in the Wild. Sensors 2022, 22, 896. [Google Scholar] [CrossRef] [PubMed]
  14. Niu, Z.; Qin, T.; Li, H.; Chen, J. Improved algorithm of RetinaFace for natural scene mask wear detection. Comput. Eng. Appl. 2020, 56, 1–7. [Google Scholar]
  15. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5203–5212. [Google Scholar]
  16. Wang, Y.; Ding, H.; Li, B.; Yang, Z.; Yang, J. Mask wearing detection algorithm based on improved YOLOv3 in complex scenes. Comput. Eng. 2020, 46, 12–22. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  18. Zhu, J.; Wang, J.; Wang, B. Lightweight mask detection algorithm based on improved YOLOv4-tiny. Chin. J. Liq. Cryst. Disp. 2021, 1525–1534. [Google Scholar] [CrossRef]
  19. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  20. Kocacinar, B.; Tas, B.; Akbulut, F.P.; Catal, C.; Mishra, D. A Real-Time CNN-Based Lightweight Mobile Masked Face Recognition System. IEEE Access 2022, 10, 63496–63507. [Google Scholar] [CrossRef]
  21. Wei, M.; Zhou, T.; Ji, Z.; Zhang, X. Mask Wearing Detection in Complex Scenes Based on Mask-YOLO. J. Appl. Sci. 2022, 40, 93–104. [Google Scholar]
  22. Duan, X.; Chen, H.; Lou, H.; Bi, L.; Zhang, Y.; Liu, H. A more accurate mask detection algorithm based on Nao robot platform and YOLOv7. In Proceedings of the 2023 IEEE 3rd International Conference on Power, Electronics and Computer Applications (ICPECA), Shenyang, China, 29–31 January 2023; pp. 1295–1299. [Google Scholar]
  23. Endris, A.; Yang, S.; Zenebe, Y.A.; Gashaw, B.; Mohammed, J.; Bayisa, L.Y.; Abera, A.E. Efficient Face Mask Detection Method Using YOLOX: An Approach to Reduce Coronavirus Spread. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 568–573. [Google Scholar]
  24. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  25. Bo, J.W.; Zhang, C.T. Lightweight mask wearing detection algorithm based on YOLOv3. Electron. Meas. Technol. 2021, 044, 105–110. [Google Scholar]
  26. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  27. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  28. Zhang, L.P.; Li, Z.H.; Tang, Y.L. Light-YOLOv2 mask wearing detection method based on transfer learning. Electron. Meas. Technol. 2022, 45, 112–117. [Google Scholar]
  29. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  31. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  32. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Scaled-yolov4: Scaling cross stage partial network. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
  33. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  34. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the Icml, Atlanta, GA, USA, 16–21 June 2013; p. 3. [Google Scholar]
  35. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Figure 1. YOLOv4-tiny network structure.
Figure 2. FPN network structure in YOLOv4-tiny.
Figure 3. CBAM attention module.
Figure 4. Mosaic data enhancement.
Figure 5. YOLOv4-tiny-3 network structure.
Figure 6. The network structure of the algorithm in this paper.
Figure 7. Improved mosaic data enhancement.
Figure 8. P-R curves for different algorithms with different targets.
Figure 9. Comparison of YOLOv4-tiny and our algorithm for partial-scene mask detection.
Table 1. Experimental platform.

Name                Parameters
Operating System    Windows 10
CPU                 Intel(R) Core(TM) i5-9400 CPU @ 2.90 GHz
GPU                 GeForce GTX 1650
RAM                 16 GB
CUDA                10.0
Keras               2.1.5
TensorFlow          1.13.2
Python              3.6
Table 2. Precision, recall and AP values for different targets of different networks.

Model            | face (%)                 | face_mask (%)            | incorrect_mask (%)
                 | Precision  Recall  AP    | Precision  Recall  AP    | Precision  Recall  AP
YOLOv4           | 93.33      79.33   85.05 | 95.91      95.52   98.21 | 96.79      95.37   97.96
YOLOv4-tiny      | 86.21      74.23   77.64 | 94.97      94.84   98.01 | 96.72      93.26   97.27
YOLOv4-tiny-3    | 92.30      72.25   80.93 | 95.35      94.57   98.10 | 97.35      92.63   96.87
Literature [18]  | 87.49      76.30   78.65 | 95.18      96.47   97.83 | 95.54      94.74   96.62
Ours             | 93.18      75.68   83.01 | 96.16      95.12   98.51 | 96.51      93.26   97.64
Table 3. mAP, FPS and the number of parameters for different networks.

Model            mAP (%)   FPS (frame·s−1)   Parameters (M)
YOLOv4           93.74     11.29             64.0
YOLOv4-tiny      90.97     78.43             5.9
YOLOv4-tiny-3    91.97     73.26             6.1
Literature [18]  91.04     67.20             9.18
Ours             93.05     70.22             6.2
Table 4. Comparison of ablation experiment results.

ID   Predictive Feature Layer   CBAM   Focal Loss   Mosaic   mAP (%)
1    √                          ×      ×            ×        91.25
2    ×                          √      ×            ×        91.57
3    √                          √      √            ×        91.97
4    √                          √      ×            √        92.42
5    √                          √      √            √        93.05

