Article

Enhancing Workplace Safety: PPE_Swin—A Robust Swin Transformer Approach for Automated Personal Protective Equipment Detection

1 School of Computer Science and Engineering, Central South University, Changsha 410017, China
2 School of Electronics and Information, Yangtze University, Jingzhou 434023, China
3 College of Computer and Information Sciences, Imam Muhammad Ibn Saud Islamic University (IMSIU), Riyadh 11673, Saudi Arabia
4 School of Computing, Gachon University, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2023, 12(22), 4675; https://doi.org/10.3390/electronics12224675
Submission received: 5 September 2023 / Revised: 22 October 2023 / Accepted: 6 November 2023 / Published: 16 November 2023
(This article belongs to the Special Issue Application of Machine Learning in Big Data)

Abstract

Accidents occur in the construction industry as a result of non-compliance with personal protective equipment (PPE), and the diversity of construction environments makes automated PPE detection difficult. Traditional image detection models such as convolutional neural networks (CNNs) and vision transformers (ViTs) struggle to capture both local and global features in construction safety scenes. This study introduces PPE_Swin, a new approach for automating the detection of personal protective equipment (PPE) in the construction industry. By combining global and local feature extraction using a self-attention mechanism based on Swin-Unet, we address challenges related to accurate segmentation, robustness to image variations, and generalization across different environments. To train and evaluate our system, we compiled a new dataset, which enables more reliable and accurate detection of PPE in diverse construction scenarios. Our approach achieves a remarkable 97% accuracy in detecting workers with and without PPE, surpassing existing state-of-the-art methods. This research presents an effective solution for enhancing worker safety on construction sites by automating PPE compliance detection.

1. Introduction

The construction industry is notorious for its heightened vulnerability to fatal accidents, which raises significant concerns about the safety and well-being of its employees. Construction sites, often sprawling over large areas with a high concentration of workers, pose significant challenges in establishing an effective safety monitoring system. Human monitoring capabilities fall short of meeting this challenge [1]. Given these difficulties and the inherent risks associated with construction activities, it becomes imperative to gain a comprehensive understanding of the factors contributing to accidents and to take additional mitigation measures.
Non-compliance with personal protective equipment (PPE) regulations, notably, can have a significant impact on worker safety. This significance is underscored by the International Labor Organization (ILO) statistics, which report approximately 60 million accidents annually due to negligence in using personal protective equipment (PPE) [2,3]. These statistics highlight the critical importance of emphasizing and enforcing PPE protocols within construction environments.
In addition, concerns regarding worker comfort and the perception that workers find personal protective equipment uncomfortable contribute to non-compliance. In this regard, the risks faced by workers are amplified [4]. The use of improper personal protective equipment (PPE) during the 2017–2018 reporting period was identified as one of the most serious violations of the rules implemented by the Occupational Safety and Health Administration (OSHA) [5,6,7].
In order to meet this challenge, innovative and automated approaches are required. Manual monitoring is not only time-consuming but also costly and impractical, particularly when dealing with a large workforce [8]. To identify and monitor the use of personal protective equipment by workers, researchers have studied a variety of techniques, including computer vision, deep learning, and sensor-based systems [9]. Integrating AI technologies offers significant advantages in terms of efficiency, accuracy, and real-time monitoring. Using computer vision algorithms, artificial intelligence systems can detect whether workers are wearing personal protective equipment (PPE), such as hard hats and vests, by analyzing video feeds or images. As a result of deep learning, detection capabilities are further enhanced by the ability to recognize specific PPE items, ensuring compliance with safety regulations. Research in these areas is still at a very early stage, despite the potential of artificial intelligence to automate PPE detection [10,11]. As a result of background research, it is evident that it is vital to accurately detect personal protective equipment, as misidentification or under-detection can have severe consequences. There are still challenges associated with accurate segmentation, robustness to image variation, and generalization to different environments. It is therefore imperative that PPE detection systems be reliable and robust [12,13].
By developing a system with maximum robustness and consistent performance under varying environmental conditions, this study aims to overcome these limitations. This study introduces a new dataset named PPE-dataset in order to provide a detection system that has improved generalization capabilities. There are 1109 images in this dataset depicting regular laborers and fieldworkers wearing personal protective equipment in various environments. Our main objective is to develop a model that can effectively manage image variations through precise segmentation. This study aims to improve the accuracy and reliability of PPE detection while minimizing the risks associated with inadequate protection. The Swin-Unet architecture combines transformer and encoder–decoder structures for image segmentation [14,15].
For segmentation purposes, the Swin-Unet architecture is designed to effectively extract global features from an image. By using a hierarchical structure, the image is divided into different scales or levels. Each scale captures information about the entire image context, allowing it to understand the broader context and relationships between various parts of the image. To analyze the relationships between different parts of an image, Swin-Unet uses a self-attention mechanism inspired by transformer models. As a result of this self-attention mechanism, the model can give greater weight to relevant global information as it processes the image. For tasks such as segmentation, it is capable of attending to distant regions and understanding how they relate to local features. By breaking the image into different scales, Swin-Unet pays attention to important global details and how they are interconnected with smaller details. As a result, the model is able to comprehend the larger picture, which makes it useful for tasks such as identifying PPE compliance in construction environments.
This study aims to address the critical issue of safety compliance through the development of a robust image analysis system. Our objective is to accurately identify images in which individuals are not wearing proper personal protective equipment. Our ultimate goal is to provide a simple class outcome for each image, indicating the absence of essential personal protective equipment: the label is set to 1 for workers who do not wear personal protective equipment, 2 for workers who wear PPE correctly, and 0 for the background. This outcome supports workplace safety and strict PPE compliance, thereby minimizing risk and safeguarding individuals in potentially hazardous environments.
To overcome all these challenges, we used a self-attention mechanism based on the Swin-Unet model for extracting global features and image segmentation, while the Unet-based encoder and decoder architecture was used to extract local features for image detection. These models are well suited for detecting PPE on construction sites due to their ability to identify and localize items accurately within images. Their robustness and efficiency enable accurate detection and segmentation of PPE items in various environmental settings. This cross-domain approach ensures the model’s effective generalization and adaptability to different site conditions, ultimately enhancing worker safety and reducing risks associated with inadequate PPE compliance.
The authors of this study strive to successfully implement a robust and generalized system that ensures accurate safety gear detection on construction sites, even in challenging conditions such as rain and haze. To this end, this study introduces an anchor-free training architecture for detecting workers with and without PPE on construction sites. By leveraging artificially generated conditions, the proposed model demonstrates heightened robustness and adaptability in handling challenging scenarios, thereby surpassing other state-of-the-art methods for detecting personal protective equipment.

2. Related Work

Several studies have examined the safety and quality concerns in the construction industry, particularly the heightened vulnerability to fatal accidents. The issue of non-compliance with personal protective equipment (PPE) regulations has emerged as an important concern that affects both safety and quality. According to the International Labor Organization, approximately 60 million accidents resulting from PPE negligence occur every year, underlining the importance of effective enforcement of PPE. It is therefore necessary to develop innovative automated approaches to address this issue, such as computer vision systems and deep learning systems that can accurately detect the use of personal protective equipment.

Role of Computer Vision and Sensor-Based Technologies

Researchers have delved into various techniques such as computer vision, deep learning, and sensor-based systems to accurately identify and monitor workers’ PPE usage [11]. Sensor-based PPE detection systems can incorporate a variety of sensors such as RFID, proximity sensors, or wearable technology [12]. These systems provide a number of advantages, including enhanced accuracy in controlled environments and the ability to detect PPE usage even in low-light or obscured environments. The use of sensor-based systems is often less intrusive than the use of camera-based systems, which may alleviate privacy concerns. Sensor-based PPE detection systems are, however, typically less versatile and adaptable. They often require dedicated hardware and infrastructure, which can be expensive to install and maintain. They may not be as effective at identifying a wide variety of PPE types, and their real-time monitoring capabilities may not be as robust as computer vision systems.
On the other hand, computer vision systems are able to provide real-time feedback and can handle a wide range of personal protective equipment, making them an appropriate choice for dynamic environments [13,16]. As a result of their capability to process a large amount of visual data, they are also able to detect personal protective equipment with high accuracy, thus reducing the risk of false positives and negatives. In spite of this, computer vision systems do have some limitations [17]. In challenging environments with poor visibility, they may struggle due to the lack of consistent lighting conditions [18,19,20]. Further, they require significant computational power, which can be a limitation in applications with limited resources [21,22,23,24].
Machine learning algorithms are utilized by computer-vision-based PPE detection systems to analyze visual data from cameras and to detect instances of the use of personal protective equipment (PPE) [25,26]. There are several distinct advantages to using these systems. Their versatility makes them easy to adapt to different settings since they rely primarily on existing infrastructure for cameras. In situations where it may be impractical or costly to install dedicated sensors, this adaptability is of particular value [27]. Despite the potential of machine learning to automate PPE detection, research in these areas is still in its early stages [28]. Background research underscores the significance of accurately detecting PPE, as misidentification or under-detection can lead to severe consequences [29,30]. Challenges associated with accurate segmentation, robustness to image variation, and generalization to different environments persist [31]. Therefore, a reliable and robust PPE detection system is needed.
This study aims to overcome these limitations by developing a system that maximizes robustness and maintains consistent performance under diverse environmental conditions. Recognizing the need for a detection system with improved generalization capabilities, this study introduces a new dataset named PPE-dataset. This dataset comprises 1109 images of regular laborers and fieldworkers wearing PPE in various environments.
The primary goal is to focus on precise segmentation and to develop a model capable of effectively managing image variations. This study intends to improve PPE detection accuracy and reliability while minimizing the risks associated with inadequate protection. Recent studies have introduced a transfer learning architecture called Swin-Unet, which combines transformer and encoder–decoder structures for image segmentation [14,15]. The Swin transformer component excels at extracting global features for image segmentation, while Unet encoders and decoders effectively extract local features for image detection [16,32]. These models are well suited for detecting PPE on construction sites due to their ability to identify and localize items accurately within images. Their robustness and efficiency enable accurate detection and segmentation of PPE items in various environmental settings [33]. This cross-domain approach ensures the model’s effective generalization and adaptability to different site conditions, ultimately enhancing worker safety and reducing the risks associated with inadequate PPE compliance.
The authors of this study strive to successfully implement a robust and generalized system that ensures accurate safety gear detection on construction sites, even in challenging conditions such as rain and haze. To this end, the study introduces an anchor-free training architecture for detecting workers with and without PPE on construction sites. By leveraging artificially generated conditions, the proposed model demonstrates heightened robustness and adaptability in handling challenging scenarios, thereby surpassing other state-of-the-art methods for detecting personal protective equipment. In Table 1, we summarize these studies of computer vision techniques used to detect personal protective equipment (PPE) on construction sites. The methods employed in these studies include RCNNs, YOLO v3, YOLOv5, and Mask RCNN.
Addressing the limitations of existing studies, we present a novel approach to detecting and identifying personal protective equipment (PPE) for workers. Our proposed method involves a hybrid model that combines an encoder–decoder architecture with self-attention mechanisms, specifically based on Unet and Swin-Unet. This approach enables the accurate detection of workers wearing and not wearing PPE through segmentation. By integrating photometric changes into the images and leveraging transfer learning, our method effectively addresses challenges related to precise PPE segmentation, robustness to image variations, and generalization across diverse environments. Moreover, our automated monitoring and detection system promotes occupational health and safety, creating a safer working environment. The results of our study show better performance than other state-of-the-art methods. Furthermore, the model’s impressive performance on construction sites raises intriguing possibilities for its application in other industries that require personal protective equipment (PPE). The use of specialized safety gear is necessary in many sectors, including manufacturing, healthcare, agriculture, and laboratory work. The model can be applied to enhance safety and efficiency in a variety of industries by leveraging its capabilities. The ability of the system to identify PPE use and to ensure compliance might prove invaluable in reducing workplace accidents, improving the overall quality of the job, and protecting the health and well-being of employees.

3. Materials and Methods

3.1. Dataset Preparation

Effective data gathering and preparation are crucial for training machine learning algorithms and achieving precise identification. While sensor-based PPE detection systems have been explored in previous studies, this study focuses on a computer vision (CV)-based system due to its advantages of lower cost, simplicity, and ease of use in field settings. Notably, at a construction site, detecting both hard hats and vests is essential for safety compliance. To make the CV system more practical for an alarm system, this study emphasizes the detection of individuals wearing both hard hats and vests, as well as those not wearing either of these safety gear items. The focus is on detecting the presence or absence of hard hats and vests in different construction environments, rather than distinguishing between different hard hat colors or other classes. This study addresses a significant issue highlighted by the International Labor Organization, which reports an annual occurrence of approximately 60 million on-site accidents resulting from non-compliance with PPE regulations, leading to an estimated 2.3 million fatalities [2]. To address this challenge, this study utilizes online image scraping to gather a dataset of over 1600 images featuring regular laborers and fieldworkers wearing personal protective equipment from various construction sites to test the model’s robustness and generalizability across different settings. Using web scraping to collect a dataset, such as the 1109 PPE images from various online sources, also has several biases and limitations. In addition, web scraping relies on publicly available data, which may lead to an unbalanced dataset that does not adequately represent the diversity of construction scenarios and the use of personal protective equipment. It is important to note that web-scraped images can vary significantly in terms of quality and consistency, which can affect the ability of the model to generalize to real-world scenarios. Ensuring accurate labeling and providing contextual information about construction sites, tasks, and safety regulations can be challenging.
Using the PPE_Swin self-attention mechanism and encoder–decoder-based Unet architecture, this proposed study has the potential to effectively overcome these challenges that have historically posed limitations. Due to its unique architecture, combining global and local feature extraction techniques, resilience to image variations, and innovative training techniques, it is positioned as a promising solution for addressing these challenges. This combination of capabilities allows PPE_Swin to enhance its real-world applicability and to excel in the accurate detection of PPE compliance, surpassing the limitations of previous methods.
The dataset includes two classes: “Person with PPE” denoting individuals wearing both hard hats and vests, and “Person without PPE” representing individuals not wearing either item (Figure 1). The importance of hard hats as a primary safety gear and vests for visual observation of workers at a distance is emphasized. The images in the dataset have a standardized size of 512 × 512 pixels, ensuring consistency in the data used for training and evaluation.
The preparation and collection of datasets play an essential role in the training and identification of machine learning algorithms. This study utilized a rigorous approach to curate a high-quality dataset. Initially, a filtering process was conducted to remove any irrelevant images that were watermarked or deemed unrelated to the research objective. This step was crucial to maintain the dataset’s integrity and to eliminate potential sources of bias or noise. Following the filtering process, over 1600 images were carefully selected, ensuring diversity in terms of different individuals, scenarios, and environmental conditions. This diversity is essential to enhance the model’s ability to generalize and accurately identify the target classes of regular laborers and fieldworkers wearing PPE. A detailed overview of the object-wise separation analysis can be found in Table 2, which provides an overview of the distribution of regular laborers and fieldworkers within each subset of the dataset. This verification step ensures that each dataset subset adequately represents the two classes, further enhancing the model’s ability to accurately identify and differentiate between regular laborers and fieldworkers wearing PPE.
Figure 2 illustrates the general research methodology pipeline used in this study, from the environmental setup to the development of the deep learning model. In our proposed method, the dataset was collected through web crawling in the first step. In the second step, the collected data were divided into different classes, and then the annotation process was performed on it. After that, the models were trained on the given dataset, and different validation checks were performed after training.

3.2. The Proposed Framework

This study introduces a novel PPE_Swin model designed to achieve precise detection and segmentation of personal protective equipment (PPE) worn by construction site workers. This is accomplished through a two-stage approach that combines the strengths of different architectures. In the first stage, a Unet-based encoder–decoder architecture is utilized for PPE detection. This architecture effectively captures and extracts features from the input images. By leveraging the encoder–decoder structure, the model can generate high-quality feature representations at various scales, leading to enhanced accuracy in PPE detection. The second stage involves the integration of a Swin-Unet-based self-attention mechanism for PPE segmentation. Image segmentation is essential for tasks such as PPE detection since it precisely locates objects within an image, extracts regions of interest, and provides detailed spatial information. A segmentation map is used in this study to identify PPE by highlighting areas where protective equipment should be present. By analyzing these segmented regions, we are able to determine whether proper personal protective equipment is being worn. By assigning semantic meaning to each pixel, it is possible to differentiate between individuals and their personal protective equipment, facilitating the class outcome: a worker wearing PPE is set as 1, a worker not wearing PPE is set as 2, and the background is set as 0.
The self-attention mechanism allows the model to focus on relevant regions and to capture fine-grained details during the segmentation process. By incorporating the self-attention mechanism into the Swin-Unet architecture, the model can effectively segment individuals wearing PPE. By combining the benefits of both, the proposed PPE_Swin model can extract high-quality features from the input data. This hybrid feature extractor enables the model to capture both local and global features, resulting in superior performance for both PPE detection and person segmentation tasks (Figure 3).
In this study, both training and testing are conducted using the proposed PPE_Swin architecture for anchor-free object detection, specifically for personal protective equipment (PPE) detection. In this hybrid model, an annotated image is first fed into the encoder–decoder architecture. The encoder path extracts high-level features and contextual information through convolutional and pooling layers, which progressively reduce the spatial dimensions while capturing diverse visual patterns using learned filters. Mathematically, this is represented in Equations (1) and (2), where $F_{enc}$ represents the feature maps in the encoder and $F_{pool}$ represents the pooled feature maps.
$F_{enc} = \sigma(W_{enc} * F_{enc-1} + b_{enc})$ (1)
$F_{pool} = \mathrm{MaxPooling}(F_{enc})$ (2)
Following each convolutional layer, element-wise Rectified Linear Unit (ReLU) activation is applied to the feature maps, introducing non-linearity and enhancing representational power. Max pooling is then used in the pooling layers to reduce spatial resolution while preserving important information. This down-samples the feature maps, focusing on salient features. The contracting path decreases the spatial dimensions while increasing the number of feature maps, capturing abstract representations. The processed data then proceed to the decoder path.
The decoder in PPE_Swin restores the spatial information lost during down-sampling and reconstructs a map of the original input size. It uses up-sampling and convolutional layers to gradually recover spatial details and reach the original resolution. At each up-sampling step, the feature maps from the corresponding encoder path are concatenated, creating skip connections. These connections combine local and global information, enhancing localization and segmentation accuracy. By leveraging skip connections, the decoder utilizes low-level details from the encoder for precise object boundary delineation. Mathematically, this is represented in Equation (3), where $F_{up}$ represents the up-sampled feature maps in the decoder and $F_{dec-1}$ the feature maps from the previous decoder stage.
$F_{up} = \mathrm{upsample}(F_{dec-1})$ (3)
Convolutional layers within the decoder refine feature maps, capturing fine-grained details and improving segmentation accuracy. ReLU activation introduces non-linearity, enhancing the network’s ability to model complex relationships and patterns. In the expanding path, the spatial resolution is restored while the number of feature maps is reduced, resulting in a segmentation map that matches the original input image. Skip connections, or residual connections, directly connect corresponding layers in the encoder and decoder paths. They transfer low-level details from the encoder to the decoder, enhancing detection accuracy. Mathematically, this is represented in Equation (4), where $F_{concat}$ represents the concatenated feature maps, $F_{enc}$ the encoder feature maps, and $F_{up}$ the up-sampled feature maps.
$F_{concat} = \mathrm{concatenate}(F_{enc}, F_{up})$ (4)
By concatenating encoder feature maps with up-sampled feature maps in the decoder, the network combines high-level information with fine-grained spatial details. This fusion aids accurate localization and detection. Skip connections alleviate information loss during down-sampling and enhance information flow, resulting in improved segmentation performance. The detection stage of the PPE_Swin model based on the encoder–decoder architecture ends with a 1 × 1 convolutional layer followed by an activation function such as sigmoid or softmax. This layer generates the final segmentation map, assigning probabilities or class labels to each pixel (e.g., PPE-wearing person or non-PPE-wearing person). The 1 × 1 convolutional layer aggregates information, allowing the network to learn complex relationships and combine features across scales for accurate segmentation. The activation function (sigmoid or softmax) is applied to obtain pixel-level probabilities or class labels. Equations (5) and (6) show this, where $F_{dec}$ represents the feature maps in the decoder, $\hat{y}$ represents the predicted segmentation map, $*$ denotes the convolution operation, $\sigma$ represents the ReLU activation function, and $W$ and $b$ represent learned weights and biases, respectively.
$F_{dec} = \sigma(W_{dec} * F_{concat} + b_{dec})$ (5)
$\hat{y} = \mathrm{sigmoid}(W_{out} * F_{dec} + b_{out})$ (6)
In PPE_Swin, the self-attention mechanism employs its contracting and expanding paths, along with skip connections, to capture both local and global information for accurate image detection. Figure 4 shows the outputs of the Unet-based encoder and decoder architecture. By leveraging the hierarchical features learned through the encoder and decoder, the model can effectively detect and segment PPE regions in images.
This model’s encoder–decoder structure, combined with the self-attention mechanism, makes it a powerful tool for detecting PPE and segmenting images in scenarios where inadequate protection poses risks, as sketched below.
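To make the encoder–decoder stage concrete, the following minimal Keras sketch illustrates the structure described by Equations (1)–(6): stacked convolutions with ReLU, max pooling in the encoder, up-sampling with skip connections in the decoder, and a final 1 × 1 convolution with softmax over the three pixel classes. It is an illustrative simplification with assumed layer widths and depth, not the exact PPE_Swin configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU, as in Equation (1).
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(512, 512, 3), num_classes=3):
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolution + max pooling (Equations (1) and (2)).
    e1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D()(e1)
    e2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(e2)

    # Bottleneck.
    b = conv_block(p2, 128)

    # Decoder: up-sampling + skip connections (Equations (3) and (4)).
    u2 = layers.UpSampling2D()(b)
    d2 = conv_block(layers.Concatenate()([e2, u2]), 64)
    u1 = layers.UpSampling2D()(d2)
    d1 = conv_block(layers.Concatenate()([e1, u1]), 32)

    # 1x1 convolution with softmax for the per-pixel class map (Equations (5) and (6)).
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(d1)
    return Model(inputs, outputs)

model = build_unet()
```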

3.3. Segmentation Process through Self-Attention Mechanism

In this proposed framework, the Swin-Unet-based self-attention mechanism is utilized in the second stage of PPE detection to improve accuracy. This model incorporates the Swin transformer, which enhances self-attention and feature representation, resulting in superior performance and robustness to input size changes. Swin-Unet achieves state-of-the-art results in image segmentation tasks by combining the strengths of Swin transformer and Unet models [14]. The training data are effectively utilized by enhancing images through a randomized process of changing dimensions and applying HSV transformations to increase resilience.
In this proposed framework, we take the output image from the encoder–decoder architecture (based on Unet) and pass it through the Swin-Unet based self-attention mechanism to improve accuracy. Swin-Unet combines the advantages of two state-of-the-art models, Swin transformer and Unet, to achieve higher accuracy rates than traditional Unet models. Swin transformer has better modeling capabilities for spatial dependencies, while Unet is better at handling small object segmentation. Swin-Unet combines these strengths to produce better results. Additionally, the image enhancement step used in Swin-Unet training can further improve the model’s performance in all areas and utilize the training data more effectively.
In this approach, the training data are effectively utilized by improving the images in phases. As a first step, we randomly adjust the image’s width and height. To achieve variations in the image dimensions, the resizing is performed within a range of 70% to 130% of the original dimensions, so that the images still fit within a 512 × 512 frame. The resized dimensions are calculated as follows. Let the original image dimensions be $width\_orig$ and $height\_orig$, and randomly select scaling factors $scale\_width$ and $scale\_height$ in the range [0.7, 1.3]. The resized dimensions are then given by Equations (7) and (8).
$new\_width = width\_orig \times scale\_width$ (7)
$new\_height = height\_orig \times scale\_height$ (8)
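A minimal Python sketch of this random resizing step is shown below; the placement of the resized image in the 512 × 512 canvas (top-left corner, zero padding) is our assumption, since the paper only states that the resized image still fits within the frame.

```python
import random
import cv2
import numpy as np

def random_resize(image, low=0.7, high=1.3, frame=512):
    # Equations (7) and (8): scale width and height independently by factors in [0.7, 1.3].
    h, w = image.shape[:2]
    new_w = int(w * random.uniform(low, high))
    new_h = int(h * random.uniform(low, high))
    resized = cv2.resize(image, (new_w, new_h))
    # Place the result into a fixed 512 x 512 frame (top-left placement and
    # zero padding are assumptions). A 3-channel image is assumed.
    canvas = np.zeros((frame, frame, 3), dtype=image.dtype)
    canvas[:min(frame, new_h), :min(frame, new_w)] = resized[:frame, :frame]
    return canvas
```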
Although HSV is typically associated with RGB (three-channel) images, we apply a simplified version of this transformation to grayscale images. Although this modified transformation is not the traditional application of HSV, it is intended to enhance features and increase the model’s resilience. Here is a simplified mapping from grayscale values to HSV in Equations (9)–(11).
$\mathrm{Hue}\ (H) = \mathrm{Grayscale\ Value}$ (9)
$\mathrm{Saturation}\ (S) = \mathrm{Constant}\ (\mathrm{e.g.,}\ 1)$ (10)
$\mathrm{Value}\ (V) = \mathrm{Grayscale\ Value}$ (11)
The hue value of each pixel is then perturbed by a random offset with an amplitude of 0.1 to introduce random fluctuations. Equation (12) is applied to each pixel in the HSV image.
$\mathrm{Hue}\ (H) \mathrel{+}= \mathrm{Random\ Value\ in\ Range}(-0.1,\ 0.1)$ (12)
The hue values are then wrapped around to the [0, 1] range, since the hue component typically operates within this range. This technique was originally motivated by medium-resolution remote sensing imagery, where a large region is covered by multiple images acquired at different times, producing color variations between them. The image enhancement step of network training in Swin-Unet can therefore improve the model’s performance and make more effective use of the training data, and the improved HSV conversion increases the resilience of the model.
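The following short sketch illustrates the simplified grayscale-to-HSV mapping and hue perturbation of Equations (9)–(12); it assumes grayscale values normalized to [0, 1] and is an illustration rather than the authors' exact implementation.

```python
import numpy as np

def jitter_grayscale_hue(gray):
    # gray: float array with values in [0, 1].
    hue = gray.astype(np.float32)          # Equation (9): H = grayscale value
    sat = np.ones_like(hue)                # Equation (10): S = constant (e.g., 1)
    val = gray.astype(np.float32)          # Equation (11): V = grayscale value
    # Equation (12): perturb hue by a random value in [-0.1, 0.1], then wrap to [0, 1].
    hue = (hue + np.random.uniform(-0.1, 0.1, size=hue.shape)) % 1.0
    return np.stack([hue, sat, val], axis=-1)
```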
The Swin transformer serves as the feature extraction block in this stage. Since the minimal structural unit of the Swin transformer is a 4 × 4 image patch, the patch partition procedure reduces the input to one-quarter of its original height and width while expanding the channel dimension sixteen-fold. The Swin transformer block can be viewed as a sequence of two modules, which differs from the traditional multi-head self-attention (MSA) module used in ViT (Figure 5). A Swin transformer block contains a regular window-based multi-head self-attention (W-MSA) module and a shifted window-based multi-head self-attention (SW-MSA) module, each followed by a two-layer multilayer perceptron (MLP) with Gaussian error linear unit (GELU) nonlinearity. A layer norm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module. Mathematically, the specific calculation rules are shown in Equations (13)–(16).
$\hat{z}^{l} = \mathrm{W\text{-}MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}$ (13)
$z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l}$ (14)
$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}(\mathrm{LN}(z^{l})) + z^{l}$ (15)
$z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$ (16)
where $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and of the MLP module of block $l$, respectively.
$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$
where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$; here, $Q$ denotes the query, $K$ the key, and $V$ the value matrices. $M^{2}$ and $d$ represent the number of patches in a window and the dimension of the query or key, respectively. The values in $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.
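For illustration, a minimal NumPy sketch of the window attention computation with a relative position bias term is given below; Q, K, V, and B are assumed to be already-projected matrices for a single window and a single attention head.

```python
import numpy as np

def window_attention(Q, K, V, B):
    # Q, K, V: (M*M, d) query/key/value matrices for the tokens of one window.
    # B: (M*M, M*M) relative position bias looked up from the learned table B_hat.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B                 # QK^T / sqrt(d) + B
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # SoftMax over keys
    return weights @ V                                # weighted sum of values
```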

3.4. Training Process

To train the PPE_Swin model, the parameters were set based on empirical trials. In the initial warm-up period, five epochs were conducted with a learning rate of 0.01. These warm-up epochs allowed the network to gradually adapt to the dataset. After experimenting with different values, the authors found that this setup performed well. After the warm-up epochs, the learning rate was adjusted according to a cosine learning rate schedule, following Equation (17). A cosine learning rate schedule was chosen because of its rapidly decreasing nature: the learning rate reaches a minimum value before increasing rapidly again, and this resetting acts as a simulated restart of the learning process. The input image size for training the PPE_Swin model was set to 512 × 512, and a total of 50 epochs were used to train the model.
$lr = 0.5 \times \left(1.0 + \cos\left(\pi \times \frac{iteration}{total\ iterations}\right)\right)$ (17)
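The warm-up followed by cosine decay can be expressed, for example, as a Keras learning-rate callback. The sketch below assumes a constant learning rate during the five warm-up epochs and applies Equation (17) over the remaining epochs, which is one plausible reading of the schedule described above.

```python
import math
import tensorflow as tf

WARMUP_EPOCHS = 5   # warm-up period described in the text
TOTAL_EPOCHS = 50   # total training epochs
BASE_LR = 0.01      # warm-up learning rate

def cosine_schedule(epoch, lr):
    # Constant warm-up, then cosine decay following Equation (17).
    if epoch < WARMUP_EPOCHS:
        return BASE_LR
    progress = (epoch - WARMUP_EPOCHS) / max(1, TOTAL_EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

lr_callback = tf.keras.callbacks.LearningRateScheduler(cosine_schedule)
# model.fit(..., epochs=TOTAL_EPOCHS, callbacks=[lr_callback])
```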
To facilitate the segmentation process, the chosen dataset was meticulously tagged using Label Studio, a tool specifically designed for data annotation and labeling. Using Label Studio, each image was labeled to indicate whether it belonged to the regular laborer class or the fieldworker class. This annotation step ensured the dataset had accurate and consistent labels, enabling the training algorithm to effectively learn the distinguishing characteristics of each class. To properly evaluate and validate the trained model, the dataset was divided into three subsets: training, validation, and testing sets. A random split strategy was employed, allocating 70% of the dataset for training and 15% each to the validation and testing sets. Table 3 illustrates the dataset split between the two classes “Person with PPE” and “Person without PPE”. This partitioning ensures that the model’s performance is assessed on both seen and unseen data, allowing for a comprehensive evaluation of its generalization capability. Furthermore, a class-wise separation analysis was conducted to verify the suitable representation of the two classes across the dataset subsets. This analysis ensured that each subset had a balanced distribution of regular laborer and fieldworker images, reducing the risk of bias and ensuring that the model learns and performs well for both classes.

To create a segmentation map and to determine PPE compliance, we follow a few simple steps. Initially, we set a flag to false. We then examine each segmented region within the map, looking for specific criteria, such as color, shape, or size, that indicate the absence of personal protective equipment. If any segment meets the criteria, the flag is set to “true” and the assessment is terminated. When all segments have been evaluated, we check the flag; if it is “true”, we return “worker without PPE” to indicate that PPE is not worn, and if it remains “false”, we return “worker wearing PPE” to indicate that PPE is worn. This procedure simplifies the identification of PPE absence in a segmentation map, enabling binary outcomes for safety compliance decisions and alerts that can be tailored to a variety of requirements.
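The compliance check described above can be sketched as a small helper that scans the segmentation map for the "no PPE" class; the label value used here (2 for a worker without PPE, following Section 3.2) and the function name are illustrative assumptions.

```python
import numpy as np

NO_PPE_LABEL = 2   # assumed label for "worker without PPE", following Section 3.2

def compliance_message(segmentation_map):
    # Scan the segmented regions; if any segment carries the "no PPE" label,
    # set the flag and stop, otherwise report that PPE is worn.
    flag = False
    for label in np.unique(segmentation_map):
        if label == NO_PPE_LABEL:
            flag = True
            break
    return "worker without PPE" if flag else "worker wearing PPE"
```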

Platform Parameters and Computational Resources

To test the algorithm’s performance on specific data, the hardware setup includes an Nvidia GeForce RTX 3090 GPU, accessed remotely via Xshell, and an Intel Core i5-7200U CPU running at 2.50 GHz (up to 2.71 GHz) on a Linux OS. The essential ecosystem consists of Python 3, Pip, OpenCV, TensorFlow, and Keras, while Label Studio is used for image annotations. The Notepad++ text editor and Jupyter Notebook are used for initial development. The Swin-Unet and Unet models use transfer learning on the Keras vision transformer backend. The final output layer is modified to output two classes, namely “person with PPE” and “person without PPE”.
Before cloning and installing the model with Keras as the backbone, we configured the GPU environment on Xshell for pre-processing. Once the installation was confirmed, the dataset was prepared for input and subsequently used to train each model individually while establishing parameters. Table 4 shows the description of model parameters that were established during the training time.

3.5. Data Preprocessing

Multiple geometric changes are applied to the images, including mosaic augmentation and random affine transformations during test-time augmentation. The random affine transformation involves rotation on both axes within the range of −10 degrees to +10 degrees. The translation is performed on both the X and Y axes within the range of 0.4 to 0.6. Scaling is applied on both axes with a range of 0.1 to 2, while shear is performed with the same amount on both axes within the range of −2 to +2. These values are uniformly randomly selected within their respective ranges. In addition to geometric changes, photometric alterations are applied by adjusting the brightness, contrast, hue, and saturation of the images. The Random Horizontal Flip and Random Resized Crop strategies are also employed as part of the data preprocessing. The authors performed the test-time augmentation to enhance the robustness and generalization of the model predictions during inference. By applying these transformations to the input images during testing, the model can account for potential variations in the test data and provide more reliable predictions.
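As an illustration, the rotation and scaling components of the random affine transformation can be composed into a single OpenCV warp, as sketched below; the interpretation of the 0.4–0.6 translation range as a fraction of the image size re-centred around 0.5 is an assumption, and shear and the photometric changes are omitted for brevity.

```python
import random
import cv2

def random_affine(image):
    # Random rotation (-10 to +10 degrees) and scaling (0.1 to 2.0) about the image centre.
    h, w = image.shape[:2]
    angle = random.uniform(-10, 10)
    scale = random.uniform(0.1, 2.0)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    # Translation drawn from the 0.4-0.6 range, interpreted relative to the image size.
    M[0, 2] += (random.uniform(0.4, 0.6) - 0.5) * w
    M[1, 2] += (random.uniform(0.4, 0.6) - 0.5) * h
    return cv2.warpAffine(image, M, (w, h))
```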

Augmentation Technique to Add Effects

In our study, we have demonstrated that incorporating various photometric changes during the training phase enhances the robustness of our machine learning model to recognize PPE objects under challenging real-world conditions. In addition to improved generalization, our model gains the ability to recognize objects effectively across diverse environmental contexts, even those that are not explicitly present in the training data, when exposed to artificial images representing scenarios such as haze, rain, and low light. Additionally, this approach mitigates overfitting, which is a common problem in machine learning. Due to this adaptation, our model is less reliant on specific, narrow features and is better suited to recognize general objects. As a result, we find that it promotes the development of more robust feature extraction techniques, such as prioritizing contour and edge-based information when working in low-light conditions.
Furthermore, our approach enhances the model’s ability to adapt to unforeseen conditions. It becomes more capable of handling dynamic and unpredictable real-world scenarios, which is particularly important for applications where object recognition is of paramount importance, such as autonomous vehicles and surveillance systems.
The generation of artificial images provides a controlled environment for testing and benchmarking the model’s robustness. Algorithm 1 shows how artificial rain effects are created in the dataset images. It is important to emphasize that the study focuses on real-life challenges and does not include the affected images in the training dataset.
Algorithm 1. Creating a rain effect on the PPE dataset
Requirement for generating random rain droplets: image size, max rain slant, RD length (raindrop length)
Initialize: rain droplets = Φ, i = 0
Ndroplets = number of rain droplets
Affected image = create a 2D random array with dimensions matching those of the input PPE image
    while i < Ndroplets do
        if max rain slant < 0 then
            a = random (max rain slant, image size [1])
        else
            a = random (max rain slant, image size [1] − max rain slant)
        end if
        b = random (0, image size [0] − RD length)
        rain droplets = rain droplets ∪ {(a, b)}
        i = i + 1
    end while
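A Python sketch of Algorithm 1 is given below. The droplet count and streak length are illustrative defaults, the color triplet is reordered for OpenCV's BGR convention, and the average blur kernel of size 5 and the 20% brightness reduction follow the parameter description given later in this subsection.

```python
import random
import cv2

def add_rain(image, n_droplets=500, max_slant=10, drop_length=20,
             drop_color=(243, 197, 182)):   # BGR order for the RGB (182, 197, 243) shade
    rainy = image.copy()
    h, w = image.shape[:2]
    slant = random.randint(-max_slant, max_slant)
    for _ in range(n_droplets):
        # Pick a start point so the slanted streak stays inside the frame (Algorithm 1).
        x = random.randint(abs(slant), w - abs(slant) - 1)
        y = random.randint(0, h - drop_length - 1)
        cv2.line(rainy, (x, y), (x + slant, y + drop_length), drop_color, 1)
    rainy = cv2.blur(rainy, (5, 5))                        # average kernel of size 5
    return cv2.convertScaleAbs(rainy, alpha=0.8, beta=0)   # ~20% brightness reduction
```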
The initial step in creating a hazy image uses a 2D array with the same dimensions as the original image. A noise effect is used to enhance the realism of the artificially generated image: the severity of the noise is controlled by the noise amplitude, while the noise balance determines its brightness. By incorporating noise into the image, variations similar to those found in real hazy images are simulated. The noise is added uniformly across all color channels to preserve the image’s color properties. Finally, the pixel values of the hazy image are adjusted by comparing the maximum pixel values of the hazy image and the original image. This adjustment ensures that the hazy image reaches the desired haze intensity while preserving important details from the original image. Algorithm 2 shows how this haze representation is created.

For the rain effect, the algorithm incorporates various parameters to simulate raindrops on an image. The slant of the raindrops is achieved by randomly selecting values within the range of (−10, 10). The height and width of a raindrop are set to 1.5% and 0.15%, respectively, of the dimensions of the original image. The color of a raindrop is set to RGB (182, 197, 243), a specific shade chosen by the authors. To create a realistic rainwater effect, the algorithm draws a predefined number of raindrops on the image, positioned and sized according to these parameters. To further enhance the effect, the image is smoothed with an average kernel of size 5, producing a slight blur. Additionally, to mimic the shadowy appearance often associated with rainy conditions, the brightness of the image is reduced by 20% relative to the original, as rainwater images tend to have lower overall brightness.

To produce a low-light image, the brightness of the original image is deliberately reduced by 50%, yielding a carefully crafted low-light conditioned image.
Algorithm 2. Creating a haze effect on the PPE dataset
Requirement: dataset image, noise amplitude, noise balance
Affected image (EI) = create a 2D random array with dimensions matching those of the input PPE image
Affected image with noise (EI) = noise amplitude × EI + noise balance
Maximum input (MI) = max (PPE image)
Affected image with haze (EH) = PPE image + EI + 250
Maximum haze input = max (EH)
Final haze-affected image = EH × MI / maximum haze input
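A corresponding Python sketch of Algorithm 2 is shown below; the noise amplitude and balance defaults are placeholders, the haze layer is added uniformly to all color channels, and the final rescaling against the original maximum follows the description above.

```python
import numpy as np

def add_haze(image, noise_amplitude=2.5, noise_balance=100.0):
    # image: H x W x 3 uint8 array. The amplitude/balance defaults are illustrative only.
    img = image.astype(np.float32)
    noise = np.random.rand(*img.shape[:2])               # 2D random haze layer (Algorithm 2)
    noise = noise_amplitude * noise + noise_balance      # control severity and brightness
    hazy = img + noise[..., None]                        # add uniformly across color channels
    hazy = hazy * img.max() / hazy.max()                 # align with the original intensity range
    return np.clip(hazy, 0, 255).astype(np.uint8)
```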

4. Quantifying the Performance of the Model

The performance of our model in the context of semantic segmentation is evaluated using two key metrics, mean intersection over union (MIoU) and F1 score (Dice coefficient).
Mean intersection over union (MIoU): In our study, MIoU provides insight into the model’s ability to segment objects accurately across different categories. In order to calculate the mean, the intersection over union of each class is measured and the mean is then determined. We use Equation (18) to calculate the MIoU in this study. In the intersection over union (IoU), the predicted segmentation mask is compared with the ground truth for a particular class. In general, a higher IoU indicates a more accurate delineation of objects. The MIoU provides a holistic view of segmentation accuracy by averaging IoU values across all classes.
$MIoU = \frac{1}{N}\sum_{i=1}^{N} IoU_{i}$ (18)
$IoU_{i} = \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}$ (19)
In Equation (19), $IoU_{i}$ represents the intersection over union for class $i$; $TP_{i}$ denotes the true positives for class $i$, while $FP_{i}$ and $FN_{i}$ denote the false positives and false negatives for class $i$, respectively. The F1 score (Dice coefficient) of each class was measured analogously from the same per-class counts.
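The two metrics can be computed directly from per-class pixel counts, for example as in the short sketch below (an illustrative implementation, not the authors' evaluation code).

```python
import numpy as np

def mean_iou(pred, target, num_classes=3):
    # Per-class IoU from true positives, false positives, and false negatives
    # (Equation (19)), averaged over all classes (Equation (18)).
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        ious.append(tp / (tp + fp + fn + 1e-9))
    return float(np.mean(ious)), ious
```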

5. Results

The results obtained from our proposed model, PPE_Swin, demonstrate remarkable accuracy and speed in detecting and segmenting PPE. During the training process, we conducted multiple iterations known as epochs; each epoch represents a complete pass through the entire training dataset, during which the model updates its parameters based on the calculated loss and gradient descent optimization. By training the model for 50 epochs, we allowed it to gradually learn and refine its internal representations of PPE objects. The initial loss value of 3.95 reflects the discrepancy between the predicted outputs of the model and the ground truth labels at the beginning of the training process, as shown in Figure 6. As the model learned from the data and adjusted its parameters, the loss steadily decreased throughout the epochs. This loss reduction indicates the model’s ability to capture and understand the underlying patterns and features associated with PPE objects in the training data. By the end of the training process, the loss had decreased substantially and the model reached an overall accuracy of 97%. A lower loss value indicates a higher level of agreement between the predicted outputs and the ground truth labels, demonstrating the effectiveness of our model in capturing the intricate details and characteristics of PPE objects. The substantial reduction in loss achieved through training is a testament to the model’s learning capacity and its ability to optimize its parameters to minimize the discrepancy between predictions and ground truth.
This indicates the effectiveness of the training process and the model’s ability to learn and adapt to the task at hand. To evaluate the accuracy of the model, we compared its predicted results with the ground truth labels, ensuring a rigorous assessment of its performance. Figure 7 shows the MIoU scores, indicating that the proposed model displayed exceptional segmentation performance during our extensive evaluation. Figure 8 visually represents the comparison between the predicted segmentation outputs and the corresponding ground truth masks. This comprehensive evaluation approach provides a clear indication of the model’s ability to accurately identify and delineate PPE in diverse images from construction sites.
This information provides valuable insight into the model’s ability to accurately identify objects and regions within the images, including workers who are wearing personal protective equipment (PPE). The MIoU score of 0.9689 for the background class, which represents regions in the images that are not associated with workers or safety gear, is impressively high. This score shows that the model segments the background accurately and distinguishes it from other objects within the image; in other words, it reflects how well the model recognizes and isolates non-relevant areas.
Further, the MIoU score of 0.9705 for workers with PPE demonstrates the model’s superior segmentation capability in identifying instances where safety gear is appropriately worn. This high score underscores the model’s precision in highlighting regions where PPE is present, contributing to enhanced workplace safety. Similarly, the MIoU score of 0.9698 for workers without PPE showcases the model’s effectiveness in recognizing areas where safety attire is absent. This ability is instrumental in identifying potential safety hazards and ensuring compliance with safety regulations.
Table 5, presenting the results of the PPE_Swin model, provides information on three classes: “Background class”, “Worker wearing personal protective equipment”, and “Worker without personal protective equipment”. For the “Background class”, the model achieved a mean IoU of 0.9689 and a Dice coefficient (F1 score) of 0.9705. For the “Worker wearing PPE” class, it achieved a mean IoU of 0.9612 and an F1 score of 0.9698. With a mean IoU of 0.9698 and a Dice coefficient (F1 score) of 0.9724, it excelled in identifying situations where personal protective equipment is not worn properly. As a result, our proposed PPE_Swin model has demonstrated exceptional performance and reliability in detecting and monitoring PPE compliance. Overall, these MIoU scores demonstrate the robustness and accuracy of the proposed model for semantic segmentation tasks related to worker safety attire detection. It is capable of segmenting a wide range of classes, including background regions, workers with personal protective equipment, and workers without PPE, making it a useful tool for ensuring workplace safety and compliance with safety regulations. In addition to demonstrating the model’s reliability and precision, these results represent an important advancement in the field of computer vision for safety-critical applications.
The high accuracy achieved by our model is a testament to its robustness and reliability in real-world scenarios. It successfully handles various challenges encountered in construction environments, including varying lighting conditions, occlusions, and complex backgrounds. By training on a diverse dataset that reflects these real-world complexities, our model exhibits excellent generalization capabilities, allowing it to accurately detect and segment PPE across a wide range of challenging situations. The combination of high accuracy and fast processing speed makes our proposed model well suited for real-time applications, where timely detection of PPE is crucial for ensuring worker safety. The accurate segmentation results obtained by PPE_Swin enable precise identification of PPE items, facilitating compliance with safety regulations and enhancing overall safety practices on construction sites.
The accuracy of the model was further evaluated by detecting persons without personal protective equipment (PPE) and persons with PPE under different dataset combinations and in the presence of various environmental effects, including haze, rain, and brightness. The results in Table 6 indicate the model’s performance on each class for the different dataset effects. For persons without PPE, the model achieved an accuracy of 96.1% under the haze effect, indicating its ability to identify individuals who are not wearing the necessary protective gear in hazy conditions. It also performed well under the rain effect, with an accuracy of 97.1%, showcasing its robustness in detecting individuals without PPE in rainy environments. Under the brightness effect, the model achieved an accuracy of 95.2%, demonstrating its capability to distinguish persons without PPE under varying levels of brightness. When detecting persons with PPE, the model showed higher accuracy across all environmental effects: 97.2% under the haze effect, 97.6% under the rain effect, and 96.9% under the brightness effect. These results highlight the model’s effectiveness in detecting both persons with and without PPE across various environmental conditions. The model exhibits a high level of accuracy, providing reliable identification of individuals in different scenarios. By considering the dataset variations and the effects of environmental factors, the model can adapt and make accurate predictions in real-world settings, ensuring proper safety compliance.
It is important to note that the model’s performance may vary under different environmental effects due to the visual distortions caused by these factors shown in Table 6. However, the achieved accuracies demonstrate the model’s robustness and capability to overcome these challenges to provide accurate detection of individuals with and without PPE. Overall, the study’s findings confirm the model’s effectiveness and reliability in identifying persons with and without PPE under different dataset combinations and environmental effects. This knowledge can be utilized to further optimize the model’s performance and to enhance workplace safety measures by ensuring proper PPE compliance in various challenging environments.

5.1. Comparison with the State-of-the-Art Studies

In this study, we compared the performance of our proposed model, PPE_Swin, with state-of-the-art techniques for PPE detection in computer vision. Table 7 provides a comprehensive overview of the comparative results, including the author, year of publication, objective, method used, and corresponding accuracy.
The author Q. Feng, in 2018, focused on detecting non-hard-hat use (NHU) using RCNNs, achieving an impressive accuracy of 94.9%. This method employed region-based convolutional neural networks (RCNNs) and demonstrated promising results in accurately identifying instances of NHU. Another study, presented by Venkata Santosh in 2020, aimed to detect PPE across four different classes using YOLOv3, an established object detection algorithm. Their model achieved an accuracy of 96%, demonstrating its effectiveness in accurately identifying and classifying PPE items. In 2021, Zijian Wang extended the scope of PPE detection to multiple classes using YOLOv5, a variant of the YOLO algorithm. Their model achieved an accuracy of 86.55%, providing insights into the detection of various PPE items in diverse contexts. S. Mathur introduced a method in 2023 for detecting both workers and PPE items using Mask RCNN, a powerful instance segmentation technique. Their model achieved an accuracy of 88.60%, showcasing its ability to accurately localize and classify workers and PPE objects simultaneously. In comparison, our proposed model, PPE_Swin, exhibited superior performance in terms of accuracy and efficiency. After extensive training, our model achieved an impressive accuracy of 97% in detecting and segmenting PPE objects. PPE_Swin leverages a novel architecture that combines attention mechanisms and transformer models, enabling it to capture intricate details and features crucial for accurate PPE detection.

5.2. Comparison with the State-of-the-Art Models with the Same Dataset of PPE

In this study, we also compared various state-of-the-art models for PPE detection, including RCNN, Mask R-CNN, YOLOv3, and YOLOv5, with our proposed model on the same dataset, as shown in Table 8. The objective was to compare their effectiveness in detecting PPE across different classes. We divided the dataset into training, validation, and testing sets with proportions of 70%, 15%, and 15%, respectively. The RCNN model achieved an F1 score (Dice coefficient) of 0.89 in identifying PPE items; this region-based approach showed promising performance in PPE detection. Mask R-CNN, which incorporates instance segmentation, improved on RCNN with a score of 0.91, as its ability to precisely delineate the boundaries of PPE objects contributed to its performance. YOLOv3 and YOLOv5 yielded similar results, each scoring 0.93; these models employ a single-shot detection approach that enables fast and efficient detection of PPE items.
However, our proposed model outperformed all the other models, achieving an F1 score of 0.97 and accurately detecting and segmenting PPE items. Its superior performance can be attributed to its architecture and training process, which effectively capture the fine-grained details and features needed for precise PPE detection. We further evaluated the models' ability to distinguish between people wearing PPE and those without PPE. Following the distribution in Table 3, the training, validation, and testing proportions of the data points were 40%, 5%, and 5% for individuals with PPE and 30%, 10%, and 10% for individuals without PPE.
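For clarity, the per-class mean IoU and Dice coefficient (F1 score) values of the kind reported in Tables 5, 9, and 10, and aggregated in Table 8, can be computed along the lines of the sketch below. The mask encoding (0 = background, 1 = worker with PPE, 2 = worker without PPE) and the `pred_masks`/`true_masks` arrays are assumptions, not the authors' released code.

```python
# Hedged sketch of per-class mean IoU and Dice (F1) over a set of predicted and
# ground-truth segmentation masks (integer label maps of identical shape).
import numpy as np

def iou_and_dice(pred, true, cls):
    # Binary masks for one class; handle the empty-class edge case explicitly.
    p, t = (pred == cls), (true == cls)
    inter = np.logical_and(p, t).sum()
    union = np.logical_or(p, t).sum()
    iou = inter / union if union else 1.0
    denom = p.sum() + t.sum()
    dice = 2 * inter / denom if denom else 1.0
    return iou, dice

def mean_scores(pred_masks, true_masks, num_classes=3):
    # Average IoU and Dice per class across the whole evaluation set.
    ious, dices = np.zeros(num_classes), np.zeros(num_classes)
    for p, t in zip(pred_masks, true_masks):
        for c in range(num_classes):
            i, d = iou_and_dice(p, t, c)
            ious[c] += i
            dices[c] += d
    n = len(pred_masks)
    return ious / n, dices / n
```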
The comparative results indicate that our proposed model achieved the highest accuracy for both classes, demonstrating its efficacy in accurately detecting PPE and distinguishing between individuals wearing and not wearing PPE. This highlights the importance of our model in promoting workplace safety by identifying instances of non-compliance with PPE protocols. The findings from this comparison underscore the significance of advanced models and training techniques in achieving accurate and reliable PPE detection. Our proposed model’s superior performance paves the way for improved safety measures and enhanced compliance with PPE requirements in various industries. It is important to note that the accuracy results presented here are based on the specific dataset and experimental setup used in our study. Further evaluation and validation on larger and more diverse datasets would be beneficial to gain a more comprehensive understanding of the model’s performance in real-world scenarios. The system’s adaptability and effectiveness can be enhanced through ongoing updates and refinements once it has been deployed. In order to keep the model up to date with the latest PPE standards, equipment, and industry-specific nuances, regular updates are essential. In addition, it is imperative to ensure that the model remains proficient in recognizing and assessing compliance with PPE requirements through continuous training. Training can involve feeding the model with new data, industry-specific information, and evolving safety regulations.
It is also important to periodically re-evaluate the model against new state-of-the-art models and methods. Machine learning and computer vision are constantly evolving fields, and models with improved capabilities are developed regularly; to maintain a high level of performance, the deployed system should be benchmarked against these newer models to assess its competitiveness. Depending on the pace of advances in the field and the specific needs of the industry, re-evaluation may be required annually or on a schedule aligned with regulatory updates. The deployment and maintenance of a PPE compliance system should therefore be viewed as a continuous process of regular updates, continuous training, and periodic evaluation against the latest technology and safety standards, keeping the system at the forefront of safety and compliance in its field.

6. Ablation Studies

We conducted ablation studies to assess the impact and effectiveness of the individual components of our PPE detection model. We found that the self-attention mechanism significantly improved robustness under various image conditions. These studies provide critical insights into the model's components, capabilities, and limitations.

6.1. Comparing Effectiveness of Correct Usage of PPE Detection without Self-Attention Mechanism-Based Swin-Unet Architecture

We first evaluated the PPE detection model with the Swin-Unet-based self-attention mechanism (S-A-M) removed, relying only on the encoder–decoder-based Unet. Even without S-A-M, the model performed strongly across the PPE classes, as shown in Table 9. For background detection, it achieved a mean IoU of 0.9291 and an F1 score of 0.9327. It also distinguished well between “Worker not wearing PPE” and “Worker wearing PPE”, with mean IoU values of 0.9231 and 0.9164 and F1 scores of 0.9354 and 0.9261, respectively. These results underscore that the encoder–decoder branch alone already provides effective PPE detection and safety monitoring.
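For illustration, a minimal encoder–decoder Unet with a skip connection, the kind of branch evaluated in this ablation, could look like the sketch below. This is not the authors' exact network; the channel widths, depth, and class count are assumptions, and the "improved skip connections" of the full model are reduced here to a single concatenation.

```python
# Illustrative sketch of an encoder-decoder Unet branch with one skip connection,
# i.e., the configuration evaluated without the self-attention mechanism.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU: the basic Unet building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUnet(nn.Module):
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)          # 64 = 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                       # encoder feature kept for the skip
        b = self.enc2(self.pool(s1))            # bottleneck features
        d1 = self.dec1(torch.cat([self.up(b), s1], dim=1))  # skip connection
        return self.head(d1)                    # per-pixel class logits

# logits = TinyUnet()(torch.randn(1, 3, 256, 256))   # -> shape (1, 3, 256, 256)
```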

6.2. Effectiveness of Correct Usage of PPE Detection without Improved Skip-Connection-Based Unet Architecture

To further evaluate our correct-use-of-PPE detection model, we conducted an experiment using only the self-attention mechanism within the Swin-Unet architecture, excluding the encoder–decoder-based Unet branch. The self-attention mechanism alone yielded significant improvements in PPE detection across a variety of image conditions.
Table 10 presents detailed performance metrics for this configuration across three classes: “Background class”, “Worker wearing PPE”, and “Worker without PPE”. For the “Background class”, the model achieved a mean IoU of 0.9421 and a Dice coefficient (F1 score) of 0.9477, indicating its robustness in segmenting background regions. For “Worker wearing PPE”, it attained a mean IoU of 0.9467 and an F1 score of 0.9521, demonstrating its accuracy in identifying proper PPE usage. For “Worker without PPE”, it achieved a mean IoU of 0.9367 and an F1 score of 0.9489, highlighting its proficiency in recognizing instances where PPE is not worn. These results underscore the model's competence in detecting PPE usage and its reliability for safety monitoring.
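To illustrate the window-based self-attention at the core of the Swin transformer block (Figure 5), the sketch below partitions a feature map into non-overlapping windows and applies multi-head self-attention inside each window. This is not the exact PPE_Swin implementation: the embedding dimension, window size, and head count are illustrative assumptions, and the shifted-window scheme and relative position bias of the full Swin block are omitted.

```python
# Illustrative sketch of window-based multi-head self-attention (one ingredient
# of a Swin transformer block), not the authors' exact implementation.
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim=96, window=7, heads=3):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        w = self.window
        # Partition the feature map into (B * num_windows, w*w, C) token sequences.
        x = x.view(B, H // w, w, W // w, w, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        out, _ = self.attn(x, x, x)              # self-attention within each window
        # Merge the windows back into the original spatial layout.
        out = out.view(B, H // w, W // w, w, w, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return out

# feats = torch.randn(1, 56, 56, 96)             # H and W divisible by the window size
# y = WindowSelfAttention()(feats)               # output has the same shape as the input
```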

7. Conclusions

The use of proper safety gear protects workers from accidents and serious injuries on construction sites. Each year, a significant number of workers suffer severe injuries on construction sites, with long-lasting consequences for the individuals involved. It is therefore imperative to ensure workers' safety through automatic and robust solutions that detect hazards in real time under different environmental conditions, and artificial intelligence (AI) offers promising ways to automate PPE detection and compliance. While research in this area is still at an early stage, many deep learning-based CNN and ViT models provide comprehensive solutions, but they are not robust enough to detect and generalize PPE under different environmental conditions or to extract both local and global features. In this study, the proposed PPE_Swin model uses a hybrid approach that combines the strengths of the Swin-Unet-based self-attention mechanism and the Unet-based encoder–decoder architecture, demonstrating strong performance in accurately identifying and segmenting persons with and without PPE in diverse construction site environments. The model exhibits enhanced robustness by extracting both local and global features, enabling it to handle environmental factors such as rain and haze. In comparison with other state-of-the-art methods, our model yields the highest accuracy of 97% in detecting workers with and without personal protective equipment. The authors believe the proposed approach holds the potential to significantly enhance worker safety and reduce the risks associated with inadequate protection.

Author Contributions

Conceptualization, M.R. and J.H.; Methodology, M.R., K.X. and J.H.; Software, M.R.; Validation, S.A.M., K.X., H.S.A. and J.H.; Formal Analysis, J.H., S.A.M., H.A.A.A. and K.X.; Investigation, M.R.; Resources, H.S.A. and W.J.O.; Data Curation, H.A.A.A. and W.J.O.; Writing—Original Draft Preparation, J.H. and M.R.; Writing—Review and Editing, S.A.M., K.X., H.S.A. and W.J.O.; Visualization, J.H. and M.R.; Supervision, J.H.; Project Administration, J.H., S.A.M. and K.X.; Funding Acquisition, H.S.A., H.A.A.A. and W.J.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-RG23148).

Data Availability Statement

The dataset will be provided on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shen, L.Y.; Bao, H.J.; Wu, Y.Z.; Lu, W.S. Using Bargaining-Game Theory for Negotiating Concession Period for BOT-Type Contract. J. Constr. Eng. Manag. 2007, 133, 385–392. [Google Scholar] [CrossRef]
  2. Standing, G. The ILO: An Agency for Globalization? Dev. Chang. 2008, 39, 355–384. [Google Scholar] [CrossRef]
  3. Lingard, H.; Rowlinson, S. Occupational Health and Safety in Construction Project Management; Taylor & Francis Ltd.: London, UK, 2004. [Google Scholar]
  4. Gambatese, J.; Hinze, J. Addressing construction worker safety in the design phase. In The Organization and Management of Construction; Elsevier: Amsterdam, The Netherlands, 1999. [Google Scholar]
  5. Wang, Z.; Wu, Y.; Yang, L.; Thirunavukarasu, A.; Evison, C.; Zhao, Y. Fast personal protective equipment detection for real construction sites using deep learning approaches. Sensors 2021, 21, 3478. [Google Scholar] [CrossRef] [PubMed]
  6. Reese, C.D.; Eidson, J.V. OSHA, Safety and health regulations for construction. In Handbook of OSHA Construction Safety and Health; CRC Press Inc.: Boca Raton, FL, USA, 2006. [Google Scholar]
  7. Behzadan, A.H.; Nath, N.D.; Akhavian, R. Artificial Intelligence in the Construction Industry. In Leveraging Artificial Intelligence in Engineering, Management, and Safety of Infrastructure; CRC Press Inc.: Boca Raton, FL, USA, 2022; pp. 348–379. [Google Scholar]
  8. Mneymneh, B.E.; Abbas, M.; Khoury, H. Vision-Based Framework for Intelligent Monitoring of Hardhat Wearing on Construction Sites. J. Comput. Civ. Eng. 2019, 33, 04018066. [Google Scholar] [CrossRef]
  9. Zhang, S.; Teizer, J.; Pradhananga, N.; Eastman, C.M. Workforce location tracking to model, visualize and analyze workspace requirements in building information models for construction safety planning. Autom. Constr. 2015, 60, 74–86. [Google Scholar] [CrossRef]
  10. Kelm, A.; Laußat, L.; Meins-Becker, A.; Platz, D.; Khazaee, M.J.; Costin, A.M.; Helmus, M.; Teizer, J. Mobile passive Radio Frequency Identification (RFID) portal for automated and rapid control of Personal Protective Equipment (PPE) on construction sites. Autom. Constr. 2013, 36, 38–52. [Google Scholar] [CrossRef]
  11. Zhang, H.; Yan, X.; Li, H.; Jin, R.; Fu, H. Real-Time Alarming, Monitoring, and Locating for Non-Hard-Hat Use in Construction. J. Constr. Eng. Manag. 2019, 145, 04019006. [Google Scholar] [CrossRef]
  12. Seo, J.; Han, S.; Lee, S.; Kim, H. Computer vision techniques for construction safety and health monitoring. Adv. Eng. Inform. 2015, 29, 239–251. [Google Scholar] [CrossRef]
  13. Han, S.; Lee, S. A vision-based motion capture and recognition framework for behavior-based safety management. Autom. Constr. 2013, 35, 131–141. [Google Scholar] [CrossRef]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  15. Yang, Y.; Zhang, X.; Guan, Q.; Lin, Y. Making Invisible Visible: Data-Driven Seismic Inversion with Spatio-Temporally Constrained Data Augmentation. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 4507616. [Google Scholar] [CrossRef]
  16. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  17. Cheng, T.; Teizer, J. Modeling Tower Crane Operator Visibility to Minimize the Risk of Limited Situational Awareness. J. Comput. Civ. Eng. 2014, 28, 04014004. [Google Scholar] [CrossRef]
  18. Shrestha, K.; Shrestha, P.P.; Bajracharya, D.; Yfantis, E.A. Hard-Hat Detection for Construction Safety Visualization. J. Constr. Eng. 2015, 2015, 721380. [Google Scholar] [CrossRef]
  19. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T.M.; An, W. Detecting non-hardhat-use by a deep learning method from far-field surveillance videos. Autom. Constr. 2018, 85, 1–9. [Google Scholar] [CrossRef]
  20. Mathur, S.; Jain, T. Segmenting Personal Protective Equipment Using Mask R-CNN. In Proceedings of the 2023 11th International Conference on Internet of Everything, Microwave Engineering, Communication, and Networks (IEMECON), Jaipur, India, 10–11 February 2023. [Google Scholar]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  22. Delhi, V.S.; Sankarlal, R.; Thomas, A. Detection of personal protective equipment (PPE) compliance on construction site using computer vision based deep learning techniques. Front. Built Environ. 2020, 6, 136. [Google Scholar] [CrossRef]
  23. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Zhang, X.; Sun, J. Object Detection Networks on Convolutional Feature Maps. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1476–1481. [Google Scholar] [CrossRef] [PubMed]
  25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  26. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer International Publishing: New York, NY, USA, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  31. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021. [Google Scholar]
  32. Alahmari, F.; Naim, A.; Alqahtani, H. E-Learning Modeling Technique and Convolution Neural Networks in Online Education. In IoT-enabled Convolutional Neural Networks: Techniques and Applications; River Publishers: Roma, Italy, 2023; pp. 261–295. [Google Scholar]
  33. Krichen, M. Convolutional Neural Networks: A Survey. Computers 2023, 12, 151. [Google Scholar] [CrossRef]
Figure 1. Dataset images of workers of different classes from different construction backgrounds. (A) Workers with PPE. (B) Workers without PPE.
Figure 2. Research method pipeline used for this study.
Figure 3. The framework of the proposed system for detection of PPE using PPE_Swin based on S-A-M.
Figure 4. Outputs from Unet-based encoder and decoder architecture (A) showing a person with PPE and (B) showing a person without PPE in blue and green.
Figure 5. Swin transformer block.
Figure 6. Training accuracy of PPE_Swin over the first 50 epochs.
Figure 7. MIOU for PPE detection. Class 1 represents background class. Class 2 represents worker with PPE, and Class 3 represents worker without PPE.
Figure 8. Output final results of segmentation from the image dataset.
Table 1. Literature on computer vision techniques to detect PPE on construction sites.
Author | Year | Objective | Method | Accuracy
Q. Fang [19] | 2018 | Detection of NHU | RCNNs | 94.9%
Venkata Santosh [22] | 2020 | Detection of PPE using four classes | YOLOv3 | 96%
Zijian Wang [5] | 2021 | Detection of PPE using different classes | YOLOv5 | 86.55%
S. Mathur [20] | 2023 | Detect workers and PPEs | Mask RCNN | 88.60%
Table 2. Dataset description.
Number of Classes | Number of Objects | Total No. of Objects | Number of Images
Person with PPE | 2402 | 5964 | 1109
Person without PPE | 3562 | |
Table 3. Distribution of data points between training, validation, and testing.
Dataset Division | Number of Classes | Training Set | Validation Set | Testing Set
Number of images | 1600 | 70% | 15% | 15%
Data points per class | Worker with PPE | 40% | 5% | 5%
 | Worker without PPE | 30% | 10% | 10%
Table 4. Model description and parameters.
Model | Depth | Width | Parameters (M) | Training Time (Hours)
PPE_Swin | 1.0 | 1.0 | 54.2 | 11
Table 5. Results of PPE_Swin model based on MIOU and F1 score.
Class | Mean IoU | Dice Coefficient (F1 Score)
Background class | 0.9689 | 0.9612
Worker wearing PPE | 0.9705 | 0.9698
Worker without wearing PPE | 0.9698 | 0.9724
Table 6. Performance of PPE_Swin in different dataset combinations.
Different Combinations of Dataset Effects | Person without PPE (%) | Person with PPE (%)
Haze effect | 96.6 | 97.2
Rain effect | 97.1 | 97.6
Brightness effect | 95.2 | 96.9
Table 7. Comparison with the state-of-the-art studies.
Author | Year | Objective | Method | Accuracy
Q. Fang [19] | 2018 | Detection of NHU | RCNNs | 94.9%
Venkata Santosh [22] | 2020 | Detection of PPE using four classes | YOLOv3 | 96%
Zijian Wang [5] | 2021 | Detection of PPE using different classes | YOLOv5 | 86.55%
S. Mathur [20] | 2023 | Detect workers and PPEs | Mask RCNN | 88.60%
Ours | 2023 | Detection of workers with and without PPE | Swin-Unet | 97%
Table 8. Comparison of state-of-the-art models with our proposed method.
Dataset | Model | F1 Score (Dice Coefficient)
PPE novel dataset | RCNNs | 0.89
 | Mask RCNN | 0.91
 | YOLOv3 | 0.93
 | YOLOv5 | 0.93
 | PPE_Swin | 0.97
Table 9. Results of encoder–decoder-based Unet model without self-attention mechanism.
Class | Mean IoU | Dice Coefficient (F1 Score)
Background class | 0.9291 | 0.9327
Worker wearing PPE | 0.9164 | 0.9261
Worker without wearing PPE | 0.9231 | 0.9354
Table 10. Results of detection of correct usage of PPE with S-A-M-based Swin-Unet and without encoder–decoder-based Unet.
Class | Mean IoU | Dice Coefficient (F1 Score)
Background class | 0.9421 | 0.9477
Worker wearing PPE | 0.9467 | 0.9521
Worker without wearing PPE | 0.9367 | 0.9489
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
