Article

Head-Integrated Detecting Method for Workers under Complex Construction Scenarios

Yongyue Liu, Zhenzong Zhou, Yaowu Wang and Chengshuang Sun
1 School of Civil Engineering, Harbin Institute of Technology, Harbin 150090, China
2 School of Urban Economics and Management, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
* Author to whom correspondence should be addressed.
Buildings 2024, 14(4), 859; https://doi.org/10.3390/buildings14040859
Submission received: 9 February 2024 / Revised: 19 March 2024 / Accepted: 19 March 2024 / Published: 22 March 2024
(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

Real-time detection of workers is crucial in construction safety management. Deep learning-based detection methods are valuable, but they are often challenged by missed targets or identity errors in complex scenarios. To address these limitations, previous research depended on re-training new models or building new datasets, which is prohibitively time-consuming and incurs high computing demands. However, we demonstrate that a better detection model does not necessarily require further re-training of weights; instead, a training-free model can achieve even better performance by integrating head information. In this paper, a new head-detecting branch (55 MB) is added to the Keypoint Region-based Convolutional Network (Keypoint R-CNN, 226 MB) without altering its original weights, allowing the less occluded head to aid in body detection. We also deployed motion information and anthropometric data through a post-processing module to calculate movement relationships. This study achieved an identity F1-score (IDF1) of 97.609%, recall (Rcll) of 98.173%, precision (Prcn) of 97.052%, and accuracy of 95.329%, making it a state-of-the-art (SOTA) method for worker detection. This exploration breaks the habitual dependence on re-training and accelerates the application of universal models, in addition to reducing the computational burden for most construction sites, especially in scenarios with insufficient graphics processing unit (GPU) resources. More importantly, this study effectively addresses occlusion challenges in the worker detection field, giving it practical significance.

1. Introduction

Construction workers encounter various job-related hazards, such as falls from heights, collisions with heavy equipment, and pedestrian accidents [1]. A report by the U.S. Department of Labor [2] reveals that the construction industry recorded 726 fatalities in 2021, many of which were caused by limited vision or the slow reactions of workers. Real-time monitoring of workers’ positions can help mitigate hazards [3] such as unauthorized activities or entering risk zones; therefore, U.S. employers must employ proper monitoring measures [4] to minimize hazards.
Over the past several years, the construction industry has witnessed a significant increase in the adoption of artificial intelligence and computer vision technologies. These cutting-edge solutions provide intelligent monitoring and automation capabilities, which are now deemed essential in this sector. Motion detection technology is a key player in achieving this objective, as it enables real-time monitoring of workers’ movements [3], thereby enabling automated alerts and intelligent scheduling. Consequently, the construction process becomes more streamlined and productive.
Options for detecting workers include manual observation, biosensors [5,6], and vision-based inspections [7]. Manual observation requires hiring professional supervisors who are competent in recognizing potential hazards [7], which incurs labor costs and results in low efficiency. Biosensors [8] can be integrated into wearable devices to collect medical or physical data, such as heart rate, blood pressure, body temperature, and acceleration [9], but such procedures can feel invasive and may raise privacy or ethical concerns. Deep neural network (DNN) methods are promising vision-based approaches; position data can be obtained by cameras without requiring sensors, which is cost-effective and efficient. However, these methods struggle to detect occluded or motion-blurred workers when construction scenarios are complex.
Complex construction scenarios typically involve cross-operations of workers, machinery, or facilities. This environment requires a high degree of collaboration and cooperation to ensure the smooth progress of the construction process. Unlike the labor mode of assembly-line factories, the construction site requires workers to perform irregular movements. Although DNN methods have proven valuable for worker detection, they may fail when faced with such scenarios. Figure 1 depicts four types of challenges [10] that can hinder detection accuracy, such as when workers are in close proximity within a crowded scene (‘intra-class occlusion’, ID = 4 at t3) or obstructed by a wheel loader (‘inter-class occlusion’, ID = 3 at t3). In addition, when the detection confidence score is below the threshold (ID = 1 at t1) or the worker exhibits unusual behavior or posture variation [11] (ID = 2 at t3), the bounding box may be missed.
Previous studies succeeded by developing more complicated networks or re-training extensively on new datasets; however, their effectiveness in addressing the challenges of complex scenarios was limited [12] by having to manually label new datasets [13], re-train entire neural networks [14,15], or switch to offline features [16]. Regardless of the chosen approach, they primarily relied on convolutional neural network (CNN) [10] outputs. Moreover, CNNs rely on the scaled feature map of a certain area in one image/frame; if this area contains characteristics of categories other than the worker class, the worker’s bounding box cannot be registered as covering the whole body. This is determined by the calculation procedures of the convolution operations. For example, in Figure 2, the worker near the “Bar Shop” barrier receives a whole-body detection by ResNet-50 [17] (a CNN variant) in the left frame (#94) but is heavily occluded by the fence baffle in the middle frame (#117), so the detected bounding box is limited to a small area of the upper body (a False Positive); the right frame (#117) shows the desired result in practice, which can be achieved by our proposed method.
The significance of this study lies in fostering better compatibility between pedestrian applications and worker applications in the field of object detection and enhancing the replicability of general detection models. Working conditions across construction sites can differ significantly, resulting in varying job responsibilities and organizational procedures. Additionally, the detection model’s training may not account for factors such as obscured views, similar clothing and background hues among workers, and unusual actions. Therefore, scholars formerly needed to build new datasets and models for weight re-training, and if the authors declined to publicly release the model weights, managers in the engineering field were almost unable to use the application. Our study endeavors to show that a universal model downloaded from GitHub [18] can be made relevant to this engineering domain with simple modifications.
The main objective of this study is to provide a low-coupling and easy-to-use method for identifying workers amidst intricate construction conditions. The contributions are threefold:
(1)
Adding a head branch to body-detection models in the construction field. Each branch can work with or without the other and can be replaced easily.
(2)
Demonstrating that a body branch designed for pedestrian detection can be directly used for worker detection without the need for re-training on a new dataset specific to the construction scenarios.
(3)
Addressing the worker detection challenges of inter/intra class occlusions, low confidence, and unusual behaviors by means of joint coordinates, trajectories, and anthropometric data.

2. Related Works

2.1. Object Detection with DNN

Object detection is an active research field. Benefiting from multiple publicly available datasets [19,20,21], researchers can constantly invent new models to strive for better evaluation metrics. Object detection involves using DNN to determine the region of interest (ROI) [22] as a bounding box, as shown by the two workers’ rectangles in the right frame (#117) of Figure 2. DNN [23] is a general term for deep learning networks, which extract features from frames and perform tasks, such as classification or regression, by constructing multi-layer networks.
In a DNN model, each layer’s operations can be CNN, Recurrent Neural Network (RNN) [23], fully connected, or another custom definition. A CNN increases the number of channels while reducing the size of the feature maps. However, this operation often leads to feature confusion and difficulty in separating overlapped objects under occlusion. Therefore, detectors that rely only on appearance features struggle to handle occlusions once such size reductions occur.
The detailed introduction of DNN models is beyond the scope of this article, but from the perspective of engineering applications, the models that can be publicly obtained and freely downloaded are our main concerns.
The general DNN models for person detection include Faster R-CNN [24], YOLO [25], CenterNet [26], and Transformer [27]. The YOLO series and CenterNet models need overall fine-tuning and lack flexibility regarding the addition of new CNN branches. The Transformer series involves even more trade-offs, such as billions of parameters and a several-day training period, requiring powerful computational resources that not everyone has access to [28]. In contrast, Faster R-CNN allows head branches to be added easily without affecting the original target detection, instance segmentation, and keypoint detection branches, avoiding the need for overall re-training.

2.2. Worker Detection Uniques

The universal detection models [24,29] mainly focus on pedestrian detection, because most of the scenes in their training datasets are captured on roads, as in PASCAL VOC [20], Microsoft COCO [19], and ImageNet ILSVRC [21]. Unlike pedestrian detection in the computer vision field, worker detection has its own unique characteristics:
  • Different purposes. Pedestrian detection is only for metrics competition [21], while worker detection is for safety management issues. Therefore, worker detection needs to effectively address position recognition under complex conditions, especially for occlusions.
  • Dissimilar durations. Pedestrian detection can be a collection of single frames in different scenes, while worker detection is based on continuous photography of certain projects. This provides us with the motion relationship between adjacent frames of the same worker.
  • Various densities. Some pedestrian detection datasets exhibit high density (62.1–226.6 persons per frame) [30], whereas worker datasets usually have lower density owing to limited operating space. This provides us with a relatively simple computing environment, avoiding the need to consider complex situations involving more than 100 workers in one frame.
  • Fixed perspectives. Worker datasets [11] are often captured from a top-down perspective to avoid hindering labor movements, offering better opportunities for detecting worker heads.
  • Behavior variations. Pedestrians mainly engage in walking activities, while workers’ behaviors are more complex, and they may bend, kneel, or squat [11]. This results in poor performance (low confidence scores or missed targets) of the universal detection DNN model in non-standing postures.

2.3. Head Related Detection

Some researchers [15,31,32,33] have observed that the head is more visible than the body from a top-down perspective. Consequently, they used head information to enhance the whole-body feature map.
Zhang et al. [32] proposed Joint SimOTA, a model based on YOLOX-x [34] (756 MB) that integrates head detection and body detection. Training YOLOX-x entails investing 42 h on two NVIDIA RTX3090s, and the operator cannot modify each branch separately. Moreover, although the head enhances overall accuracy, the head and body sometimes must be detected separately in construction scenarios when the detection content changes. Chi et al. [15] presented PedHunter, which leverages head information to improve the feature representation of the body in the backbone by introducing a mask-guided module that uses head location as an additional cue for perceiving occluded pedestrians. JointDet [33] is a relational learning system designed to eliminate false head positives and recover missing bodies using a relationship-discriminating module, but fine-tuning such a model requires 16 GTX 1080Ti GPUs.
The models mentioned above necessitate manual labeling of both the body and head of each person, which poses a challenge for scalability and reusability. When effective body and head weights are obtained from separate datasets, combining them is difficult. This gap impedes the implementation of cutting-edge computational models in the engineering field and can dampen the enthusiasm of engineering researchers.

2.4. Strengths and Limitations of SOTA Models

To provide a comprehensive analysis of the advantages and disadvantages of the eight current SOTA models and the proposed model, we present a concise summary in Table 1. This comparison demonstrates that when dealing with occlusion challenges in continuous monitoring scenarios, relying solely on the appearance features of a single frame is insufficient, except in cases where a fixed aspect ratio of the bounding box is assumed. However, by integrating both head appearance features and motion information from adjacent frames, we can effectively address complex challenges.

3. Methods

The proposed method contains three procedures, as shown in the blue dashed rectangles in Figure 3 and in Algorithm 1:
(1)
Download the Keypoint R-CNN [38] and head-detection model from GitHub. If one cannot find a proper head model, a fast-trained head model can be deployed, as described in Section 3.2. Extract the region proposal networks (RPN) and Fast R-CNN parts from the head model and rename them as RPN_head and head_branch.
(2)
Add RPN_head and head_branch to the original Keypoint R-CNN model without training. Enable this final DNN model to detect both the head and the entire body simultaneously; the results are saved as bounding boxes.
(3)
Feed all heads and bodies’ bounding boxes into the post-processing module, and implement ID association, linear interpolation, and refinement sequentially. The ID association relates to the object tracking method, which will be discussed in subsequent papers.
Algorithm 1: Pseudo-code of the proposed method
Precondition: the head-integrated Keypoint R-CNN model has been assembled following procedure (1), and the “.pth” weights file has been downloaded.
1: Input: the current frame #t (M workers actually appear in frames #0 to #t-1);
2: Feed the frame #t into the model and run, let the outputs contain N1 heads and N2 bodies:
The frame resolution = 1080 × 1920 × 3;
Feature Pyramid Network (FPN) sub-networks: outputs of five 256-channel feature maps;
RPN sub-networks: outputs of 257,796 anchors/proposals for heads or bodies;
Regions of interest (ROI) sub-networks: N1 heads and N2 bodies bounding boxes, respectively.
Suppose N1 ≠ N2:
the N1 head-index list = (0, 1, 2, 3, 4, 5, 6, 7, 8), the N2 body-index list = (0, 1, 2, 3, 4, 5).
3: ID association for (N2 bodies, N1 heads), detailed in our other papers:
Suppose the result = ((None, 3), (5, 6), (0, 0), (1, None), (3, 2), (2, 5), (4, 7), (None, 4), (None, 8)):
the 3rd, 4th, and 8th heads have no matched body; the 1st body has no matched head.
4: Linear interpolation:
for i in length(3rd, 4th, 8th, … heads) do:
    if the head[i] is in the view:
    head_bounding box [i, t] = function (head [i, t-1], head [i, t-2]);
for j in length(1st, … bodies) do:
    if the body[j] is in the view:
    body_bounding box [j, t] = function (body [j, t-1], body [j, t-2];
                                                                                    head [i, t-1], head [i, t-2]);
5: The Refinements networks:
Feed the features and proposals of frame #t into the networks of the head or body:
# features = from P2 to P5, total 83.6 MB. # proposals = Array of np(1000, 4 = t, l, w, h).
def forward (features: List[torch.Tensor], proposals: List[np.array]);
def predict_boxes (predictions: Tuple[torch.Tensor, torch.Tensor], proposals: List[Boxes]);
def predict_probs (predictions: Tuple[torch.Tensor, torch.Tensor], proposals: List[Boxes]);
Output the final bounding boxes with the confidence of the head or body.
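To make the procedures above concrete, the following minimal Python sketch mirrors the per-frame loop of Algorithm 1. It is a sketch only: detect_heads_and_bodies, associate_ids, interpolate_missing, refine_boxes, and resolve_worker_id are hypothetical placeholder functions standing in for the modules described in Sections 3.2, 3.3, and 3.4, not the released implementation.

```python
# Minimal sketch of the per-frame pipeline in Algorithm 1 (placeholder functions).
import cv2

def run_pipeline(video_path, model):
    history = {}                         # worker ID -> list of (head_box, body_box) per frame
    cap = cv2.VideoCapture(video_path)
    t = 0
    while True:
        ok, frame = cap.read()           # frame #t, e.g., 1080 x 1920 x 3
        if not ok:
            break
        # Step 2: run the head-integrated Keypoint R-CNN -> N1 heads, N2 bodies.
        heads, bodies = detect_heads_and_bodies(model, frame)
        # Step 3: ID association -> list of (body_index or None, head_index or None).
        pairs = associate_ids(bodies, heads, history)
        # Step 4: fill in missed heads/bodies from the two previous frames (Equations (1) and (2)).
        heads, bodies = interpolate_missing(pairs, heads, bodies, history, t)
        # Step 5: bounding-box regression + four-step rectification (Section 3.4.2).
        bodies = refine_boxes(model, frame, bodies, heads, history)
        for body_idx, head_idx in pairs:
            wid = resolve_worker_id(body_idx, head_idx, history)
            history.setdefault(wid, []).append(
                (heads[head_idx] if head_idx is not None else None,
                 bodies[body_idx] if body_idx is not None else None))
        t += 1
    cap.release()
    return history
```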

3.1. The Size of the Head and the Range of Motion

In continuous photography, body movement between frames is approximately linear, with a magnitude comparable to the anthropometric size of the head. The world record for the 100 m sprint is 9.58 s. When capturing a 10 m/s athlete at a rate of 30 frames per second (fps) with a regular camera, we obtain roughly 300 frames over the race, i.e., a movement of approximately 33 cm per frame. Workers’ average speed, however, is 1.38 m/s [39], corresponding to approximately 5 cm per frame. This 5~33 cm per-frame magnitude aligns with the head size (Figure 4) multiplied by a proper ratio. Given the requirement for prolonged and sustained labor, which naturally constrains their speed, the movement amplitude of construction workers can be measured in units of head width, as shown in Figure 4.
Since the size of a worker’s head relates to the range of their motion, we can use head size and motion to determine a full-body bounding box under heavy occlusion. Therefore, utilizing workers’ motion information is an effective way to implement worker detection.
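As a quick numerical check of the figures above, the per-frame displacement is simply speed divided by frame rate; the snippet below reproduces the 33 cm and 5 cm values and relates the worker case to an assumed head width of 0.16 m (an illustrative value, not taken from the paper).

```python
# Per-frame displacement = speed / frame rate (values from Section 3.1).
FPS = 30                        # frames per second of the camera
sprinter_speed = 10.0           # m/s, elite 100 m sprinter
worker_speed = 1.38             # m/s, average worker walking speed [39]
head_width = 0.16               # m, assumed head breadth (illustrative only)

print(sprinter_speed / FPS)                 # ~0.333 m -> about 33 cm per frame
print(worker_speed / FPS)                   # ~0.046 m -> about 5 cm per frame
print((worker_speed / FPS) / head_width)    # ~0.29 head widths of movement per frame
```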

3.2. Fast-Trained Head Branch

This study chooses Keypoint R-CNN for the addition of a head branch. Since workers usually wear safety helmets and their facial features may not be obvious, we have to train a new head-detecting branch by ourselves.
Fortunately, we found a ready-made dataset on GitHub, the Safety Helmet Wearing Dataset (SHWD) [40], which avoids any need for manual labeling of the workers’ head ground-truth. The SHWD, created for safety-helmet detection, contains 7581 images annotated in VOC format; thus, the annotations must be converted from XML to COCO JSON [19]. In these images, 9044 persons wear safety helmets while 111,514 do not. Because our study does not focus on helmet issues, the two categories were merged into a single head category for training purposes.
Subsequently, the backbone was frozen, and only the RPN and box branch were trained; training was performed according to the classical training process of Detectron2 [18]. After training, the box branch’s eight layers were extracted and re-named as the head branch; the RPN was then extracted, re-named as the RPN head, and added to the Keypoint R-CNN in parallel with the existing body branch for classification and bounding box regression. The modified Keypoint R-CNN was changed from 226 MB to 281 MB.
The training set comprised 5457 images with 86,200 heads, while the validation set comprised 607 images with 9925 heads. Furthermore, our model underwent 6800 iterations on an RTX3080 GPU, taking 1.25 h to achieve an AP50 of 81.24% (Table 2).
Moreover, we did not aim for a higher accuracy, because the AP50 for the person category under the COCO dataset was approximately 82.1% after 90,000 iterations. Refer to Figure 5 to observe the losses of head_branch throughout the iterations.
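A minimal Detectron2 training sketch for the head branch is given below, assuming the SHWD annotations have already been converted to COCO JSON at the hypothetical paths shwd_train.json and shwd_val.json; the frozen backbone, single head class, and 6800 iterations follow the description above, while the batch size and learning rate are illustrative assumptions.

```python
# Sketch of the fast head-branch training described above (Detectron2 API).
# Dataset names and file paths are hypothetical placeholders.
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

register_coco_instances("shwd_train", {}, "shwd_train.json", "shwd/images")
register_coco_instances("shwd_val", {}, "shwd_val.json", "shwd/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("shwd_train",)
cfg.DATASETS.TEST = ("shwd_val",)
cfg.MODEL.BACKBONE.FREEZE_AT = 5          # freeze the ResNet stages; only RPN and box head learn
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1       # single merged "head" category
cfg.SOLVER.MAX_ITER = 6800                # matches the 6800 iterations reported above
cfg.SOLVER.IMS_PER_BATCH = 4              # assumed batch size
cfg.SOLVER.BASE_LR = 0.001                # assumed learning rate

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```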

3.3. Head-Integrated Keypoint R-CNN

The original Keypoint R-CNN [38] is from Detectron2 [18]. The newly added head-related branches are shown in green (Figure 6) and are re-named RPN_head and head_branch in ROI_heads. The model aims to extract the bounding boxes of all the workers in each frame.
Keypoint R-CNN allows head branches to be added without affecting the original target detection, instance segmentation, or keypoint detection branches, avoiding the need for overall re-training. Keypoint R-CNN is a variant of Faster R-CNN [24] that includes an extra keypoint branch. The weights file used is ‘model_final_04e291.pkl’ [18], which can be found on the homepage of Detectron2. The model comprises three modules: FPN, RPN, and ROI.
To detect small targets such as worker heads, we used one frame (1080 × 1920 × 3) as the input. The FPN performed feature extraction, resulting in five 256-channel feature maps (P2–P6), as shown in Figure 6.
Next, the RPN module generated anchor boxes using the five feature maps P2–P6. The anchors use three aspect ratios (0.5, 1.0, and 2.0) and five scales (32, 64, 128, 256, and 512), with one scale assigned to each pyramid level, so each feature-map location generates three anchor boxes. For each frame, the RPN therefore searches 257,796 (= 192 × 336 × 3 + 96 × 168 × 3 + 48 × 84 × 3 + 24 × 42 × 3 + 12 × 21 × 3) anchor boxes. After applying NMS (threshold = 0.7), only the top 1000 proposals per frame were retained.
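The anchor count of 257,796 quoted above and in Algorithm 1 follows directly from the per-level grid sizes and the three aspect ratios; a small arithmetic check is shown below.

```python
# Verify the 257,796 anchors/proposals quoted in Section 3.3 and Algorithm 1.
# Each FPN level contributes (grid_h * grid_w * 3 aspect ratios) anchors.
grids = [(192, 336), (96, 168), (48, 84), (24, 42), (12, 21)]   # P2..P6 grid sizes as quoted
aspect_ratios = 3                                               # 0.5, 1.0, 2.0

total = sum(h * w * aspect_ratios for h, w in grids)
print(total)   # 257796
```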
Finally, no more than 2000 proposals in total were left for the head and body. After the ROI-Align quantification operation, each proposal on each feature map was transformed into a 7 × 7 × 256 feature map. The head_branch and body_branch in Figure 6 are used for classification and regression, whereas the keypoint_branch calculates the joint coordinates of the workers.

3.4. Post-Processing

The head-integrated Keypoint R-CNN is applied to a single frame, but on its own it can only obtain outcomes similar to the middle frame in Figure 2, because the separate head and body branches do not exchange information. An additional post-processing stage is needed to use head motion information across multiple frames to assist full-body recognition in the current frame.

3.4.1. Interpolation of Center Points

The linear interpolation uses direction restrictions for the center coordinates but not for the bounding box, whose (t, l, w, h) will be rectified in Section 3.4.2. This operation considers the head movement and utilizes three consecutive frames for the center coordinates’ linear interpolation. It makes judgments regarding the stationarity or movement of the head and body. When occluded, like in Figure 1, the model may fail to draw the bounding box, as shown in the blue dashed rectangle in Figure 7 of the worker ID = 0, and the trajectory appears interrupted.
As to missing heads, because the head size is relatively fixed and only slightly influenced by action or behavior, this study adopted a three-frame interpolation to infer the missing head’s central point coordinates, as described in Equation (1) and head_Imputer in Figure 7.
x_{c,\mathrm{head}}^{t} = x_{c,\mathrm{head}}^{t_2} + \left( x_{c,\mathrm{head}}^{t_2} - x_{c,\mathrm{head}}^{t_1} \right), \quad y_{c,\mathrm{head}}^{t} = y_{c,\mathrm{head}}^{t_2} + \left( y_{c,\mathrm{head}}^{t_2} - y_{c,\mathrm{head}}^{t_1} \right), \quad t_1 < t_2 < t \qquad (1)
When a head detection is missed, the interpolation operation predicts the likely result for the current frame using the data of the previous two frames. However, workers may stop moving at any point during manual labor; if they remain stationary, the interpolation must be suppressed. Overall, there are three potential states for heads:
(1)
Not stationary and not out of view: In this most common state, simple linear interpolation is appropriate.
(2)
Stationary and not out of view: This state applies to workers standing still with minimal movement in their head position. If the head motion trajectory displacement is ≤1 pixel and the head bounding box confidence ≥ 0.0001 in successive frames, the head is stationary.
(3)
Not stationary and out of view: Workers may exit the frame or be obscured by large objects. If the head confidence is ≤0.0001, regardless of the head motion displacement, the worker is considered to have left the field of view, and all of the values are set to NaN.
To infer the missing body’s central point coordinates, as in Equation (2) and body_Imputer in Figure 7, when the body’s motion along the horizontal or vertical axis has the same direction as the head’s, the body’s own data are applied; otherwise, the head data are used.
\begin{cases} x_{c,\mathrm{body}}^{t} = x_{c,\mathrm{body}}^{t_2} + \left( x_{c,\mathrm{body}}^{t_2} - x_{c,\mathrm{body}}^{t_1} \right), & \text{same direction of } x \\ x_{c,\mathrm{body}}^{t} = x_{c,\mathrm{body}}^{t_2} + \left( x_{c,\mathrm{head}}^{t_2} - x_{c,\mathrm{head}}^{t_1} \right), & \text{opposite direction} \\ y_{c,\mathrm{body}}^{t} = y_{c,\mathrm{body}}^{t_2} + \left( y_{c,\mathrm{body}}^{t_2} - y_{c,\mathrm{body}}^{t_1} \right), & \text{same direction of } y \\ y_{c,\mathrm{body}}^{t} = y_{c,\mathrm{body}}^{t_2} + \left( y_{c,\mathrm{head}}^{t_2} - y_{c,\mathrm{head}}^{t_1} \right), & \text{opposite direction} \end{cases} \quad t_1 < t_2 < t \qquad (2)
There are two potential states for bodies:
(1)
When there is a related head detection, the body can be determined to be stationary or leaving according to the head trajectory.
(2)
When there is no associated head detection, keypoints must assist the stationarity decision, as in the following situations: the worker is unlikely to move smoothly under rarer action behaviors (bending, squatting, lying down, and so on); the worker is walking toward the peripheral edges of the view; or the body bounding box confidence score is ≤0.01.
The above rules rely on several thresholds, which may require some tuning effort. However, this is currently the simplest way to incorporate motion information; a minimal sketch of the interpolation logic is given below.
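The sketch below illustrates the head-side logic of this subsection: Equation (1) plus the stationary and out-of-view checks. The box tuple layout and helper structure are simplifying assumptions; the thresholds are those stated above.

```python
# Sketch of three-frame head interpolation (Equation (1)) with the state checks
# from Section 3.4.1. Boxes are assumed to be (cx, cy, w, h, confidence) tuples.
import math

CONF_EPS = 1e-4     # confidence threshold for "out of view"
STILL_PIX = 1.0     # <= 1 pixel displacement -> considered stationary

def interpolate_head(head_t1, head_t2):
    """head_t1, head_t2: head boxes at the two previous frames (t1 < t2 < t)."""
    if head_t1 is None or head_t2 is None:
        return None                                    # not enough history to interpolate
    cx1, cy1, _, _, _ = head_t1
    cx2, cy2, w2, h2, c2 = head_t2
    if c2 <= CONF_EPS:
        return None                                    # worker has left the field of view
    if math.hypot(cx2 - cx1, cy2 - cy1) <= STILL_PIX:
        return head_t2                                 # stationary: reuse the last box
    # Equation (1): linear extrapolation of the center; head size stays fixed.
    return (cx2 + (cx2 - cx1), cy2 + (cy2 - cy1), w2, h2, c2)
```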

3.4.2. Refinement of Widths and Heights

Refinement comprises bounding box regression and rectification, which are responsible for the final top-left vertex coordinates, height, and width of the nonstationary cases.
Faster R-CNN networks [22,24,29] must utilize bounding box regression algorithms to improve the final detection accuracy. The proposed candidates may not be completely accurate, and the regression algorithm proposed by Girshick et al. [41] can bring the predicted target frame closer to the nearby ground truth, as described in Equation (3).
Figure 8 shows the refinement framework, which is composed of fully connected layers and is separate from the Keypoint R-CNN. The refinement takes the ROI-Align output as its sole input, flattening the 7 × 7 × 256 feature into a 12,544-d vector. Next, two 1024-d layers produce the 4-d regression deltas and 2-d confidence scores.
t_x = \mathrm{weight}_x \times (G_x - P_x)/P_w, \quad t_y = \mathrm{weight}_y \times (G_y - P_y)/P_h, \quad t_w = \mathrm{weight}_w \times \log(G_w/P_w), \quad t_h = \mathrm{weight}_h \times \log(G_h/P_h) \qquad (3)
where G = (Gx, Gy, Gw, Gh) is the ground-truth bounding box and P = (Px, Py, Pw, Ph) is the prediction box determined by Keypoint R-CNN. The scaling-factor weights were initially set so that the deltas had approximately unit variance and are treated as hyperparameters: weightx = 10, weighty = 10, weightw = 5, and weighth = 5.
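For reference, applying predicted deltas back to a proposal, i.e., the inverse of Equation (3), can be sketched as follows with the same weights; the (center x, center y, width, height) box layout is an assumption of this sketch.

```python
# Apply regression deltas to a proposal box (inverse of Equation (3)).
# Boxes are assumed to be (x_center, y_center, width, height).
import math

WEIGHTS = (10.0, 10.0, 5.0, 5.0)   # weight_x, weight_y, weight_w, weight_h from the text

def apply_deltas(proposal, deltas):
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    wx, wy, ww, wh = WEIGHTS
    gx = px + (tx / wx) * pw        # shift the center by a fraction of the proposal size
    gy = py + (ty / wy) * ph
    gw = pw * math.exp(tw / ww)     # width/height are scaled exponentially
    gh = ph * math.exp(th / wh)
    return (gx, gy, gw, gh)
```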
Girshick et al. [41] observed that bounding box regression may not always be reliable. Therefore, we devised a four-step rectification operation for the bounding boxes:
(1)
In adjacent frames, represent the body-center-to-head-center vector (green arrow line in Figure 7) in polar coordinates and keep the change of its modulus and argument within 0.5× head width and 5°, respectively. If this cannot be guaranteed, reposition the center of the body bounding box based on the center of the head (see the sketch after this list).
(2)
Keep the body’s center unchanged and adjust the position of the body’s top-left vertex. Limit the movement of the top-left vertex of the entire body to approximately 0.25× head width.
(3)
If the two foot joint points are detected, the height may change significantly; otherwise, it changes only slightly, according to the “near-larger, far-smaller” perspective principle and the head size.
(4)
The top edge of the bounding box can be adjusted using the head data. Because the head cannot be separated from the whole body, we rectified the whole-body top-edge position according to changes in the relative position of the head center; the displacement of the body’s top edge from the head center does not change significantly between adjacent frames.
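A sketch of rectification step (1), the polar-coordinate check on the body-center-to-head-center vector, might look as follows; centers are simple (x, y) tuples and the return convention is an assumption.

```python
# Sketch of rectification step (1): constrain the change of the body-to-head
# center vector between adjacent frames (modulus <= 0.5 * head width, argument <= 5 degrees).
import math

MAX_ANGLE_DEG = 5.0

def rectify_body_center(head_prev, body_prev, head_cur, body_cur, head_width):
    """Return a possibly corrected body center (x, y) for the current frame."""
    def polar(head, body):
        dx, dy = body[0] - head[0], body[1] - head[1]
        return math.hypot(dx, dy), math.atan2(dy, dx)
    r_prev, a_prev = polar(head_prev, body_prev)
    r_cur, a_cur = polar(head_cur, body_cur)
    d_angle = math.degrees(abs(a_cur - a_prev))        # angle wrap-around ignored for brevity
    if abs(r_cur - r_prev) <= 0.5 * head_width and d_angle <= MAX_ANGLE_DEG:
        return body_cur                                # within tolerance: keep the detection
    # Otherwise re-anchor the body center on the current head using the previous offset.
    return (head_cur[0] + r_prev * math.cos(a_prev),
            head_cur[1] + r_prev * math.sin(a_prev))
```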

3.5. Evaluation Metrics

Evaluation metrics assess the detection performance in object detection from a single image; metrics such as accuracy, precision, and recall are typically employed, as in Equations (4)–(6). Precision and recall are usually used to measure the performance of classification models, as they focus on the classification ability of the model in positive samples. Accuracy cannot objectively evaluate algorithm performance when there are unbalanced positive and negative samples; therefore, it is not commonly used.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP}{TP + FP + FN} \qquad (4)
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (5)
\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (6)
where True Positive (TP) is the number of correctly classified detections, those with an IoU ≥ 0.5. False Positive (FP) is the number of incorrectly classified detections, those with an IoU < 0.5 or where a non-worker object is identified as the worker class. False Negative (FN) is the number of missed detections. True Negative (TN) is the number of correctly identified non-worker objects; when there is only one class in the dataset, TN = 0.
In addition, the evaluation metrics in MOTChallenge [42,43] supply a standardized basis for comparison, parts of which relate to video-based detection. The identity F1-score (IDF1) [44] is described in Equation (7).
\mathrm{IDF1} = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN} \qquad (7)
where IDTP (true positive IDs) is the number of correctly assigned IDs throughout the entire video, IDFP (false positive IDs) is the number of incorrectly assigned IDs, and IDFN (false negative IDs) is the number of missed IDs.
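Equations (4)–(7) reduce to a few lines of code; a minimal sketch of the metric calculations follows.

```python
# Metrics from Equations (4)-(7). With a single worker class, TN = 0.
def accuracy(tp, fp, fn, tn=0):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def idf1(idtp, idfp, idfn):
    return 2 * idtp / (2 * idtp + idfp + idfn)
```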

4. Results

4.1. Experimental Details

The head-integrated Keypoint R-CNN was implemented in Python 3.8.8 and PyTorch 1.8.0, and the Keypoint R-CNN was modified from Detectron2. The computer had an NVIDIA RTX 3080 GPU, an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz, and one 32 GB RAM module, running Windows 10.
The nine video testing datasets in this study are the same as in Xiao et al. [36] and Konstantinou et al. [11], all captured at 30 fps with a resolution of 1920 × 1080. Owing to the study’s focus on occlusion, Table 3 shows that more than 500 of the 2148 total frames involve occlusion challenges.

4.2. Quantitative Results

We generated benchmark results for the nine-video dataset, listed in Table 4. This approach avoided the need to re-train DNNs and, for the first time, presented comprehensive and standard worker-detection metrics aligned with MOTChallenge.

4.3. Qualitative Results

This section demonstrates the practicality of the proposed method when faced with four types of challenges, as described in [10] and in Figure 1.
(1) Successfully resolved intra-class occlusion. Video 4 has five workers in frame #2 (left column images in Figure 9). From frames #42 to #46, the bounding-box outputs of the head-integrated Keypoint R-CNN with (lower-middle image in Figure 9) or without (upper-middle image in Figure 9) post-processing are the same. But when it comes to frame #47, ID = 1 and ID = 2 will have intra-class occlusion problems (occluder and occludee are the same worker category [13]). Without post-processing, the identification in the upper-right image will miss the bounding box of ID = 1, whereas the lower-right image shows the success of our proposed method.
(2) Successfully resolved inter-class occlusion. Video 7 has four workers in frame #9 (left column images in Figure 10). From frames #41 to #78, the head-integrated Keypoint R-CNN runs normally. But in frame #79, ID = 1 will have inter-class occlusion problems (occluder is the fence category, occludee is the worker category [13]). Our proposed method uses motion information from the previous two frames to predict the right position of ID = 1, while the CNN failure is shown in the upper-right image in Figure 10.
(3) Resolved low confidence-score detections. In the absence of any occlusions, it is also possible that detections of workers may have lower confidence scores due to the inherent aspects of CNN calculation. Under the NMS mechanism, such low-score bounding boxes will be deleted as FNs. The proposed method can recycle these FN bounding boxes successfully. As shown in Figure 11, Video 3 has nine workers in frame #11 (left column images in Figure 11); due to the long distance between ID = 7 and the camera, the confidence score of his bounding box is below the NMS threshold and is deleted by CNNs (upper-right image). By adopting our method, worker 7 was assigned a reasonable bounding box successfully. Even a low score, such as in the case of worker 8, can result in a bounding box (lower-right image).
(4) Successfully resolved unusual behavior. The worker ID = 0 changed their position from a standing posture to a squat and was subject to occlusion by the prefabrication lathe; this cannot be detected by CNNs, because there are seldom such unusual postures in pedestrian training datasets. ID = 0 maintains the correct detection results for width and height after our calculation, as shown in the lower-right image of Figure 12.

5. Discussion

5.1. Ground-Truth Annotations Standard

Xiao et al. [36] did not provide annotations; therefore, we had to manually label the ground-truth (GT) bounding boxes using MOT16_Annotator. To ensure a fair comparison, we maintained the same number of workers in each video during the annotation. In total, 11 workers were observed in the nine videos, as shown in Figure 13. The images of these workers either only showed their legs or were at a significant distance and did not show their faces.
The GT labeling principle is outlined in Figure 14, with eight workers in six videos (Videos 3, 4, 5, 7, 8, and 9) presenting this issue. That is, (1) labeling must be performed only on the visible parts if the entire body is not visible throughout the video. Workers in the red rectangles were never shown as full-size; therefore, the label was limited to only the visible part. (2) If the entire body is visible in the middle frames, the labeling of the whole-body size should be predictable. Workers’ images in the green rectangles show full-sized bodies in the early frames, allowing manual labeling with a decent prediction.
Some studies [37,45] claimed to have already effectively solved the occlusion problem but adopted a questionable GT labeling convention. Xiao et al. [37] take the position that “The objects… should not be annotated because more than 70% of the objects are occluded”, so when faced with the worker in the middle image of Figure 14, missing the worker in the green rectangle does not add any error, as no worker is assumed to be at this position. Such a convention produces artificially high metrics that other researchers can hardly exceed. Therefore, adhering to the annotation principle described here is a fair and logical way to address the occlusion-related research difficulty.
As explained above, we cannot expect a DNN that has never considered whole-body annotations of severely occluded workers to achieve good performance in detection models. This study has published the GT-box file in MOT16 format for reference on GitHub (https://github.com/lyy0095/Ground-truth-of-nine-video-dataset, accessed on 1 January 2024.).
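Because the published GT file follows the MOT16 format (one comma-separated line per box: frame, ID, left, top, width, height, confidence, class, visibility), it can be loaded with a few lines of Python; the file name below is a placeholder.

```python
# Load a MOT16-format ground-truth file into {frame: [(worker_id, left, top, w, h), ...]}.
import csv
from collections import defaultdict

def load_mot16_gt(path="gt.txt"):
    boxes = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            frame, wid = int(row[0]), int(row[1])
            left, top, w, h = map(float, row[2:6])
            boxes[frame].append((wid, left, top, w, h))
    return boxes
```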

5.2. Comparison with Other Methods

Compared with the other existing worker-detection methods described in Table 5, the proposed method performs at the SOTA level, with an IDF1 of 97.609%. Park et al. [35] and Konstantinou et al. [11] are early worker-detection studies that rely on traditional computer vision rather than deep learning; therefore, their accuracy is lower. Xiao et al. [36] re-trained on their new dataset but ignored GT annotations in severe occlusion situations, resulting in exaggerated metric values.
In our testing, the best person-detection performance by a DNN from the computer science field is achieved by ByteTrack [46] (bytetrack_x_mot17.pth.tar, 774,254 KB, downloaded from https://drive.google.com/uc?id=1P4mY0Yyd3PPTybgZkjMYhFri88nTmJX5, accessed on 1 January 2024). We conducted an exhaustive model search in the detection field but did not expect the outstanding detection model to come from the tracking field. Unfortunately, almost all bounding boxes in ByteTrack have a nearly identical aspect ratio; that is, most of the person annotations in its GT are in standing postures, which may not suit workers’ unusual behaviors.
The implementation source codes of ByteTrack [46], Deep OC_SORT [47], BoTSORT [48], OC_SORT [49], and Strong_SORT [50] come from [51], while TransTrack [52] and UniTrack [53] adopt the authors’ original codes, and DeepSORT uses an implementation from [54]. These methodologies all rely on YOLOX-x, as the anchor-free weights file (yolox_x.pth, 756 MB) outperformed all other detectors and facilitated a relatively fair comparison among the various tracking and association processes. Table 6 shows that the proposed method is also SOTA, even in comparison with eight advanced pedestrian detection methods.
Traditional pedestrian tracking methods (containing advanced detecting models) cannot smoothly solve the problem of worker tracking, as shown in Table 7.

5.3. Effectiveness and Limitations

First, we made reasonable use of the continuous movement information of workers. This study decisively abandoned the traditional mode of re-training on a new dataset and adopted a novel way to use head-based assistance instead. Meanwhile, we noticed the approximately linear motion amplitude evident in workers under continuous photography capture, and then interpolation and rectification were used as post-processing operations.
Second, we took advantage of camera orientation under construction scenarios. The head branch demonstrates good accuracy in detecting the head compared to the body, resulting in less impact from occlusions within a top-down perspective. When the body is not detectable due to occlusion, we can use the head information to estimate the body bounding box and rectify it. Furthermore, sustained head trajectories across frames provide a foundation for correcting body trajectories and ensuring the correct directions of body movements, determinations which cannot be obtained solely from the complex limb motions of the body.
Third, we conducted extensive model searches and comparative experiments. Based on our understanding of the processes of detecting and tracking, bigger models typically exhibit better performance but require higher computational costs. We cannot expect the construction industry to obtain such advanced computing resources in order to adapt to their unique scenarios; therefore, models that do not require re-training are the best choice. Under such an idea, low coupling and high reusability DNNs are more suitable for practical applications in the construction field. That said, if we find a better model for detection of workers, it will be easy to add the head branch in practice.
Fourth, the method integrates detection and tracking. The construction video is not a single image, but a time series containing information derived from previous frames, containing cues of workers’ motions and behaviors; therefore, the approaches limited to single frame detection are too rigid and lag behind academic trends. Due to its length limitations, this study did not provide details of tracking algorithms; these will be specifically mentioned in subsequent papers.
The limitations of the present work include two aspects. Firstly, although the test dataset is consistent with those of existing articles [11], the number of workers [36] in the nine videos is still small, and the complexity of the scenes needs to be increased. Secondly, no suitable open-source head-detection model was found, so the head branch required 1.25 h of training.

5.4. Theoretical and Practical Implications

In theory, this study challenges the dependency of DNNs on repeated re-training and suggests that reasonable modifications to a general model can also achieve good detection results for workers. During the modification process, head-based information and motion-based information from adjacent frames were used, a combination which proved suitable for construction sites with frequent occlusions and unusual behaviors.
In practice, the original Keypoint R-CNN can be used without re-training if a proper head-detection branch exists. The training-free model reduces the demand for computing resources and the need for manual annotation of new datasets. Such a plug-and-play mindset suits engineers; after all, engineering management is not a computer-science competition chasing better metrics, and demanding requirements and hard-to-deploy implementations limit models’ practical applications.
After detection, future research will focus on worker tracking in complex scenarios and trajectory prediction of worker–vehicle interactions.
In summary, this study not only provides new solutions for worker detection problems in complex construction scenarios but also lays the foundation for promoting the widespread application of this technology in the construction industry. We believe that with the continuous progress and optimization of technology, work in this field will usher in broader development prospects.

6. Conclusions

This study has proposed a novel head-integrated Keypoint R-CNN to address the complex construction scenario challenges of inter- and intra-class occlusions, as well as the missed-target challenges caused by low confidence scores and unusual behaviors. The model breaks the DNN convention of relying heavily on new datasets for re-training and provides a new perspective with a motion-information post-processing module for worker detection. Furthermore, we demonstrated that conventional detection DNNs designed for pedestrians, even those with a not-accurate-enough ResNet-50 backbone, can be used for worker detection without re-training. Compared to models fixed to specific scenarios, an open universal model usually provides better performance and wider applicability, whereas a model re-trained on new datasets but not publicly released provides little practical assistance for other researchers. In the future, if a more suitable Keypoint R-CNN were to be launched by Detectron2, the proposed model could easily update and improve its weights.

Author Contributions

Conceptualization, Y.L. and Y.W.; methodology, Y.L.; software, Y.L.; validation, Y.L., Z.Z. and C.S.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.L.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and Z.Z.; visualization, Y.L.; supervision, Y.W. and C.S.; project administration, Y.L. and Y.W.; funding acquisition, Y.W. and C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 71901082) and (No. 51878026).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Occupational Safety and Health Administration. Construction Industry. 2023. Available online: https://www.osha.gov/construction (accessed on 29 August 2023).
  2. U.S. Bureau of Labor Statistics. Census of Fatal Occupational Injuries Summary. 2021. Available online: https://www.bls.gov/news.release/cfoi.nr0.htm (accessed on 29 August 2023).
  3. Xu, W.; Wang, T.-K. Dynamic safety prewarning mechanism of human–machine–environment using computer vision. Eng. Constr. Archit. Manag. 2020, 27, 1813–1833. [Google Scholar] [CrossRef]
  4. Occupational Safety and Health Administration, OSH Act of 1970. Available online: https://www.osha.gov/laws-regs/oshact/toc (accessed on 29 August 2023).
  5. Mokhtari, F.; Cheng, Z.; Wang, C.H.; Foroughi, J. Advances in Wearable Piezoelectric Sensors for Hazardous Workplace Environments. Glob. Chall. 2023, 7, 2300019. [Google Scholar] [CrossRef] [PubMed]
  6. Duan, P.; Zhou, J.; Tao, S. Risk events recognition using smartphone and machine learning in construction workers’ material handling tasks. Eng. Constr. Archit. Manag. 2023, 30, 3562–3582. [Google Scholar] [CrossRef]
  7. Cai, J.; Zhang, Y.; Yang, L.; Cai, H.; Li, S. A context-augmented deep learning approach for worker trajectory prediction on unstructured and dynamic construction sites. Adv. Eng. Inform. 2020, 46, 101173. [Google Scholar] [CrossRef]
  8. Zhang, M.Y.; Cao, T.Z.; Zhao, X.F. Applying Sensor-Based Technology to Improve Construction Safety Management. Sensors 2017, 17, 1841. [Google Scholar] [CrossRef] [PubMed]
  9. Gondo, T.; Miura, R. Accelerometer-Based Activity Recognition of Workers at Construction Sites. Front. Built Environ. 2020, 6, 563353. [Google Scholar] [CrossRef]
  10. Ning, C.; Menglu, L.; Hao, Y.; Xueping, S.; Yunhong, L. Survey of pedestrian detection with occlusion. Complex Intell. Syst. 2021, 7, 577–587. [Google Scholar] [CrossRef]
  11. Konstantinou, E.; Lasenby, J.; Brilakis, I. Adaptive computer vision-based 2D tracking of workers in complex environments. Autom. Constr. 2019, 103, 168–184. [Google Scholar] [CrossRef]
  12. Li, F.; Li, X.; Liu, Q.; Li, Z. Occlusion Handling and Multi-Scale Pedestrian Detection Based on Deep Learning: A Review. IEEE Access 2022, 10, 19937–19957. [Google Scholar] [CrossRef]
  13. Zhan, G.; Xie, W.; Zisserman, A. A Tri-Layer Plugin to Improve Occluded Detection. arXiv 2022, arXiv:2210.10046. [Google Scholar] [CrossRef]
  14. Ke, L.; Tai, Y.-W.; Tang, C.-K. Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers. arXiv 2021, arXiv:2103.12340. [Google Scholar] [CrossRef]
  15. Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. PedHunter: Occlusion Robust Pedestrian Detector in Crowded Scenes. arXiv 2019, arXiv:1909.06826. [Google Scholar] [CrossRef]
  16. Wang, Q.; Chang, Y.-Y.; Cai, R.; Li, Z.; Hariharan, B.; Holynski, A.; Snavely, N. Tracking Everything Everywhere All at Once. arXiv 2023, arXiv:2306.05422. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  18. Facebookresearch, Detectron2. 2023. Available online: https://github.com/facebookresearch/detectron2 (accessed on 30 August 2023).
  19. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar] [CrossRef]
  20. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  21. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2014, arXiv:1409.0575. [Google Scholar] [CrossRef]
  22. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  23. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  25. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar] [CrossRef]
  26. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. arXiv 2019, arXiv:1904.08189. [Google Scholar] [CrossRef]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  28. Rekavandi, A.M.; Rashidi, S.; Boussaid, F.; Hoefs, S.; Akbas, E. Transformers in Small Object Detection: A Benchmark and Survey of State-of-the-Art. arXiv 2023, arXiv:2309.04902. [Google Scholar] [CrossRef]
  29. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar] [CrossRef]
  30. MOTChallenge, Pedestrian Detection Challenge. 2014. Available online: https://motchallenge.net/data/MOT20/ (accessed on 29 August 2023).
  31. Lin, C.Y.; Xie, H.X.; Zheng, H. PedJointNet: Joint Head-Shoulder and Full Body Deep Network for Pedestrian Detection. IEEE Access 2019, 7, 47687–47697. [Google Scholar] [CrossRef]
  32. Zhang, Y.; Chen, H.; Bao, W.; Lai, Z.; Zhang, Z.; Yuan, D. Handling Heavy Occlusion in Dense Crowd Tracking by Focusing on the Heads. arXiv 2023, arXiv:2304.07705. [Google Scholar] [CrossRef]
  33. Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. Relational Learning for Joint Head and Human Detection. arXiv 2019, arXiv:1909.10674. [Google Scholar] [CrossRef]
  34. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  35. Park, M.W.; Brilakis, I. Continuous localization of construction workers via integration of detection and tracking. Autom. Constr. 2016, 72, 129–142. [Google Scholar] [CrossRef]
  36. Xiao, B.; Xiao, H.; Wang, J.; Chen, Y. Vision-based method for tracking workers by integrating deep learning instance segmentation in off-site construction. Autom. Constr. 2022, 136, 104148. [Google Scholar] [CrossRef]
  37. Xiao, B.; Lin, Q.; Chen, Y. A vision-based method for automatic tracking of construction machines at nighttime based on deep learning illumination enhancement. Autom. Constr. 2021, 127, 103721. [Google Scholar] [CrossRef]
  38. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.-Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 30 August 2023).
  39. Kim, M.; Ham, Y.; Koo, C.; Kim, T.W. Simulating travel paths of construction site workers via deep reinforcement learning considering their spatial cognition and wayfinding behavior. Autom. Constr. 2023, 147, 104715. [Google Scholar] [CrossRef]
  40. njvisionpower, Safety-Helmet-Wearing-Dataset. 2019. Available online: https://github.com/njvisionpower/Safety-Helmet-Wearing-Dataset (accessed on 30 August 2023).
  41. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2013, arXiv:1311.2524. [Google Scholar]
  42. Dendorfer, P.; Ošep, A.; Milan, A.; Schindler, K.; Cremers, D.; Reid, I.; Roth, S.; Leal-Taixé, L. MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking. arXiv 2020, arXiv:2010.07548. [Google Scholar] [CrossRef]
  43. Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv 2015, arXiv:1504.01942. [Google Scholar] [CrossRef]
  44. Bernardin, K.; Stiefelhagen, R. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. Eurasip J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  45. Xiao, B.; Kang, S.-C. Vision-Based Method Integrating Deep Learning Detection for Tracking Multiple Construction Machines. J. Comput. Civ. Eng. 2021, 35, 04020071. [Google Scholar] [CrossRef]
  46. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2022, arXiv:2110.06864. [Google Scholar] [CrossRef]
  47. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-Pedestrian Tracking by Adaptive Re-Identification. arXiv 2023, arXiv:2302.11813. [Google Scholar] [CrossRef]
  48. Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
  49. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. arXiv 2022, arXiv:2203.14360. [Google Scholar] [CrossRef]
  50. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. arXiv 2022, arXiv:2202.13514. [Google Scholar] [CrossRef]
  51. Mikel-Brostrom, Yolo_Tracking. 2023. Available online: https://github.com/mikel-brostrom/yolo_tracking#real-time-multi-object-segmentation-and-pose-tracking-using-yolov8--yolo-nas--yolox-with-deepocsort-and-lightmbn (accessed on 31 August 2023).
  52. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple Object Tracking with Transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar] [CrossRef]
  53. Wang, Z.; Zhao, H.; Li, Y.-L.; Wang, S.; Torr, P.H.S.; Bertinetto, L. Do Different Tracking Tasks Require Different Appearance Models? arXiv 2021, arXiv:2107.02156. [Google Scholar] [CrossRef]
  54. pmj110119, YOLOX_deepsort_tracker. 2021. Available online: https://github.com/pmj110119/YOLOX_deepsort_tracker (accessed on 31 August 2023).
Figure 1. Challenges in detecting workers: the red worker in intra-class occlusion; the blue one in inter-class occlusion; the green one with low confidence; the black one in unusual behaviors.
Figure 2. Examples of CNN limitations.
Figure 3. Flowchart of the proposed model.
Figure 4. Head anthropometric data and the average speed of workers.
Figure 5. Loss of head_branch training over 6800 steps.
Figure 6. Head-integrated Keypoint R-CNN.
Figure 7. Process illustrations of linear interpolation: the red arrow is interpolation for the head, while the yellow arrow is that for the body.
Figure 8. Process illustrations of refinement: different colors represent different numbers of weights in the layers.
Figure 9. Example of detection results for Video 4.
Figure 10. Example of detection results for Video 7.
Figure 11. Example of detection results for Video 3.
Figure 12. Example of detection results for Video 5.
Figure 13. Examples of omitted workers in the dataset of the nine videos.
Figure 14. Examples of the GT labeling principle.
Table 1. Strengths and limitations of SOTA models.
Index | SOTA Models | Strengths | Limitations
1 | Faster R-CNN [24] | Anchor-based, two-stage, low-coupling, easy to deploy. | Limited performance when heavily occluded.
2 | CenterNet [26] | Anchor-free, fast speed, high accuracy, and low memory usage. | One stage, high-coupling, more False Positives (FP).
3 | Transformer [27] | CNN backbone, encoder–decoder, and no post-processing. | Billions of parameters, several days’ training, CNN backbone, not good in dealing with occlusions.
4 | YOLOX-x [34] | Many models for download, outstanding in tracking field, good to address occlusions (but supposes all occluded persons are standing). | One stage, high-coupling, better GPU needed (>16 GB), hard to fine-tune.
5 | Joint SimOTA [32] | YOLOX-x-based, head and body features are detected simultaneously. | One stage, high-coupling, the head and body must belong to the same person in re-training datasets, standing persons are in larger proportion.
6 | Park et al. [35] | Early studies, but good baseline for future researchers. | Limited to the upper body, 64 × 128 HOG template to locate a worker, not DNN.
7 | Konstantinou et al. [11] | First to propose workers’ videos, a good baseline for future researchers. | Heuristic computer vision model, not DNN.
8 | Xiao et al. [36] | First to align with [11] datasets, two-stage, low-coupling, Mask R-CNN [29] variation. | Needs a manual collection of workers’ images, needs re-training for Mask R-CNN, ignores heavy occlusions (>70% [37]), non-public annotation files.
9 | Our model | Preserves alignment with [11] datasets, two-stage, low-coupling, public annotation files on GitHub, and integrates head and motion information. Head and body features can be detected separately. | Needs trained head branch, head and body features have no relations during detections.
Table 2. AP of the bounding box.
Branch | AP | AP50 | AP75 | APs | APM | APL | Notes
Head | 40.0 | 81.2 | 33.6 | 29.9 | 55.2 | 71.4 | Fast-trained in 1.25 h
Body | 53.6 | 82.1 | 58.0 | 70.8 | 61.4 | 36.0 | No re-training
Keypoints | 63.9 | 86.3 | 69.2 | N/A | 59.4 | 72.3 | No re-training
Table 3. The dataset from the summarization of the nine videos.
Video # | Total Frames | Workers | Occlusion-Challenged Frames
Video 1 | 228 | 1 | No occlusion
Video 2 | 259 | 3 | #139→#259
Video 3 | 211 | 9 | #14→#23, #195→#211
Video 4 | 197 | 5 | #38→#57, #168→#181
Video 5 | 355 | 5 | #48→#73
Video 6 | 132 | 4 | No occlusion
Video 7 | 358 | 4 | #94→#313
Video 8 | 259 | 6 | #224→#243
Video 9 | 149 | 4 | #73→#149
In Total | 2148 | 41 | >500
Table 4. Dataset of the performance of the proposed method for the nine videos.
# | Metrics | Video 1 | Video 2 | Video 3 | Video 4 | Video 5 | Video 6 | Video 7 | Video 8 | Video 9 | Combined
1 | IDF1↑ (%) | 100 | 97.04 | 96.966 | 96.824 | 97.468 | 99.431 | 97.73 | 98.422 | 97.007 | 97.609
2 | MT↓ | 1 | 3 | 9 | 5 | 5 | 4 | 4 | 6 | 4 | 41
3 | ML↓ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | FP↓ | 0 | 23 | 48 | 41 | 79 | 4 | 25 | 25 | 21 | 266
5 | FN↓ | 0 | 23 | 54 | 21 | 9 | 2 | 17 | 24 | 13 | 163
6 | Rcll↑ (%) | 100 | 97.04 | 96.793 | 97.826 | 99.472 | 99.62 | 98.154 | 98.454 | 97.695 | 98.173
7 | Prcn↑ (%) | 100 | 97.04 | 97.139 | 95.842 | 95.544 | 99.242 | 97.309 | 98.39 | 96.329 | 97.052
8 | DetA↑ (%) | 93.023 | 75.624 | 78.646 | 72.991 | 69.741 | 71.198 | 76.396 | 80.397 | 76.77 | 75.591
9 | DetRe↑ (%) | 94.298 | 81.474 | 84.229 | 79.906 | 78.957 | 77.547 | 82.336 | 84.926 | 83.408 | 82.252
10 | DetPr↑ (%) | 94.298 | 81.474 | 84.53 | 78.285 | 75.839 | 77.253 | 81.627 | 84.871 | 82.241 | 81.313
11 | Hz↑ (fps) | 9.05 | 6.01 | 5.69 | 5.96 | 7.11 | 6.69 | 6.62 | 5.33 | 6.82 | 6.58
‘↑’ means higher is better, ‘↓’ means lower is better.
Table 5. IDF1 comparison between methods.
IDF1↑ (%) | Video 1 | Video 2 | Video 3 | Video 4 | Video 5 | Video 6 | Video 7 | Video 8 | Video 9 | Average
[35] | 55.5 | 50.8 | N/A | N/A | 43.9 | 46.9 | 15.7 | 25.8 | 61.0 | 42.8
[11] | 83.7 | 77.3 | N/A | N/A | 74.4 | 82.3 | 65.3 | 54.1 | 81.6 | 74.1
[36] | 99.8 | 99.2 | 93.4 | 98.1 | 99.4 | 99.2 | 95.3 | 99.6 | 94.8 | 97.6
Our | 100 | 97.04 | 96.966 | 96.824 | 97.468 | 99.431 | 97.73 | 98.422 | 97.007 | 97.609
‘↑’ means higher is better.
Table 6. Pedestrian detecting performance on the nine-video dataset of workers.
# | Metrics | DeepSORT | ByteTrack | Deep OC_SORT | BoTSORT | OC_SORT | Strong_SORT | TransTrack | UniTrack
1 | IDF1↑ (%) | 79.771 | 81.829 | 74.471 | 77.365 | 80.633 | 77.065 | 70.459 | 61.887
2 | MT | 35 | 33 | 28 | 24 | 27 | 29 | 27 | 15
3 | ML | 1 | 3 | 4 | 4 | 4 | 4 | 5 | 10
4 | FP↓ | 1761 | 1980 | 2167 | 1191 | 958 | 2211 | 3151 | 3124
5 | FN↓ | 902 | 1359 | 1669 | 2307 | 2065 | 1640 | 2253 | 3334
6 | Rcll↑ (%) | 89.889 | 84.766 | 81.291 | 74.14 | 76.852 | 81.616 | 74.745 | 62.628
7 | Prcn↑ (%) | 81.994 | 79.25 | 76.993 | 84.741 | 87.74 | 76.707 | 67.909 | 64.137
8 | DetA↑ (%) | 63.151 | 58.219 | 55.053 | 54.82 | 59.114 | 55.128 | 46.062 | 40.81
9 | DetRe↑ (%) | 77.871 | 71.347 | 69.081 | 63.111 | 66.059 | 69.433 | 62.474 | 49.839
10 | DetPr↑ (%) | 71.032 | 66.704 | 65.428 | 72.135 | 75.418 | 65.256 | 56.76 | 51.041
11 | Hz | 1.77 | 9.09 | 8.33 | 7.14 | 8.47 | 7.69 | 5.13 | 4.30
‘↑’ means higher is better, ‘↓’ means lower is better.
Table 7. Samples of pedestrian detecting method outputs on the dataset of workers.
Samples of Output Images | Descriptions
Buildings 14 00859 i001 | ByteTrack loses worker ID = 0 in Video 5 #89 given the inter-class occlusion by the steel prefabrication lathe.
Buildings 14 00859 i002 | ByteTrack wrongly predicted worker ID = 8/9 in Video 3 #22 as full size, although their full-size bodies have not been visible from #1.
Buildings 14 00859 i003 | Deep OC_SORT cannot utilize motion information for absent detections, like worker ID = 4 in Video 9 losing his bounding box when occluded by steel mesh.
Buildings 14 00859 i004 | Vanilla DeepSORT failed to solve the intra-class occlusion of worker ID = 2 in Video 3 #14 and #18.
Buildings 14 00859 i005 | BoTSORT cannot filter out the FP detection of worker ID = 8 in Video 8.
