Article

Improving Time Study Methods Using Deep Learning-Based Action Segmentation Models

Faculty of Mechanical Engineering and Naval Architecture, University of Zagreb, Ivana Lučića Street 5, 10002 Zagreb, Croatia
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(3), 1185; https://doi.org/10.3390/app14031185
Submission received: 17 December 2023 / Revised: 26 January 2024 / Accepted: 29 January 2024 / Published: 31 January 2024
(This article belongs to the Special Issue Computer Vision in Human Activity Recognition and Behavior Analysis)

Abstract

In the quest for industrial efficiency, human performance within manufacturing systems remains pivotal. Traditional time study methods, reliant on direct observation and manual video analysis, are increasingly inadequate, given technological advancements. This research explores the automation of time study methods by deploying deep learning models for action segmentation, scrutinizing the efficacy of various architectural strategies. A dataset, featuring nine work activities performed by four subjects on three product types, was collected from a real manufacturing assembly process. Our methodology hinged on a two-step video processing framework, capturing activities from two perspectives: overhead and hand-focused. Through experimentation with 27 distinctive models varying in viewpoint, feature extraction method, and the architecture of the segmentation model, we identified improvements in temporal segmentation precision measured with the F1@IoU metric. Our findings highlight the limitations of basic Transformer models in action segmentation tasks, due to their lack of inductive bias and the limitations of a smaller dataset scale. Conversely, the 1D CNN and biLSTM architectures demonstrated proficiency in temporal data modeling, advocating for architectural adaptability over mere scale. The results contribute to the field by underscoring the interplay between model architecture, feature extraction method, and viewpoint integration in refining time study methodologies.

1. Introduction

Industry 4.0 marks a transformative shift in manufacturing, characterized by the integration of machines and humans within complex Cyber-Physical Systems (CPS) driven by extensive data collection through sensor networks [1,2]. In this advanced technological landscape, the agility of the system increasingly depends on skilled and trained employees [3], necessitating a re-evaluation of the interaction dynamics between machines and humans [1,4]. A pivotal aspect of this evolution is the efficiency of the human factor, which continues to significantly impact the overall productivity of manufacturing systems. This necessitates the monitoring and quantification of human efficiency, for which time-based metrics have emerged as a good indicator [5]. Traditional time study methodologies, including the use of stopwatches or Predetermined Time Standards (PTS) [5], alongside tools like worksheets or video recordings [6], have been the norm. However, these methods exhibit inefficiencies, such as the inability for real-time analysis in manual observation processes, limiting the scope and frequency of analyses [7]. Furthermore, manual video analysis is time-consuming, often requiring two to five hours for one hour of footage, and it is prone to subjectivity and inconsistency [6].
The field of computer vision, particularly action segmentation or detection from video input, offers a promising solution to these challenges [8]. This approach entails the automated recognition and duration estimation of human activities, with the potential to improve work productivity analysis and aid production planning and load balancing. Traditional machine learning algorithms, which were dominant in human activity recognition until a few years ago [8,9], required extensive domain knowledge and time investment for feature engineering. In contrast, deep learning methods circumvent manual feature extraction through the hierarchical learning of complex data representations, although they demand significant data and computational resources [10].
The deficiencies of existing time study methods and technological trends are the motivation for researching artificial intelligence techniques aimed at automating the recognition and estimation of the duration of human activities in manufacturing processes. This study aims to explore the application of deep learning models to enhance time study methodologies, by developing multiple action segmentation models and evaluating their efficiency on real manufacturing data. To achieve this goal, a sample was collected from a real manufacturing process, which consists of nine work activities. During the video recording of the process, the work activities were performed by four subjects on three different types of products, while the recording itself was performed from two different viewpoints. A total of 27 different models have been developed, which differ with respect to recording viewpoint, feature extraction method, and model architecture responsible for action segmentation.

2. Related Work

The problem of the simultaneous recognition and time segmentation of activities from video input, known as action segmentation or action detection, remains a critical area in computer vision research. The objective is to construct models capable of temporally segmenting and recognizing activities in untrimmed videos containing multiple temporal segments. “Action detection” typically refers to processing videos consisting of a predominantly background class (unlabeled activities) to detect sparse segments with labeled activities [11]. In contrast, “action segmentation” involves processing densely labeled videos and predicting action at every frame [12], a common scenario in manufacturing processes. This review focuses on deep learning-based action segmentation models and approaches used in a manufacturing context.
Early attempts to diminish the dimensionality of input data and define descriptive features commonly employed 2D convolutional neural networks (CNNs) such as VGG16 [12,13,14,15,16], VGG19 [17], AlexNet [18], and ResNet50 [19,20]. These models, pretrained on datasets like ImageNet, were fine-tuned for specific tasks. Since 2D CNNs can only capture spatial features from individual images, features derived from optical flow computations [14,15,16,20] or similar, less computationally demanding algorithms [12,13,21] were often used to incorporate a limited temporal context. Some researchers have employed 3D CNNs like C3D [22,23], which consider both spatial and short-term temporal components. Papers [11,24] combined 3D CNNs and optical flow for enhanced temporal feature representation. However, the main drawback of 3D CNNs is their computational cost compared to 2D CNNs.
Long Short-Term Memory (LSTM) and 1D CNN models have been popular for classification in action segmentation. For example, paper [13] utilized bidirectional LSTMs (biLSTM) to leverage information from both ends of a time series. This study introduced an approach using a combination of four distinct 2D CNN models, two focusing on full-image dimension data and the other two trained on person-centric data, to capture both location-dependent and -independent activity information fed into the biLSTM model. Paper [12] proposed an encoder–decoder architecture based on 1D convolutional layers hierarchy, designed to capture long-term temporal regularities. This model, compared to [13], showed much faster learning (30 times faster) on the same dataset, with better or comparable efficiency. Research study [14] combined ideas from [12,13], employing the architecture from [12] but replacing the encoder’s 1D CNN with biLSTM. These modifications resulted in slightly higher accuracy but slower learning. Study [19] suggested the simultaneous training of feature extraction and classification models, with ResNet50 for feature extraction and LSTM for classification, albeit requiring the processing of 2 s video clips. The results were marginally better than training the models separately but with a longer training time for the integrated model. In paper [16], improvements were proposed over similar research like [12,14] by designing a model with two parallel temporal branches. The branch responsible for analyzing videos at full temporal resolution for fine-grained action segmentation was named the residual branch. The other branch used downsampling and upsampling to process videos at various temporal resolutions, improving frame-by-frame classification accuracy. These branches were linked using a deformable temporal residual module employing deformable convolutional filters. However, this model struggled with accurately segmenting shorter actions between longer segments.
Ishikawa et al. [25] introduced an ASRF framework to address excessive segmentation in existing models like MS-TCN [24]. This framework had three phases, using features extracted via models like I3D [26] and refined via MS-TCN. The HASR framework was developed in [27] to refine predictions from models such as MS-TCN, establishing a hierarchy of features at both segment and video levels and refining the output with additional models like the Gated Recurrent Unit (GRU). Kaku et al. [28] addressed the limitations of existing approaches in processing very short activities and proposed an encoder–decoder architecture, with MS-TCN as the encoder and LSTM or biLSTM as the decoder.
Transformers have received much attention regarding their capabilities in solving sequence-to-sequence tasks related to natural language processing. This motivated research into their application in action segmentation because of the apparent similarity between these two problems. Yi et al. [29] proposed a Transformer model for action segmentation, adapting the original architecture to video data. They restricted the attention mechanism to local neighborhoods and re-engineered the decoder to learn temporal relations between segments. Paper [30] introduced a similar idea, combining a 1D convolution and attention mechanism in an encoder–decoder architecture. The authors of paper [31] suggested an alternative approach to action segmentation, advocating for direct segment prediction using an autoregressive method akin to natural language translation tasks. Their model was grounded in a modified Transformer architecture. A model entirely based on the Transformer architecture was proposed in paper [32], and its structure resembled a U-net network used in image segmentation, designed to build a feature hierarchy in the encoder and decoder. This study utilized local attention limited to the neighborhood of each frame.
In the context of manufacturing, a limited number of papers were found, with notable works being [6,7,33,34,35,36,37]. The application of classic machine learning models was predominant in these papers, where models like Hidden Markov Models [6,34,35] or Support Vector Machines [7] were employed following the manual extraction of features. Makantasis et al. [33] applied a deep learning model based on a 2D convolutional neural network and multi-layer perceptron, using manually created features with the Motion History Image algorithm. Zhang et al. [36] proposed an encoder–decoder framework for workflow recognition using a 3D CNN, transforming the activations of the last convolutional layer into clip-level representations, which were then fed into an LSTM network with an attention mechanism for enhanced recognition. In paper [37], a system comprising three stages was developed: spatial feature extraction using a Vectors Assembly Graph (VAG) and graph networks from RGB-D video frames; contact force feature extraction via a sliding window technique; and action segmentation through a multi-stage temporal convolution network (MS-TCN) that combines these features. Jiang et al. [7] collected data in laboratory conditions, while studies [33,34,35] were conducted using a dataset described in [38], which is no longer publicly available. Rude et al. [6] used a dataset from [39], collected with a depth sensor for painting manufactured parts, recording data over a single workday with two different workers. The XIOLIFT dataset was utilized in [36], capturing data from six production lines in an elevator factory over one working month. A newly created dataset of 11 assembly actions performed in laboratory conditions, featuring RGB-D videos and contact force data, was used in paper [37].
The need for further research in this domain is evident, particularly in applying deep learning algorithms to real manufacturing data without manual feature engineering. This gap in research, especially in manufacturing contexts, highlights the potential of deep learning models for the simultaneous recognition and temporal segmentation of human activities, a task still underexplored in various human endeavors, including manufacturing.

3. Problem Formulation

Our goal was the development of a deep learning model capable of recognizing and temporally segmenting a series of human activities from videos collected in manufacturing processes. The intention was to investigate the effectiveness of different model architectures for action segmentation. We also wanted to explore the effects of different feature extraction methods, such as fine-tuning 2D CNN models on this dataset and training 2D CNN models exclusively on this dataset. We were also interested in the effects of different viewpoints on the models’ results. Our research task involved processing video frames $X_{1:T} = (x_1, x_2, \ldots, x_T)$, with the goal of inferring the class label of each frame, $c_{1:T} = (c_1, c_2, \ldots, c_T)$, where $T$ is the video length. This task required identifying specific activities within the video and determining their durations.

4. Dataset

We collected a dataset from the manual assembly process of metal grills in a heating, ventilation, and air conditioning system (HVAC). Data collection spanned seven working days, with recordings conducted in manufacturing conditions (see Figure 1). This setting differentiates our dataset from those commonly found in action segmentation literature, which are usually collected in controlled environments or include videos of everyday human activities. The collected videos contain interactions between the worker, product parts, and tools. With minimal differences between individual activity classes, the identification of certain activities relies on models capable of capturing temporal context. Additionally, the background’s appearance remains constant throughout the videos, offering no information about the activity class.
The assembly activities were conducted on three different product types of varying dimensions by four workers, each with a unique ID (O1, O2, O3, O4), as illustrated in Figure 2. This variability captures the human factor in our dataset. The products within the dataset are labeled as T1, T2, and T3.
The videos were recorded from two viewpoints: an overhead (OH) viewpoint and a hand-focused (HF) viewpoint, as shown in Figure 3. This multi-viewpoint approach was driven by the objective of assessing the impact of viewpoint on model performance. The use of OH and HF viewpoints was inspired by research in [40], which suggested that using lower resolution images could accelerate the learning process but at the cost of losing vital visual details. Therefore, we opted for different viewpoints with lower resolutions focused on the work object.
The recordings were made using two cameras at a resolution of 1920 × 1080 and 30 frames per second in MP4 format. We performed subsampling of the video frames at a rate of 5 frames per second. This formatting aimed to expedite model training, enable learning with the existing infrastructure, and address the redundancy in adjacent video frames. Subsequently, videos were segmented into shorter clips based on specific criteria: each clip had to encompass all assembly process activities, including background activities, without exceeding two minutes. This process yielded 620 clips with a total duration of 10 h.
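To make the subsampling step concrete, the following is a minimal sketch, assuming OpenCV and a hypothetical clip file name; it simply keeps every sixth frame of a 30 fps recording to obtain an effective rate of 5 frames per second.

```python
# Minimal subsampling sketch (assumption: OpenCV; "clip_001.mp4" is a hypothetical name).
import cv2

def subsample_video(path: str, source_fps: int = 30, target_fps: int = 5):
    """Read a video and return every (source_fps // target_fps)-th frame."""
    step = source_fps // target_fps          # 30 fps -> 5 fps keeps every 6th frame
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

frames = subsample_video("clip_001.mp4")
```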
Activity labeling and duration in the video clips were crucial for employing supervised machine learning algorithms. The assembly process included nine work activities for assembling the product and placing it on the shelf, as demonstrated in Figure 4.
We developed precise labeling rules to define the start and end moments of each activity, ensuring consistent data labeling, as shown in Figure 5, for activity 1. Each frame was labeled with the corresponding activity, including a background activity label (label 0), with a time resolution of 0.2 s between frames implicitly indicating activity duration.
For data preparation, we split the dataset into training, validation, and test sets: 480 videos for training and 70 videos each for validation and testing. The split was conducted through random sampling with stratification based on worker ID and product type to ensure that the dataset’s population strata were reflected in all subsets, aiming for better model generalization.
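A split of this kind can be implemented with a stratification key combining worker ID and product type. The sketch below assumes scikit-learn and a hypothetical metadata table with worker_id and product_type columns; it reproduces the 480/70/70 proportions in two stages.

```python
# Stratified 480/70/70 split sketch (assumptions: pandas + scikit-learn;
# "clips_metadata.csv" and its column names are hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split

clips = pd.read_csv("clips_metadata.csv")                      # one row per video clip
strata = clips["worker_id"] + "_" + clips["product_type"]      # e.g., "O1_T2"

train_val, test = train_test_split(clips, test_size=70, stratify=strata, random_state=0)
strata_tv = train_val["worker_id"] + "_" + train_val["product_type"]
train, val = train_test_split(train_val, test_size=70, stratify=strata_tv, random_state=0)
```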
The dataset was designed to have a uniform number of samples per worker and product type, which can be seen from Figure 6.
The typical video duration (see Figure 7), with a median value of 52.8 s, varied depending on the worker and product type, ranging from 36.4 to 90.8 s. The yellow highlight in the figure indicates the position of the median value.
However, total assembly time, measured from the start of the first to the end of the last activity, often differed from video duration due to the presence of background activities. These activities are a consequence of variations in the assembly process. A comparison of total assembly cycle time per worker and product type, using the median and interquartile range, is provided in Figure 8.
Finally, the distribution of individual activity durations varies among different workers and product types, as detailed in Figure 9, where “Me” stands for median value.

5. Methodology

The procedure used in the development of models in this paper is shown in Figure 10, which highlights the fact that experiments were conducted with three variable factors: recording viewpoint, feature extraction method from video frames, and model architecture for action segmentation.
As mentioned in the previous section, data from two recording viewpoints were used: data collected by a camera placed above the head of the activity performer (OH) and data from a frame focused on the performer’s hands (HF). The third state of the recording viewpoint factor refers to the fusion of data from both viewpoints, denoted as OH + HF.
The feature extraction approach factor and the final model architecture factor are a direct consequence of the two-step video processing approach commonly used in the field of action segmentation. Although the literature review lists the advantages of end-to-end learning in deep learning models, due to the complexity of video data, the common practice is to process videos in two separate steps: feature extraction, followed by final classification and segmentation.

5.1. Feature Extraction Methods

The purpose of using feature extraction models is to reduce the dimensionality of input data and create descriptive features. The practice of using transfer learning for feature extraction is very common in the context of action segmentation, where all studies from the literature review use some form of transfer learning. Three different approaches were used for feature extraction:
  • Feature extraction from a pretrained model without fine-tuning (FE);
  • Feature extraction from a pretrained model with fine-tuning on our dataset (FT);
  • Feature extraction from a new 2D CNN model trained only on our dataset (TN).
The initial assumption was that transfer learning without fine-tuning (the FE approach) allows for the quick extraction of lower-quality features, in the sense that the final models trained on them would be less effective than models using features obtained by either of the other two approaches.
In all experiments related to feature extraction methods, the input data was preprocessed in such a way that the dimensions of the input video frames were $X^{(i)} \in \mathbb{R}^{224 \times 224 \times 3}$. Also, for experiments related to the FE and FT methods, pixel values in individual channels of images were centered by subtracting the arithmetic means calculated on the ImageNet dataset. The result of feature extraction from image $X^{(i)}$ in each approach was a feature vector $x^{(i)} \in \mathbb{R}^{2048}$.
We conducted the experiment with the FE method using the original ResNet50 model described in [41], which was trained on the ImageNet dataset. Feature extraction was performed by removing the classification layer from the ResNet50 model and making a forward pass through the model for each frame from both viewpoints.
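A minimal sketch of this FE step is given below. It assumes PyTorch and torchvision (the paper does not state which framework was used); replacing the classification layer with an identity returns the 2048-dimensional pooled features for each preprocessed frame.

```python
# FE sketch: frozen ImageNet-pretrained ResNet50 as a feature extractor
# (assumptions: PyTorch + torchvision >= 0.13; frames already resized to 224x224
# and centered with the ImageNet channel means).
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = nn.Identity()        # drop the classification layer -> 2048-d output
resnet.eval()

@torch.no_grad()
def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """batch: (N, 3, 224, 224) preprocessed frames -> (N, 2048) feature vectors."""
    return resnet(batch)
```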
Feature extraction from the pretrained model with fine-tuning on our dataset (FT) was carried out by removing the classification layer from the ResNet50 model. We added three new fully connected layers with 2048, 512, and 10 neurons to the ResNet50 base model, with dropout regularization (a dropout rate of 50%) applied to the output vectors of the first and second fully connected layers. The models were then fine-tuned on our dataset. We trained two models, one on images from the videos recorded from the OH viewpoint and the other on images from the HF viewpoint. The number of parameters in these models was 28,785,162, of which 5,250,570 belonged to the newly added layers. The models were trained using the cross-entropy loss function, with ADAM [42] as the optimization method, a learning rate of $1 \times 10^{-4}$, and a batch size of 64 images. Only the newly added layers were trained. Both models were trained until convergence, determined by an early stopping criterion: training was halted when the ratio of validation loss to training loss rose above a defined threshold, or when the validation loss stopped decreasing over three consecutive epochs. After that, we removed the last two layers and extracted features in a similar way as in the FE method.
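The FT head described above could be sketched as follows, again assuming PyTorch; the ReLU activations between the newly added fully connected layers are an assumption, as the text specifies only the layer sizes and dropout.

```python
# FT sketch: frozen ResNet50 backbone with a trainable 2048-512-10 head
# (assumption: PyTorch; only the new layers receive gradient updates).
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
for p in backbone.parameters():
    p.requires_grad = False                  # train only the newly added layers

head = nn.Sequential(
    nn.Linear(2048, 2048), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(512, 10),                      # 9 work activities + background
)
model = nn.Sequential(backbone, head)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
```

After convergence, dropping the last two fully connected layers leaves the 2048-neuron layer as the feature output, mirroring the extraction procedure of the FE method.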
The last approach to feature extraction (TN) was based on a model that was trained only on images from our dataset. The developed model (see Figure 11) uses a type of residual block in its architecture that was introduced in study [43]. In developing the model, good practices observed in the ResNet50 architecture were used.
The model contains 8,040,662 parameters, about three times fewer than ResNet50. This reduction is reasonable, considering that this model was trained on hundreds of thousands of images, whereas ResNet50 was trained on the much larger ImageNet dataset. Again, we trained two models: one for OH images and the other for HF images. The models were trained using cross-entropy as the loss function and stochastic gradient descent with a momentum factor of 0.9 and a batch size of 64 images. Due to the use of residual blocks with skip connections, a learning rate of $1 \times 10^{-1}$ was chosen to accelerate model convergence. Feature extraction from the developed models followed a similar pattern as in the previous two approaches, the only difference being the preparation of the input images, for which pixel values were scaled to the range [0, 1]. After removing the last fully connected layer of the model, feature vectors of dimension 2048 were calculated for each image.
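For illustration, the pre-activation residual block of [43] that serves as the building element of the TN model can be sketched as below; this is a generic block, not a reproduction of the exact layer counts and channel widths shown in Figure 11, and PyTorch is assumed.

```python
# Pre-activation residual block in the spirit of [43] (assumption: PyTorch;
# channel width and block count of the actual TN model are not reproduced).
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)              # identity skip connection
```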
Feature vectors for the OH + HF viewpoint were obtained by concatenating the calculated feature vectors of both viewpoints for the same image into a combined vector of dimension 4096. These calculated vectors were used as input for action segmentation models.

5.2. Action Segmentation Models

The literature review pointed to three types of architectures that dominate in the action segmentation research. These are architectures based on LSTM/biLSTM, 1D CNN, and Transformers (see Figure 12).
The developed models received input data in the form of a sequence $X_{1:T} = (x_1, x_2, \ldots, x_T)$ of total length $T$. The task of the models was to make a prediction of the activity class occurring at each timestep in the output sequence $\hat{Y}_{1:T} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T)$, where the prediction in the output sequence at timestep $t$ was a vector $\hat{y}_t \in \mathbb{R}^{10}$ for which $\sum_{i=1}^{10} \hat{y}_i^{(t)} = 1$.
We used the random search procedure [44] for the selection of optimal hyperparameters for all the models. The cross-entropy loss function was used for model training. The stochastic gradient descent optimization method with a momentum of 0.9 and a learning rate of $1 \times 10^{-2}$ was used in the case of the biLSTM and 1D CNN models, while ADAM, with a learning rate of $1 \times 10^{-3}$, was used in the case of the Transformer model. The learning rate was selected using the procedure described in paper [45]. The mini-batch size in the experiments was 32.
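The random-search procedure of [44] amounts to repeatedly drawing configurations uniformly from predefined value sets and keeping the best-performing one on the validation set. A minimal sketch, using the biLSTM search space of Table 1 as an example, is shown below; the dictionary keys are hypothetical names.

```python
# Random-search sketch over the biLSTM value sets of Table 1 (standard library only;
# key names are hypothetical).
import random

SEARCH_SPACE = {
    "num_bilstm_layers": [1, 2, 3],
    "neurons_per_gate": [4, 8, 16, 32, 64, 128, 256, 512, 1024],
    "extra_fc_layers": [0, 1],
    "fc_neurons": [128, 256, 512],
    "dropout": [0.0, 0.1, 0.2, 0.5],
}

def sample_config(space: dict) -> dict:
    """Draw one random hyperparameter configuration."""
    return {name: random.choice(values) for name, values in space.items()}

config = sample_config(SEARCH_SPACE)
```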
biLSTM models: Bidirectional Long Short-Term Memory networks are an advanced variant of recurrent neural networks (RNN), adept at capturing long-range dependencies in sequential data. Their design addresses the limitations of traditional RNNs, particularly the vanishing gradient problem, by introducing a gated mechanism with additional paths for gradient flow and learning. Another limitation of recurrent neural networks is that they make predictions at the current timestep based on the current element of the input sequence and all previous hidden states. biLSTM allows for the use of information from the entire sequence at each timestep, a useful property if the application of the model has no limitations in terms of the availability of future timesteps. biLSTM can be implemented using two parallel LSTM layers, each processing the sequence in a different direction. The operations within the biLSTM unit can be mathematically described as follows:
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$ (1)
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$ (2)
$\tilde{c}_t = \tanh(W_{\tilde{c}} x_t + U_{\tilde{c}} h_{t-1} + b_{\tilde{c}})$ (3)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$ (4)
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
In Equations (1)–(6), $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function, which maps inputs to the range [0, 1], and $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ is the hyperbolic tangent function, which outputs values in the range [−1, 1]. $W_{(\cdot)} \in \mathbb{R}^{H \times D}$, $U_{(\cdot)} \in \mathbb{R}^{H \times H}$, and $b_{(\cdot)} \in \mathbb{R}^{H}$ are model parameters, and $\odot$ is the element-wise multiplication operator. $f_t \in \mathbb{R}^{H}$ is a forget gate that decides which information to discard from the cell state $c_t \in \mathbb{R}^{H}$, guided by the sigmoid function. The input gate $i_t \in \mathbb{R}^{H}$ and the modulation gate $\tilde{c}_t \in \mathbb{R}^{H}$ collectively update the cell state with new information. The cell state update, as defined in Equation (5), combines past information with new insights, ensuring relevant information is retained. The output gate $o_t \in \mathbb{R}^{H}$ decides which parts of the cell state contribute to the output at the current timestep $t$. The hidden state $h_t \in \mathbb{R}^{H}$ is then updated to reflect this output.
The hyperparameters of biLSTM models were adjusted as shown in Table 1, with their ranges determined based on the literature and preliminary experiments.
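A minimal PyTorch-style sketch of such a frame-wise biLSTM classifier is given below; the hidden size, depth, and dropout are illustrative values from the Table 1 ranges rather than the selected optimum.

```python
# biLSTM segmenter sketch (assumption: PyTorch; batch-first sequences of
# pre-extracted feature vectors, 10 output classes including background).
import torch
import torch.nn as nn

class BiLSTMSegmenter(nn.Module):
    def __init__(self, feat_dim=2048, hidden=256, layers=2, dropout=0.2, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True,
                            bidirectional=True, dropout=dropout if layers > 1 else 0.0)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                    # x: (batch, T, feat_dim)
        h, _ = self.lstm(x)                  # (batch, T, 2 * hidden)
        return self.classifier(h)            # per-frame logits

model = BiLSTMSegmenter()
logits = model(torch.randn(2, 300, 2048))    # e.g., two 60 s clips at 5 fps
```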
1D CNN models: The convolutional models used in our experiments were based on two types of dilated convolutional blocks, as identified in the research [11,27]. The use of dilated 1D convolutional layers in the architecture allows for efficient modeling of temporal context. Dilation works on the principle of expanding the kernel by adding gaps between individual elements, thus capturing a longer temporal context without increasing the number of parameters, due to the expansion of the receptive field. The first type of block is the dilated residual block, which begins with a dilated convolutional layer with a kernel size of 3 ($\tilde{W}_l \in \mathbb{R}^{3 \times H \times H}$). The dilation factor for layer $l$ is set to $2^{l}$, with the number of kernels $H$ being a hyperparameter adjusted based on experimental results. This approach allows for the capturing of increasingly larger temporal contexts as the model deepens. This block also includes a non-linear transformation using the ReLU activation function, followed by a convolution with a kernel size of 1 ($W_l \in \mathbb{R}^{1 \times H \times H}$) for controlling the model’s representational capacity. A dropout layer is added to mitigate overfitting, along with a skip connection to facilitate the learning process. The operations within this block can be mathematically described as follows:
$\tilde{H}_l = \mathrm{ReLU}(\tilde{W}_l * H_{l-1} + \mathbf{1}\tilde{b}_l^{T})$ (7)
$H_l = H_{l-1} + W_l * \tilde{H}_l + \mathbf{1} b_l^{T}$ (8)
In Equations (7) and (8), $H_{l-1} \in \mathbb{R}^{T \times H}$ and $H_l \in \mathbb{R}^{T \times H}$ are the input and output of dilated residual block $l$; $\tilde{b}_l \in \mathbb{R}^{H}$ and $b_l \in \mathbb{R}^{H}$ are bias terms; $*$ represents the convolution operator; and $\mathbf{1}$ is a vector of ones of dimension $T$. The second type of block is the dual dilated block, consisting of two convolutional branches. In one branch, the dilation factor of the convolutional layer increases with the depth of the model, similar to the dilated residual block. In the other branch, it decreases from the largest value to the smallest, according to the expression $2^{L-l}$, where $L$ is the depth of the last layer and $l$ is the depth of the current layer. The goal is to simultaneously capture both short-term and long-term temporal contexts present in the data. Outputs from these two branches are merged and passed through ReLU activation, with a convolution with a kernel size of 1 used to adjust the dimensionality. Like the dilated residual block, it also includes a dropout layer and a skip connection. The operations are summarized as follows:
$\tilde{H}_{l,1} = \tilde{W}_{l,1} * H_{l-1} + \mathbf{1}\tilde{b}_{l,1}^{T}$ (9)
$\tilde{H}_{l,2} = \tilde{W}_{l,2} * H_{l-1} + \mathbf{1}\tilde{b}_{l,2}^{T}$ (10)
$\tilde{H}_l = \mathrm{ReLU}([\tilde{H}_{l,1}, \tilde{H}_{l,2}])$ (11)
$H_l = H_{l-1} + W_l * \tilde{H}_l + \mathbf{1} b_l^{T}$ (12)
In Equations (9)–(12), $\tilde{W}_{l,1} \in \mathbb{R}^{3 \times H \times H}$ and $\tilde{W}_{l,2} \in \mathbb{R}^{3 \times H \times H}$ are the parameters of the dilated convolution kernels in the branches with increasing and decreasing dilation factors, and $W_l \in \mathbb{R}^{1 \times 2H \times H}$ is the parameter of the convolution with a kernel size of 1. The architecture of the convolutional model containing these two types of blocks (see Figure 12b) can be divided into two parts, which in the literature are called the prediction generation stage and the prediction refinement stage. Preliminary experiments indicated that only one stage of prediction generation was sufficient for models developed on the collected sample. The number of refinement stages was one of the hyperparameters adjusted in the experiments (see Table 2).
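The dilated residual block of Equations (7) and (8) can be sketched as follows; PyTorch is assumed, and the dual dilated block and the refinement stages are omitted for brevity.

```python
# Dilated residual block sketch for Equations (7) and (8) (assumption: PyTorch;
# input shaped (batch, H, T) as expected by Conv1d).
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, dropout: float = 0.2):
        super().__init__()
        self.dilated_conv = nn.Conv1d(channels, channels, kernel_size=3,
                                      dilation=dilation, padding=dilation)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                    # x: (batch, channels, T)
        h = self.relu(self.dilated_conv(x))  # Equation (7)
        h = self.dropout(self.pointwise(h))
        return x + h                         # Equation (8): skip connection
```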
Transformer models: The Transformer architecture offers a distinct approach to handling sequential data. Unlike RNN-based models, the Transformer operates without recurrence, relying on self-attention mechanisms to process entire sequences simultaneously. This design allows for the efficient modeling of long-range dependencies and parallel processing, especially in cases where temporal dynamics span various lengths. Convolutional architectures, despite being parallelizable, struggle to model long-term contexts without stacking multiple layers, which can increase complexity and computational load. As the Transformer lacks recurrent mechanisms, it employs positional encodings to incorporate sequence order. Positional encodings $P \in \mathbb{R}^{T \times D}$ are added to the input sequence to provide the model with information about the relative or absolute position of the elements in the sequence; we use those proposed in the original Transformer paper [46], based on sine and cosine functions. The core of the Transformer is the multihead attention mechanism, which allows the model to focus on different parts of the sequence simultaneously. The attention mechanism is applied independently in parallel heads, allowing the model to capture different aspects of the input sequence. In single-head attention, the input $X \in \mathbb{R}^{T \times D}$ is transformed into three matrices called queries ($Q \in \mathbb{R}^{T \times D_q}$), keys ($K \in \mathbb{R}^{T \times D_k}$), and values ($V \in \mathbb{R}^{T \times D_v}$), and the attention mechanism can be mathematically described as:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V$ (13)
where $QK^{T}$ contains the dot-product similarity score for each query–key pair, calculated to determine how closely related each query (representing a specific segment or focus in the manufacturing assembly sequence) is to all possible keys (representing different parts or aspects of the video sequences). These scores are then scaled down by dividing them by the square root of the dimension of the keys, which helps in stabilizing the gradients during training. Applying softmax normalizes these scores so that they sum to one, forming a probability distribution. This normalization allows the model to focus more on the keys that are more relevant to each query. Once we obtain the attention weights (the output of the softmax function), they are used to weight the values ($V$). In the context of action segmentation, these values represent the detailed content or features of the actions in the video sequences. By weighting these values according to the attention scores, the model selectively focuses on the most informative parts of the video for understanding and classifying the actions in each segment.
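The computation in Equation (13) reduces to a few tensor operations, as the following sketch shows for a single attention head (PyTorch tensors assumed).

```python
# Scaled dot-product attention sketch for Equation (13) (assumption: PyTorch;
# Q, K, V shaped (T, D_q), (T, D_k), (T, D_v) with D_q == D_k).
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))   # (T, T) similarities
    weights = torch.softmax(scores, dim=-1)                    # rows sum to one
    return weights @ V                                         # weighted values
```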
In our study, we aim to explore the effectiveness of a Transformer encoder-only architecture in the task of action segmentation, particularly focusing on its capability without integrating dilated temporal convolutional layers. This choice is motivated by the characteristics of our dataset, which comprises videos with shorter time spans. In such scenarios, the global attention mechanism of the Transformer, unrestricted to local neighborhoods, could potentially capture the temporal dynamics more effectively than when limited to local connectivity.
We are deliberately not including dilated temporal convolutional layers in our architecture. The rationale behind this is to assess the innate ability of the Transformer encoder to handle the action segmentation task in its more original form. This approach will allow us to understand the fundamental strengths and limitations of the Transformer encoder in processing sequential video data for action segmentation. Previous studies in this domain have identified that the original decoder architecture of the Transformer is not particularly well-suited for the nuances of the action segmentation task [29]. By focusing solely on the encoder architecture, we aim to investigate how well the Transformer can model the sequential relationships in action data without the influence of a decoder. Hyperparameters used in experiments are given in Table 3.
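An encoder-only model of this kind could be sketched as below, assuming PyTorch’s built-in Transformer encoder and the sinusoidal positional encodings of [46]; the hyperparameter values are illustrative draws from the Table 3 ranges, not the selected optimum.

```python
# Encoder-only Transformer segmenter sketch (assumption: PyTorch >= 1.9 for
# batch_first; hyperparameter values are illustrative, not the tuned ones).
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(length: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class TransformerSegmenter(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, heads=4, layers=2,
                 ff_dim=512, dropout=0.25, num_classes=10):
        super().__init__()
        self.d_model = d_model
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, dim_feedforward=ff_dim,
                                               dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):                    # x: (batch, T, feat_dim)
        pe = sinusoidal_encoding(x.size(1), self.d_model).to(x.device)
        h = self.encoder(self.proj(x) + pe)
        return self.classifier(h)            # per-frame logits
```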

6. Evaluation

Two metrics were selected for model evaluation on our dataset: frame-wise accuracy and a segmental F1@IoU [12] score with a minimum overlap threshold of 50%, in which the overlap is determined by calculating the ratio between the intersection and union (IoU) of a true and a predicted segment. Accuracy measures the proportion of accurately classified frames in a video relative to the total number of frames. The drawback of this metric is that models of similar accuracy can have large qualitative differences in terms of the recognized segments in the video (a good example of this is given by Figure 5 in [13]). F1@IoU has three properties that make it suitable for evaluating models in the domain of action segmentation: it penalizes over-segmentation errors; it does not penalize minor time shifts that may be due to variability and subjectivity in labeling; and its results depend on the number of activities rather than their duration. The accuracy and F1 metrics are complementary because they penalize different model errors, which is why we selected both as part of the model evaluation procedure.
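For reference, the segmental F1@IoU computation can be sketched as below: frame-wise label sequences are first converted into segments, and predicted segments are then greedily matched to ground-truth segments of the same class at a 50% IoU threshold. Excluding the background label from the segment lists is an assumption about the exact evaluation protocol.

```python
# Segmental F1@50 sketch (plain Python; greedy one-to-one matching; treating the
# background label 0 as excluded is an assumption).
def to_segments(labels):
    """[0, 0, 1, 1, 1, 2] -> [(0, 0, 1), (1, 2, 4), (2, 5, 5)] as (label, start, end)."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start, t - 1))
            start = t
    return segments

def f1_at_iou(pred, gt, threshold=0.5, background=0):
    pred_segs = [s for s in to_segments(pred) if s[0] != background]
    gt_segs = [s for s in to_segments(gt) if s[0] != background]
    matched, tp = set(), 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, None
        for j, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or j in matched:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs) + 1)
            union = (pe - ps + 1) + (ge - gs + 1) - inter
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_iou >= threshold:
            tp += 1
            matched.add(best_j)
    fp, fn = len(pred_segs) - tp, len(gt_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```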

7. Discussion

Viewpoints Impact: Utilizing different viewpoints, specifically Overhead (OH) and Hand-Focused (HF), has been shown to influence model performance. The OH viewpoint provides a comprehensive perspective, capturing large-scale activities, but it may miss finer details critical for complex tasks. Conversely, the HF viewpoint excels in capturing detailed hand movements but may lose sight of the broader context. Combining OH and HF (OH + HF) aims to merge these strengths, leading to a more balanced and comprehensive understanding of activities. This integrated approach improves the F1@IoU scores, as demonstrated by the results in Table 4, where the three best models use this approach, indicating enhanced segmentation precision by incorporating a variety of visual cues.
Feature Extraction Methods: The results outlined in Table 4 reveal that models using the FE method for feature extraction displayed higher frame-wise accuracy but lower F1@IoU scores. This suggests that, while these models accurately classify individual frames, they struggle with precise action boundary delineation. The general features learned from extensive datasets like ImageNet may not provide the specificity needed for accurate temporal segmentation in our dataset. In contrast, the FT method showed significant improvements in F1@IoU scores, reflecting the success of adapting pretrained features to our dataset’s unique characteristics. However, the benefits of the TN method, which involves training a model from scratch on our dataset, are not as clearly defined. The impact on the F1@IoU metric is not a marked improvement over the FT method, especially considering the resource-intensive nature of training a new model, suggesting a delicate balance between dataset specificity and pretrained features.
Action segmentation model type: The Transformer (TFORM) models, designed with an emphasis on the encoder architecture and devoid of dilated temporal convolutional layers, were expected to leverage the global attention mechanism for capturing broad temporal relationships. While their global attention mechanism effectively captured broader temporal relationships, as indicated by their accuracy scores, they fell short in F1@IoU performance. This suggests that the global attention mechanism, while powerful, may not be sufficient for the fine-grained segmentation required in complex action segmentation tasks. The number of parameters in these models, while substantial, did not directly correlate with superior F1@IoU performance, suggesting that architectural choices play a more significant role than model size in this context. A potential reason for this could be the inherent lack of inductive bias in the Transformer architecture, which, when combined with a relatively smaller dataset, could impede the model’s ability to learn fine-grained temporal dynamics essential for high F1@IoU scores. Furthermore, the 1D CNN and biLSTM models demonstrated a commendable balance between accuracy and F1@IoU scores. These architectures’ ability to efficiently handle temporal data, despite having fewer parameters than some Transformer models, underscores the importance of architectural suitability over sheer model size for specific tasks like action segmentation.
Segmentation challenges: By observing the qualitative results in Figure 13, specifically in the column “Worst segmentation results”, it is evident that all model types face challenges in distinguishing background activities (labeled as light yellow) from other actions. The difficulty with background activities could stem from their diverse nature and under-representation in the dataset. The inconsistency in appearance makes it challenging for models to establish stable prediction rules, leading to lower F1@IoU scores in the worst-case scenarios. Conversely, the best-case scenarios showcase instances where background activities are minimal, or the models have learned to segment actions with high precision due to specific contextual features present in those examples.

8. Conclusions

This study successfully demonstrates the application of deep learning action segmentation models to improve time study methods in manufacturing, aligning with the operational demands of Industry 4.0. By combining different viewpoints and utilizing advanced feature extraction methods, we have advanced the precision of action segmentation models, as evidenced by the enhanced F1@IoU scores. This research contributes to the broader context by addressing the pressing need for automated and precise human activity analysis in complex manufacturing environments.
The evaluation of our models reveals a nuanced understanding of the interconnection between model architecture, feature extraction methods, and viewpoints. The fine-tuning method, which adapts pretrained features to a specific dataset, proved to be particularly effective, suggesting that contextualized learning is crucial for temporal segmentation. Our analysis also highlights the limitations inherent in basic Transformer models for action segmentation tasks, as their global attention mechanism does not compensate for the absence of inductive bias, especially when working with limited datasets.
A key limitation of our research lies in the dataset’s scope, which, while diverse, may not capture the full complexity of industrial activities. Furthermore, the under-representation of background activities points to the need for datasets that mirror the variability and unpredictability of real-world manufacturing scenarios more closely.
Moving forward, several avenues for future research emerge. Investigating the scalability of models to larger and more varied datasets, the integration of multi-modal data sources, and the real-time application of these models in manufacturing settings are pivotal. Additionally, exploring the augmentation of basic Transformer models with structural inductive biases could bridge their current performance gap in the context of manufacturing video datasets.
This research contributes to the domain of automated time studies by providing a comprehensive analysis of deep learning-based action segmentation models. The insights garnered here will serve as a motivation for further advancements, ensuring that human efficiency continues to be a cornerstone of productivity in the era of smart industry.

Author Contributions

Conceptualization, M.G. (Mihael Gudlin); Data curation, M.G. (Mihael Gudlin) and M.G. (Matija Golec); Investigation, M.G. (Mihael Gudlin) and M.H.; Methodology, M.G. (Mihael Gudlin) and M.H.; Software, M.G. (Mihael Gudlin) and D.K.; Validation, M.G. (Mihael Gudlin) and D.K.; Writing—original draft, M.G. (Mihael Gudlin), M.G. (Matija Golec) and M.H.; Writing—review and editing, M.G. (Mihael Gudlin) and M.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Regional Development Fund, grant number KK 01.2.1.02.0226.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available. This data can be found here: https://drive.google.com/drive/folders/1NHpEO-vGIkpNCQqaqfFRyyYHMTqYFpaj?usp=sharing.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Romero, D.; Bernus, P.; Noran, O.; Stahre, J.; Fast-Berglund, Å. The Operator 4.0: Human Cyber-Physical Systems & Adaptive Automation towards Human-Automation Symbiosis Work Systems. In Advances in Production Management Systems. Initiatives for a Sustainable World; Nääs, I., Vendrametto, O., Mendes Reis, J., Gonçalves, R.F., Silva, M.T., von Cieminski, G., Kiritsis, D., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 488, pp. 677–686. [Google Scholar] [CrossRef]
  2. Xu, L.D.; Duan, L. Big data for cyber physical systems in industry 4.0: A survey. Enterp. Inf. Syst. 2019, 13, 148–169. [Google Scholar] [CrossRef]
  3. Pfeiffer, S. Robots, Industry 4.0 and Humans, or Why Assembly Work Is More than Routine Work. Societies 2016, 6, 16. [Google Scholar] [CrossRef]
  4. Posada, J.; Zorrilla, M.; Dominguez, A.; Simoes, B.; Eisert, P.; Stricker, D.; Rambach, J.; Döllner, J.; Guevara, M. Graphics and Media Technologies for Operators in Industry 4.0. IEEE Comput. Graph. Appl. 2018, 38, 119–132. [Google Scholar] [CrossRef]
  5. Abdullah, R.; Abdul Rahman, M.d.N.; Salleh Mohd, R. A systematic approach to model human system in cellular manufacturing. J. Adv. Mech. Des. Syst. Manuf. 2019, 13, JAMDSM0001. [Google Scholar] [CrossRef]
  6. Rude, D.J.; Adams, S.; Beling, P.A. Task recognition from joint tracking data in an operational manufacturing cell. J. Intell. Manuf. 2018, 29, 1203–1217. [Google Scholar] [CrossRef]
  7. Jiang, Q.; Liu, M.; Wang, X.; Ge, M.; Lin, L. Human motion segmentation and recognition using machine vision for mechanical assembly operation. SpringerPlus 2016, 5, 1629. [Google Scholar] [CrossRef]
  8. Zhang, S.; Wei, Z.; Nie, J.; Huang, L.; Wang, S.; Li, Z. A Review on Human Activity Recognition Using Vision-Based Method. J. Healthc. Eng. 2017, 2017, 3090343. [Google Scholar] [CrossRef]
  9. Zhu, F.; Shao, L.; Xie, J.; Fang, Y. From handcrafted to learned representations for human action recognition: A survey. Image Vis. Comput. 2016, 55, 42–52. [Google Scholar] [CrossRef]
  10. Ding, G.; Sener, F.; Yao, A. Temporal Action Segmentation: An Analysis of Modern Techniques. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1011–1030. [Google Scholar] [CrossRef]
  11. Li, S.-J.; AbuFarha, Y.; Liu, Y.; Cheng, M.-M.; Gall, J. MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 6647–6658. [Google Scholar] [CrossRef]
  12. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012. [Google Scholar] [CrossRef]
  13. Singh, B.; Marks, T.K.; Jones, M.; Tuzel, O.; Shao, M. A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1961–1970. [Google Scholar] [CrossRef]
  14. Ding, L.; Xu, C. TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation. arXiv 2017, arXiv:1705.07818. [Google Scholar]
  15. Bai, R.; Zhao, Q.; Zhou, S.; Li, Y.; Zhao, X.; Wang, J. Continuous Action Recognition and Segmentation in Untrimmed Videos. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2534–2539. [Google Scholar] [CrossRef]
  16. Lei, P.; Todorovic, S. Temporal Deformable Residual Networks for Action Segmentation in Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6742–6751. [Google Scholar] [CrossRef]
  17. Ma, S.; Sigal, L.; Sclaroff, S. Learning Activity Progression in LSTMs for Activity Detection and Early Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1942–1950. [Google Scholar] [CrossRef]
  18. Bodenstedt, S.; Rivoir, D.; Jenke, A.; Wagner, M.; Breucha, M.; Müller-Stich, B.; Mees, S.T.; Weitz, J.; Speidel, S. Active learning using deep Bayesian networks for surgical workflow analysis. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, 1079–1087. [Google Scholar] [CrossRef]
  19. Jin, Y.; Dou, Q.; Chen, H.; Yu, L.; Qin, J.; Fu, C.W.; Heng, P.A. SV-RCNet: Workflow Recognition from Surgical Videos Using Recurrent Convolutional Network. IEEE Trans. Med. Imaging 2018, 37, 1114–1126. [Google Scholar] [CrossRef] [PubMed]
  20. Yang, K.; Shen, X.; Qiao, P.; Li, S.; Li, D.; Dou, Y. Exploring frame segmentation networks for temporal action localization. J. Vis. Commun. Image Represent. 2019, 61, 296–302. [Google Scholar] [CrossRef]
  21. Yang, H.; He, X.; Porikli, F. Instance-Aware Detailed Action Labeling in Videos. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1577–1586. [Google Scholar] [CrossRef]
  22. Montes, A.; Salvador, A.; Giró-i-Nieto, X. Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks. arXiv 2016, arXiv:1608.08128. [Google Scholar]
  23. Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S.-F. CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  24. Farha, Y.A.; Gall, J. Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3575–3584. [Google Scholar]
  25. Ishikawa, Y.; Kasai, S.; Aoki, Y.; Kataoka, H. Alleviating Over-segmentation Errors by Detecting Action Boundaries. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2321–2330. [Google Scholar] [CrossRef]
  26. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar] [CrossRef]
  27. Ahn, H.; Lee, D. Refining Action Segmentation with Hierarchical Video Representations. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 16282–16290. [Google Scholar] [CrossRef]
  28. Kaku, A.; Liu, K.; Parnandi, A.; Rajamohan, H.R.; Venkataramanan, K.; Venkatesan, A.; Wirtanen, A.; Pandit, N.; Schambra, H.; Fernandez-Granda, C. Sequence-to-Sequence Modeling for Action Identification at High Temporal Resolution. arXiv 2021. [Google Scholar] [CrossRef]
  29. Yi, F.; Wen, H.; Jiang, T. ASFormer: Transformer for Action Segmentation. arXiv 2021, arXiv:2110.08568. [Google Scholar]
  30. Wang, J.; Wang, Z.; Zhuang, S.; Hao, Y.; Wang, H. Cross-enhancement transformer for action segmentation. Multimed Tools Appl. 2023. [Google Scholar] [CrossRef]
  31. Behrmann, N.; Golestaneh, S.A.; Kolter, Z.; Gall, J.; Noroozi, M. Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation. In Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; Volume 13695, pp. 52–68. [Google Scholar] [CrossRef]
  32. Du, D.; Su, B.; Li, Y.; Qi, Z.; Si, L.; Shan, Y. Do We Really Need Temporal Convolutions in Action Segmentation? In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1014–1019. [Google Scholar] [CrossRef]
  33. Makantasis, K.; Doulamis, A.; Doulamis, N.; Psychas, K. Deep learning based human behavior recognition in industrial workflows. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 1609–1613. [Google Scholar] [CrossRef]
  34. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Lalos, C.; Stentoumis, C. Human tracking driven activity recognition in video streams. In Proceedings of the 2016 IEEE International Conference on Imaging Systems and Techniques (IST), Chania, Greece, 4–6 October 2016; pp. 554–559. [Google Scholar] [CrossRef]
  35. Arbab-Zavar, B.; Carter, J.N.; Nixon, M.S. On hierarchical modelling of motion for workflow analysis from overhead view. Mach. Vis. Appl. 2014, 25, 345–359. [Google Scholar] [CrossRef]
  36. Zhang, M.; Hu, H.; Li, Z.; Chen, J. Attention-based encoder-decoder networks for workflow recognition. Multimed Tools Appl. 2021, 80, 34973–34995. [Google Scholar] [CrossRef]
  37. Kang, Z.; Cui, J.; Chu, Z. Manual assembly actions segmentation system using temporal-spatial-contact features. RIA 2023, 43, 509–522. [Google Scholar] [CrossRef]
  38. Voulodimos, A.; Kosmopoulos, D.; Vasileiou, G.; Sardis, E.; Anagnostopoulos, V.; Lalos, C.; Doulamis, A.; Varvarigou, T. A Threefold Dataset for Activity and Workflow Recognition in Complex Industrial Environments. IEEE Multimed. 2012, 19, 42–52. [Google Scholar] [CrossRef]
  39. Rude, D.J.; Adams, S.; Beling, P.A. A Benchmark Dataset for Depth Sensor Based Activity Recognition in a Manufacturing Process. IFAC-PapersOnLine 2015, 48, 668–674. [Google Scholar] [CrossRef]
  40. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar] [CrossRef]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  42. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9908, pp. 630–645. [Google Scholar] [CrossRef]
  44. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  45. Smith, L.N. A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay. arXiv 2018, arXiv:1803.09820. [Google Scholar]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
Figure 1. Manufacturing location.
Figure 2. Four workers in the sample videos.
Figure 3. Different viewpoints of the scene.
Figure 4. Assembly work activities.
Figure 5. Example of criteria for labeling activities in a video clip.
Figure 6. Number of samples per worker and product type.
Figure 7. Distribution of total duration of video samples.
Figure 8. Distribution of assembly cycle time per worker and product type.
Figure 9. Distribution of activity duration.
Figure 10. Methodology resulting in 27 different approaches.
Figure 11. Custom 2D CNN model used for feature extraction.
Figure 12. Architectures used for action segmentation: (a) biLSTM, (b) 1D CNN, and (c) Transformer encoder (hyperparameters adjusted in experiments are shown in parentheses).
Figure 13. Qualitative results for the best approach in terms of the F1@50 metric for each architecture type (* rows in Table 4)—each color labels one activity type.
Table 1. Hyperparameters tuned in experiments with biLSTM architecture.
Label 1 | Hyperparameters | Set of Values Used
M | Number of biLSTM layers | {1, 2, 3}
BiN | Number of neurons per biLSTM gate | {4, 8, 16, 32, 64, 128, 256, 512, 1024}
N | Number of additional fully connected layers | {0, 1}
FN | Number of neurons in additional fully connected layers | {128, 256, 512}
DR | Dropout rate | {0, 0.1, 0.2, 0.5}
1 Labels of hyperparameters used in Figure 12a.
Table 2. Hyperparameters tuned in experiments with 1D CNN architecture.
Label 1 | Hyperparameters | Set of Values Used
D | Number of dilated residual layers | {1, 5, 10}
DD | Number of dual dilated layers | {1, 5, 10}
R | Number of refinement stages | {1, 2, 3, 4, 5}
CK | Number of kernels in each CNN layer | {64, 128}
DR | Dropout rate | {0, 0.1, 0.2, 0.5}
1 Labels of hyperparameters used in Figure 12b.
Table 3. Hyperparameters tuned in experiments with Transformer encoder architecture.
Label 1 | Hyperparameters | Set of Values Used
T | Number of Transformer encoder layers | {1, 2, 8, 16, 32}
NH | Number of heads in multihead attention | {2, 4, 8}
TN | Number of neurons in Transformer hidden layer | {128, 256}
FN | Number of neurons in two-layer feedforward network | {128, 256, 512}
DR | Dropout rate | {0, 0.25, 0.5}
1 Labels of hyperparameters used in Figure 12c.
Table 4. Evaluation of the best models from each of the 27 approaches ranked by F1@50 metric.
Model Type | Feature Type | Viewpoint | Number of Parameters | Accuracy (%) | F1@50 (%)
CONV * | TN | OH + HF | 511,572 | 97.84 | 93.64
CONV | FT | OH + HF | 511,572 | 97.74 | 93.46
biLSTM * | FT | OH + HF | 262,890 | 97.78 | 93.16
CONV | FT | HF | 380,500 | 97.50 | 92.63
CONV | TN | OH | 380,500 | 97.60 | 92.48
biLSTM | FT | HF | 131,818 | 97.52 | 92.45
CONV | TN | HF | 380,500 | 97.41 | 92.33
biLSTM | TN | OH | 135,018 | 97.37 | 92.10
biLSTM | TN | OH + HF | 262,890 | 97.66 | 91.80
CONV | FE | HF | 380,500 | 96.97 | 90.49
biLSTM | TN | HF | 131,818 | 97.24 | 90.39
CONV | FE | OH + HF | 511,572 | 97.00 | 90.34
CONV | FT | OH | 380,500 | 96.75 | 89.85
biLSTM | FE | HF | 264,650 | 96.33 | 88.30
CONV | FE | OH | 627,860 | 95.66 | 86.60
biLSTM | FE | OH + HF | 1,057,674 | 96.18 | 85.59
biLSTM | FT | OH | 131,818 | 95.53 | 84.48
TFORM * | TN | OH | 8,960,778 | 96.14 | 83.50
TFORM | FT | OH + HF | 5,268,234 | 96.45 | 83.17
TFORM | TN | OH + HF | 17,918,730 | 94.41 | 82.83
TFORM | FT | OH | 17,394,442 | 95.95 | 82.37
TFORM | TN | HF | 8,960,778 | 96.06 | 81.62
TFORM | FT | HF | 4,743,946 | 95.70 | 80.47
TFORM | FE | OH + HF | 2,105,610 | 95.17 | 78.03
TFORM | FE | HF | 4,743,946 | 95.42 | 78.02
biLSTM | FE | OH | 25,194,506 | 90.38 | 76.14
TFORM | FE | OH | 4,743,946 | 94.08 | 74.25
* Best models from each architecture type.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gudlin, M.; Hegedić, M.; Golec, M.; Kolar, D. Improving Time Study Methods Using Deep Learning-Based Action Segmentation Models. Appl. Sci. 2024, 14, 1185. https://doi.org/10.3390/app14031185
