1. Introduction
Continuous casting is one of the most common methods of producing metal products. This technique involves pouring liquid metal into a mold and continuously withdrawing the solidified product from the other end. The product can be a billet, a bloom, or a slab, depending on its shape and size. Continuous casting reduces the need for intermediate steps and saves energy and material costs [1]. It is the most frequently used process to cast steel, aluminum, and copper alloys.
Continuous casting involves several stages [2]: ladle treatments, tundish operation, mold filling and solidification, secondary cooling, strand support, cutting, and straightening. Each step requires careful monitoring and control of various parameters, such as temperature [3], pressure, flow rate, composition, liquid level, and drawing speed. Moreover, continuous casting is influenced by many factors that are difficult to measure or model accurately, for example, turbulence and mixing phenomena in the molten metal, heat transfer across different interfaces (metal-mold-water-air), phase transformations, microstructural evolution during solidification, thermal stresses and strains induced by temperature gradients, and metallurgical reactions between metal and slag or refractory materials. The current continuous casting process also involves two phases: manual control and automatic control. Once the molten steel in the mold reaches a certain level, the casting machine is activated, and continuous casting moves from the manual control phase into the automatic control phase. During the manual control phase, however, the casting operator must adjust the stopper position by manually observing the liquid level in the mold.
Anomaly detection of the liquid level in the mold is a crucial technique for improving the quality of steel products in continuous casting. The fluctuation of the liquid level in the mold is closely related to the casting speed, depth, condition of the Submerged Entry Nozzle (SEN), and argon gas injection [4]. Excessive fluctuations in the liquid level in the mold can cause slag entrainment, which leads to inclusions and surface defects on the cast billet. Therefore, monitoring and controlling the liquid level in the mold during casting is vital to ensuring a stable and uniform solidification process [5]. However, conventional methods such as eddy current sensors have limitations in detecting local fluctuations or capturing dynamic changes of the liquid level in the mold, and current deep-learning-based techniques focus heavily on the automatic control phase.
The temporal features of the liquid level differ between the manual and automatic control phases. During the manual control phase, steel enters the mold but does not leave it. Once the continuous casting machine is activated, the drawing machine starts drawing solidified steel out of the mold. During the manual control phase, an anomaly such as a stopper opening error can cause the liquid level in the mold to be higher or lower than intended. To eliminate the anomaly, the casting operator must manually identify the liquid level anomaly and correct it by operating the stopper. Because of the limitations of manual control, liquid level errors in this phase last longer. Moreover, because casting requirements differ, a stopper sequence that is abnormal under one requirement can be correct under another requirement or in a different casting period. The temporal features of anomalous and normal sequences can therefore be so similar that AEs mistake one for the other.
Therefore, advanced techniques based on deep learning are proposed to detect various types of anomalies in time-dependent process parameters and provide timely feedback for quality control.
Figure 1 shows the required components in a continuous casting process. The liquid level in the mold is controlled by both the stopper and the withdrawal unit.
2. Related Work
Research using neural networks to predict and improve the properties and structure of steel has been conducted. Sarda et al. [6] proposed a multi-step anomaly detection strategy based on robust distances for predictive maintenance in steel-making industries; the method achieved good results in detecting anomalies in the steel-making process. Acernese et al. [7] reported the outcome of an industrial research project on data-based anomaly detection in a steel-making production process; the study assesses a fault detection strategy for rotating machines in the hot rolling mill line. Chen et al. [8] discussed a dynamic bulging model that captures the behavior of the 2-D longitudinal domain through interpolation of multiple 1-D moving slices; the model calculates the fluctuations of the liquid level in the mold caused by unsteady bulging of the solidifying shell, which affect the quality of the steel and the stable operation of the continuous steel casting process. Yoon et al. [9] analyzed the Mold Level Hunching (MLH) phenomenon during a thin slab casting process; the mold's liquid level variation and the strand's bulging were measured and analyzed using Fast Fourier Transform (FFT) spectrum analysis, and mold-level hunching and bulging were found to share the same frequency of 0.5 Hz. Zhou et al. [10] proposed a liquid level anomaly detection method called Multi-scale Convolutional Neural Network-Long Short-Term Memory (Multi-scale CNN-LSTM) to detect anomalies in a multi-dimensional time series dataset of the liquid level in the mold collected from actual casting. Khalaj et al. [11] used an Artificial Neural Network (ANN) to predict the passivation current density and potential of microalloyed steels based on experimental data from the potentiodynamic polarization of High-Strength Low-Alloy (HSLA) steels; the developed model showed a good capacity for modeling complex corrosion behavior and could accurately track the experimental data over a wide range of steel chemical compositions, microstructures, temperature ranges, and corrosion cell characteristics.
Time Series Anomaly Detection (TSAD) aims to identify unusual patterns or behaviors in sequential data [12,13,14,15,16,17]. TSAD has applications in various domains, such as smart grids, network security, finance, health care, and social media. However, TSAD is also challenging, as anomalies can have different types, scales, and contexts, and time series data can be noisy, high-dimensional, and non-stationary.
Neural-network-based TSAD has been shown to achieve strong results on various datasets. These methods learn long-term, nonlinear temporal relationships in the data, outperforming existing non-deep methods based on similarity search [18] and density-based clustering [19].
The most popular TSAD framework is the AutoEncoder (AE). Recurrent Neural Network (RNN) AE, Long Short-Term Memory (LSTM) AE, and Gated Recurrent Unit (GRU) AE are three types of AE for TSAD; they use simple recurrent units, LSTM units, and GRUs, respectively, as the encoder and decoder. These three models have different advantages and disadvantages regarding computational efficiency, memory capacity, and gradient flow. AEs can be used for TSAD by training them on normal data and measuring the reconstruction error on new data: a high reconstruction error indicates an anomaly, while a low reconstruction error indicates a normal data point. The AE can also be extended to the Variational AE (VAE) [11], which imposes a probabilistic distribution on the latent space and can generate realistic data samples. Robust AE (RAE) [20], a method inspired by Robust Principal Component Analysis (RPCA) [21], uses an AE and an error matrix to separate the error sequence from the original sequence. The LSTM-AE-ADVanced (LSTM-AE-ADV) method proposed by Kieu et al. suggests that using statistical features to enrich datasets before feeding them into an AE can achieve better results [22]. Geiger et al. [23] proposed TadGAN, an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs); it uses LSTMs as the base model for generators and critics to capture the temporal correlations of time series distributions.
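As background for the methods compared later, the reconstruction-error idea behind AE-based TSAD can be sketched in a few lines of PyTorch. This is a minimal, illustrative GRU autoencoder, not the exact architecture of any cited method; all names and dimensions are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class GRUAutoencoder(nn.Module):
    """Minimal GRU autoencoder for reconstruction-based TSAD (illustrative)."""
    def __init__(self, n_features=1, latent_dim=20):
        super().__init__()
        self.encoder = nn.GRU(n_features, latent_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, n_features)

    def forward(self, x):                      # x: (batch, time, features)
        _, h = self.encoder(x)                 # h: (1, batch, latent_dim)
        # Repeat the final hidden state at every step and decode it back.
        z = h.transpose(0, 1).repeat(1, x.size(1), 1)
        dec, _ = self.decoder(z)
        return self.out(dec)

def anomaly_scores(model, x):
    """Per-window reconstruction error; a high error suggests an anomaly."""
    with torch.no_grad():
        recon = model(x)
    return ((recon - x) ** 2).mean(dim=(1, 2))

model = GRUAutoencoder()
windows = torch.randn(8, 10, 1)                # 8 windows of length 10
scores = anomaly_scores(model, windows)
```

After training on normal data only, windows whose score exceeds a threshold are flagged as anomalous.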
In this paper, a TSAD method for anomaly detection of the liquid level based on error generation and forecasting, called Forecasting and Error Generation AutoEncoder (FEG-AE), is proposed. It integrates the advantages of RAE with the sliding window technique to accelerate the training process. The experimental results show that FEG-AE achieves superior performance and robustness in TSAD.
The key contributions of this paper are as follows:
Propose a new TSAD architecture for anomaly detection of the liquid level in the mold that separates the time series into a normal sequence and an anomaly sequence, addressing the anomalous features of the liquid level during the manual control phase. The architecture is easy to train;
Introduce a new dynamic threshold method to score the TSAD based on the proposed method;
An evaluation conducted on the production dataset demonstrates that the proposed method outperforms four other baselines on the tested dataset.
4. Methodology
The overall process of the proposed method, FEG-AE, is shown in Figure 2. The liquid level sequence data is preprocessed into a differential sequence; a clean series forecasting network then reconstructs the normal data, and an error extraction network extracts the abnormal data from the series. Anomalies are determined by comparing the reconstructed normal data with the original data using a dynamic threshold method. Both networks are trained with a joint training method. This allows the error extraction network to extract anomalies from the differential sequence in the early training stage, preventing the clean series forecasting network's training from being affected by abnormal data.
4.1. Original Issues
The continuous casting machine is not yet activated during the manual control phase. An error in stopper operation can cause anomalies in the liquid level in the mold that remain until the solidified metal starts being withdrawn. The anomalous sequence is therefore relatively long compared with other regular time series data. Moreover, the anomalous region shares similar features with normal regions, because the abnormal liquid level could be regarded as normal at another time in the same casting process or under other casting conditions. Thus, applying an unsupervised learning method such as a traditional RNN-AE results in a high false-positive rate and fails to capture errors in the casting dataset. A more robust approach is therefore needed to detect anomalies in the liquid level in the mold.
The RAE framework proposed by Kieu et al. [20] separates the anomaly sequence from the normal sequence. Although it improves time series detection, it does not solve the overfitting issue. The method also requires a set of normal time series data, which demands manual classification of the training data. Even when RAE is trained on normal data, the issue still occurs because some anomalous data has a pattern similar to the anomaly-free data, as displayed in Figure 3. Unfortunately, this similarity in pattern allows an overtrained AE to reconstruct the anomalous data correctly.
These issues can be solved by applying the following methods:
Introduce an error-generating model to generate the error sequence, instead of directly initializing the error time series as a series filled with zeros and updating it during training. This makes the model more flexible than the RAE and RDAE models, and sliding windows and mini-batched training can be used in the process;
Use a forecasting-based sequence generation model to generate the time sequence, which avoids the overfitting problem shown in Figure 3;
Combine the forecasting network with the error extraction network, which allows the detector to take a sequence's preceding sequences into account when evaluating it, while avoiding the overfitting problem.
4.2. Preprocess
To highlight the anomaly in the data, we use a liquid level differential sequence that represents level changes to replace the liquid level sequence. Such action can significantly shorten the abnormal interval and prevent unsupervised learning methods from overfitting.
Figure 4 shows the liquid level simulation results. A stopper operation mistake causes the liquid level to stay above its normal level for a long duration. Converting the liquid level sequence to a differential sequence shortens the duration of the anomalous sequence. However, the abnormal part still follows the same trend as the normal part and can still easily be mistaken for normal data when using an AE.
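The differential preprocessing step can be illustrated with NumPy's first-difference operator; the level values below are made up for the example.

```python
import numpy as np

def to_differential(level):
    """Convert a liquid-level sequence to its first-difference sequence,
    so a long plateau at a wrong level becomes a short jump at its onset."""
    level = np.asarray(level, dtype=float)
    return np.diff(level)

# Normal filling followed by a stopper error pushing the level too high.
level = [0, 5, 10, 15, 30, 30, 30, 30, 15, 20]
diff = to_differential(level)   # the long anomaly plateau collapses to one jump
```

The long abnormal plateau in `level` becomes a single large positive step and a single large negative step in `diff`, which is exactly the interval-shortening effect described above.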
4.3. Architecture
The whole architecture is shown in Figure 5. It detects anomalous sequences using information gathered from both the sequence currently under detection and the preceding sequences.
A sliding window $W_t$ with size $w$ contains the data from the $t$-th point to the $(t+w-1)$-th point in a time series. A sliding window $W_t^{f}$ contains the first $p$ data points in $W_t$. A sliding window $W_t^{e}$ with size $w-p$ contains the last $w-p$ data points in $W_t$. $W_t$, $W_t^{f}$, and $W_t^{e}$ are defined as follows:
$$W_t = (x_t, \ldots, x_{t+w-1}), \qquad W_t^{f} = (x_t, \ldots, x_{t+p-1}), \qquad W_t^{e} = (x_{t+p}, \ldots, x_{t+w-1}).$$
Sliding window $W_t$ contains a time subsequence of the series $X \in \mathbb{R}^{n \times m}$, where $n$ is the dataset size and $m$ is the feature size. The forecasting network $F$ takes the first $p$ data points in $W_t$ as input and forecasts the remaining time series $\hat{W}_t^{e} \in \mathbb{R}^{(w-p) \times m}$. The error extraction network $G$ takes the entire $W_t$ sequence as input and then outputs the error series $E_t$ and an anomaly-free time sequence $\tilde{W}_t$. $\hat{W}_t^{e}$, $E_t$, $\tilde{W}_t$, and a reconstructed version of the original time series $\hat{X}$ are defined as follows:
$$\hat{W}_t^{e} = F(W_t^{f}), \qquad (\tilde{W}_t, E_t) = G(W_t), \qquad \tilde{W}_t = W_t - E_t, \qquad \hat{X} = \tilde{X} + E,$$
where $\tilde{X}$ and $E$ denote the anomaly-free sequence and the error sequence assembled over all sliding windows.
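The sliding window split can be sketched as follows; the function name and index convention are illustrative, chosen only to mirror the description above.

```python
import numpy as np

def split_window(series, t, w, p):
    """Sliding window of size w starting at index t: the first p points are
    the forecasting network's input, the last w - p points are its targets."""
    W = series[t:t + w]
    return W, W[:p], W[p:]

series = np.arange(20.0)                   # toy time series
W, W_in, W_tgt = split_window(series, t=4, w=10, p=6)
```

Sliding the start index `t` across the series yields the full set of overlapping windows used for mini-batched training.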
The forecasting network and the error extraction network can both be implemented using RNNs.
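A minimal PyTorch sketch of the two networks is given below. The layer sizes and the one-to-many decoding scheme loosely follow the experimental setup described later, but the exact architectural details are assumptions for illustration, not the paper's definitive implementation.

```python
import torch
import torch.nn as nn

class ForecastingNet(nn.Module):
    """Encodes the first part of a window and forecasts the rest (sketch)."""
    def __init__(self, n_features=1, hidden=20, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x_past):                 # x_past: (batch, p, features)
        _, h = self.encoder(x_past)
        step = x_past[:, -1:, :]               # one-to-many decoding
        preds = []
        for _ in range(self.horizon):
            dec, h = self.decoder(step, h)
            step = self.out(dec)
            preds.append(step)
        return torch.cat(preds, dim=1)         # (batch, horizon, features)

class ErrorExtractionNet(nn.Module):
    """Seq2seq net that emits an error sequence for the whole window (sketch)."""
    def __init__(self, n_features=1, hidden=20):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                      # x: (batch, w, features)
        h, _ = self.rnn(x)
        error = self.out(h)
        return error, x - error                # error sequence, cleaned sequence

f_net, e_net = ForecastingNet(), ErrorExtractionNet()
window = torch.randn(8, 10, 1)
forecast = f_net(window[:, :6, :])             # forecast the last 4 from the first 6
err, clean = e_net(window)
```

By construction the error and cleaned sequences sum back to the original window, which is the decomposition the loss terms below enforce.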
The losses $\mathcal{L}_F$, $\mathcal{L}_E$, and $\mathcal{L}$ of the proposed framework are designed as follows. The forecasting loss $\mathcal{L}_F$ is defined by the DIstortion Loss including shApe and TimE (DILATE) [24] and the MSE loss. DILATE is a loss function designed for time series data: it uses Soft Dynamic Time Warping (SoftDTW) to define a shape loss and the Time Distortion Index (TDI) to define a temporal loss. MSE is also used to accelerate the fitting process.
The total loss $\mathcal{L}$ for the entire framework ensures that the generated anomaly-free sequence $\tilde{W}_t$ and error sequence $E_t$ are decomposed from the window $W_t$. Minimizing $\lVert E_t \rVert_1$ keeps the error sequence sparse, and an appropriate $\lambda$ value ensures that this term decreases when a sliding window sequence contains no error. Equation (11) defines the forecasting loss, and Equation (10) defines the error extraction loss.
The optimization target for the framework is defined in Equation (12). The parameters of the forecasting network and of the error extraction network both need to be updated during the training process.
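Under the decomposition described above, the joint objective can be sketched as follows. MSE stands in here for the DILATE shape/time loss, and the λ weighting is an illustrative assumption.

```python
import torch

def feg_ae_loss(forecast, target, error_seq, clean_seq, window, lam=0.1):
    """Sketch of the joint objective: the window should decompose into a
    clean part plus a sparse error part, and the forecast should match the
    clean target. MSE stands in for the DILATE shape/time loss."""
    forecast_loss = torch.mean((forecast - target) ** 2)
    # The clean and error sequences must add back up to the original window.
    decomp_loss = torch.mean((clean_seq + error_seq - window) ** 2)
    # An L1 penalty keeps the extracted error sequence sparse.
    sparsity = lam * error_seq.abs().mean()
    return forecast_loss + decomp_loss + sparsity

forecast = torch.zeros(2, 4, 1)
target = torch.ones(2, 4, 1)
window = torch.zeros(2, 10, 1)
error_seq = torch.zeros(2, 10, 1)
clean_seq = window - error_seq
loss = feg_ae_loss(forecast, target, error_seq, clean_seq, window)
```

With a perfectly decomposed, error-free window as above, only the forecast-mismatch term contributes to the loss.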
The proposed method uses three static features: target level, steel width, and thickness. A vector combines these three static features with FEG-AE output and is fed into a fully connected layer.
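The combination of static and temporal features can be sketched directly; all dimensions below are illustrative, not the paper's exact sizes.

```python
import torch
import torch.nn as nn

# Three static casting features (target level, steel width, thickness) are
# concatenated with the temporal output and passed through an FC layer.
temporal_out = torch.randn(8, 20)          # per-window FEG-AE output (assumed 20-D)
static = torch.randn(8, 3)                 # target level, width, thickness
fc = nn.Linear(20 + 3, 1)
combined = torch.cat([temporal_out, static], dim=1)
prediction = fc(combined)
```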
One issue that arises before detecting the entire time series is detecting the first sequence captured by the sliding window. A backward-directional forecasting network and error extraction network are trained to resolve this issue, so no additional sliding window setup is needed. With the current sliding window method, the first data in the series can be used to forecast the following data, and the last data can be used to generate the preceding data.
4.4. Train the Model
The training algorithm of FEG-AE is shown in Algorithm 1.
Algorithm 1 FEG-AE
Input: Time series captured by sliding windows
Output: Trained forecasting network $F$ and error extraction network $G$
repeat
For every window:
Compute the error sequence and the anomaly-free sequence with the error extraction network;
Forecast the remaining sequence with the forecasting network;
Update both $F$ and $G$ by minimizing the total loss $\mathcal{L}$.
//Update the forecasting network.
Recompute the forecast on the anomaly-free sequence;
Update $F$ by minimizing the forecasting loss $\mathcal{L}_F$.
//Update the error extraction network.
Recompute the error sequence;
Update $G$ by minimizing the error extraction loss $\mathcal{L}_E$.
until convergence
The algorithm first updates the parameters of the forecasting network and the error extraction network together. Then each model is updated separately.
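The alternating update scheme can be sketched with stand-in networks; the tiny linear models and the exact loss forms below are illustrative assumptions, not the paper's actual networks.

```python
import torch
import torch.nn as nn

class TinyForecaster(nn.Module):
    """Stand-in forecasting network: maps the first p points to the last w-p."""
    def __init__(self, p=6, horizon=4):
        super().__init__()
        self.net = nn.Linear(p, horizon)
    def forward(self, x_past):                      # (batch, p, 1)
        return self.net(x_past.squeeze(-1)).unsqueeze(-1)

class TinyErrorNet(nn.Module):
    """Stand-in error extraction network: emits an error sequence."""
    def __init__(self, w=10):
        super().__init__()
        self.net = nn.Linear(w, w)
    def forward(self, x):                           # (batch, w, 1)
        err = self.net(x.squeeze(-1)).unsqueeze(-1)
        return err, x - err

def train_feg_ae(f_net, e_net, batches, p=6, lam=0.1, lr=1e-2):
    """Sketch of Algorithm 1: one joint update of both networks per window
    batch, followed by separate refinement of each network."""
    opt = torch.optim.Adam(list(f_net.parameters()) + list(e_net.parameters()), lr=lr)
    opt_f = torch.optim.Adam(f_net.parameters(), lr=lr)
    opt_e = torch.optim.Adam(e_net.parameters(), lr=lr)
    for x in batches:                               # x: (batch, w, 1)
        # Joint step: forecast the cleaned sequence, keep the error sparse.
        err, clean = e_net(x)
        pred = f_net(clean[:, :p, :])
        loss = ((pred - clean[:, p:, :]) ** 2).mean() + lam * err.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
        # Refine the forecasting network on the (detached) cleaned sequence.
        err, clean = e_net(x)
        pred = f_net(clean[:, :p, :].detach())
        f_loss = ((pred - clean[:, p:, :].detach()) ** 2).mean()
        opt_f.zero_grad(); f_loss.backward(); opt_f.step()
        # Refine the error extraction network: decomposition plus sparsity.
        err, clean = e_net(x)
        e_loss = ((clean + err - x) ** 2).mean() + lam * err.abs().mean()
        opt_e.zero_grad(); e_loss.backward(); opt_e.step()
    return f_net, e_net

batches = [torch.randn(4, 10, 1) for _ in range(3)]
f_net, e_net = train_feg_ae(TinyForecaster(), TinyErrorNet(), batches)
```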
4.5. Dynamic Threshold
Using a fixed threshold to identify anomalies in the forecasting outputs usually produces many false positives in the anomaly-free sequence. The forecasting network captures the general trend of the time series but still yields high deviation values in anomaly-free regions when the Euclidean distance is used to calculate the deviation. The issue is that the original data increases too fast for the forecasted data to catch up. Due to the high slope in those areas, a slight offset results in a high deviation value, as shown in Figure 6b. Hence, a new way of calculating the threshold and identifying errors is introduced to achieve higher precision in the anomaly detection of the liquid level in the mold.
The standard deviation and mean are used to calculate the threshold for every sliding window, denoted $\sigma_i$ and $\mu_i$ for window $i$. The dynamic threshold $\tau_i$ is defined as follows:
$$\tau_i = \alpha \sigma_i + \beta \mu_i.$$
The first item, $\alpha \sigma_i$, is the threshold term that controls the deviation. The second item controls the offset threshold.
Figure 6b shows that the threshold changes according to the window. Compared to the fixed threshold RAE uses, a dynamic threshold guarantees a lower false-positive rate without using post-processing.
The Euclidean distance is used to calculate the differences between the forecasted sequence and the actual sequence in a sliding window.
An anomaly is found if this distance is higher than the corresponding window's threshold, meaning the deviation between the forecasted and original values is larger than the threshold. This identifies anomalies more precisely. A visual comparison between the fixed and dynamic thresholds is shown in Figure 6.
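The per-window dynamic threshold test can be sketched as follows; the scaling hyperparameters `alpha` and `beta` and their values are assumptions for the example.

```python
import numpy as np

def dynamic_threshold_detect(actual, forecast, window=10, alpha=2.0, beta=0.1):
    """Flag each window whose forecast deviation exceeds a threshold built
    from that window's own statistics: alpha scales the standard deviation
    (deviation term) and beta scales the mean (offset term)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    flags = []
    for start in range(0, len(actual) - window + 1, window):
        a = actual[start:start + window]
        f = forecast[start:start + window]
        tau = alpha * np.std(a) + beta * abs(np.mean(a))
        dist = np.linalg.norm(a - f)       # Euclidean distance in the window
        flags.append(bool(dist > tau))
    return flags

# A steep but correctly forecast ramp, followed by a flat phase with a spike.
actual = np.concatenate([np.arange(10) * 3.0, np.zeros(10)])
actual[15] = 30.0                          # injected liquid level spike
forecast = np.concatenate([np.arange(10) * 3.0 - 1.0, np.zeros(10)])
flags = dynamic_threshold_detect(actual, forecast)  # [False, True]
```

The steep ramp window has a large standard deviation and therefore a large threshold, so the small lag of the forecast is not flagged; the flat window's spike exceeds its much smaller threshold.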
5. Experiment Results
5.1. Experiment Setup
Dataset. The experiment uses a dataset collected from the casting process to evaluate the proposed framework and compare the experimental results with those of other methods. Part of the data in this dataset is shown in Figure 7. The dataset contains liquid level in mold measurements (in cm) captured by the sensor.
Architecture. In the experiment, the forecasting network is a GRU-AE model. The encoder is a GRU with a 20-dimensional latent space and two hidden layers, followed by a linear function as a Fully Connected (FC) layer. The decoder is a one-to-many GRU, followed by an FC layer that combines static and temporal features. No dropout is applied. The error extraction network uses a Seq2Seq model with a 20-dimensional latent space and two hidden layers. The sliding window size is set to 10.
Baselines. The experiment compares the proposed framework with several popular and state-of-the-art methods as baselines.
GRU-AE [25]: GRUs train faster and perform better than LSTMs on smaller datasets [26]. The encoder's and decoder's GRU layers both have two hidden layers. The encoder GRU outputs 10-D encoded data and passes it through an FC layer; the decoder takes the encoded 10-D data and passes it through the decoder GRU and an FC layer. Dropout is applied to both the encoder and decoder GRUs;
RAE [20]: Uses an LSTM-AE as the anomaly detector. The hyperparameter λ is set to the value that gave the best experimental result in their practice. A sliding window of size 10 is used to evaluate the anomaly score;
TadGAN [23]: A sliding window of size 10 is used to calculate the area difference for the reconstruction error;
LSTM-AE-Advanced (LSTM-AE-ADV) [22]: Uses a sliding window of size 4 to perform the time series enrichment process.
All the methods above use a fixed threshold group {0.3, 0.2, 0.1, 0.09, 0.07, 0.05, 0.03, 0.01}. For every baseline method, a score is calculated for each threshold, and the highest score is recorded. A sliding window of size 10 is used to calculate F1 scores. Except for RAE, all other methods are implemented with sliding windows of size 10, and the evaluation also uses a sliding window of the same size.
All baseline methods are implemented in Python 3.7 with the PyTorch 1.7.1 library [27] and CUDA 11.0. The Adam optimizer [28] is used to update the frameworks.
5.2. Score Metric
The conventional metrics of precision, recall, and F1-score are used to assess the performance of the various methods. The preferred outcome for end-users is to obtain prompt and precise alarms with few false positives (FP), which may consume time and resources. The following window-based rules are implemented to discourage excessive FPs and encourage prompt and precise alarms: a true positive (TP) is recorded if a predicted window overlaps with a labeled anomalous window; a false negative (FN) is recorded if a labeled anomalous window does not overlap with any predicted window; an FP is recorded if a predicted window does not overlap with any labeled anomalous window. This scheme is also adopted in Hundman's method [29] and Alexander Geiger's method [23].
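The window-based counting rules can be implemented directly; the helper names and (start, end) window convention below are illustrative.

```python
def overlaps(a, b):
    """Two (start, end) windows overlap if neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]

def window_scores(predicted, labeled):
    """Window-based scoring as described above: a predicted window that
    overlaps a labeled anomaly is a TP, a labeled anomaly with no overlapping
    prediction is an FN, and a prediction with no overlapping label is an FP."""
    tp = sum(any(overlaps(p, l) for l in labeled) for p in predicted)
    fp = sum(not any(overlaps(p, l) for l in labeled) for p in predicted)
    fn = sum(not any(overlaps(p, l) for p in predicted) for l in labeled)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted = [(0, 9), (30, 39)]                 # detector output
labeled = [(5, 14), (50, 59)]                  # ground-truth anomalies
p, r, f1 = window_scores(predicted, labeled)   # one TP, one FP, one FN
```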
5.3. Experiment Results
5.3.1. Comparison Experiment
The experiment results are displayed in Table 1. The results show that the F1 score is significantly higher when the FEG-AE method is used on the actual production dataset. This is due to the proposed model's ability to identify error sequences with fewer false negatives and higher precision. In the other AE-based methods, the lower scores are mainly caused by overfitting. The experiment also shows that the proposed method has a slightly lower overall recall rate than the other methods: the dynamic threshold method greatly improves precision, but in some extreme cases it produces an extremely high threshold in a few abnormal areas, where FEG-AE can produce more FNs.
5.3.2. Ablation Experiment
The ablation experiment separates three key components of the FEG-AE method and evaluates them separately. The result is shown in Table 2. Using a forecasting network to reconstruct the liquid level sequence greatly improves precision. But because anomalies last a long time, anomalous data occupies a relatively large portion of the data. During unsupervised training, the forecasting network does not know which data is normal; if an error appears repeatedly in the training set, the network still mistakes it for normal data. The error extraction network alone acts as a regular LSTM-AE and achieves a very low precision of 0.111, mistaking almost every abnormal data point for normal data. Combining the forecasting and error extraction networks and training them with the joint training algorithm in Algorithm 1 significantly improves precision and the F1 score: the error extraction network removes anomalous data from the forecasting network's training set in the early training stage, keeping the forecasting network free from anomaly pollution. The dynamic threshold further improves precision but causes high thresholds in some areas, slightly increasing the FN count and thus slightly lowering recall.
Figure 8 demonstrates the training speed when using single-batch and mini-batch methods. The results show that applying sliding windows and mini-batch training can fully utilize the GPU’s power and achieve better training speeds.
5.3.3. Parameter Experiment
The experiment results in Table 3 show the F1 score when different dynamic threshold settings are used. In the experiment, an appropriate combination of the two threshold parameters achieves the highest F1 score on the actual production dataset.
Figure 9 shows how the forecasting input length affects the forecasting result and the error extraction output. Figure 10 shows the F1 results for different values of this parameter.
The experiment shows that a smaller value results in underfitting: the smaller it is, the less data the forecasting network uses to reconstruct the normal sequence, while the error extraction network still uses the entire window to generate error sequences. The error extraction network is then trained faster than the forecasting network and is more likely to mistake the entire series for an anomaly to reduce the loss in the early stages. By contrast, a larger value results in the error extraction network underfitting. In the experiment, an intermediate setting achieves the best F1 result on the tested dataset.
6. Conclusions
Traditional AE methods fail to identify anomalies in liquid level in mold time series data. The proposed method, FEG-AE, is inspired by RAE and decomposes a liquid level time series into a normal sequence and an anomaly sequence. A forecasting network reconstructs the normal sequence using the previous sequence's features, and an error extraction network extracts the error sequence, keeping the forecasting network free from anomaly pollution and improving precision compared to traditional AEs. A dynamic threshold method is proposed to identify anomalies with higher precision. Compared to a fixed threshold, it significantly improves precision and F1 but slightly reduces recall, because higher thresholds in some windows result in more FNs. FEG-AE achieves the highest precision of 0.895 and an F1 score of 0.872 among the compared methods.
Future work will investigate a more robust way to balance the forecasting network and the error extraction network to reduce the sensitivity to the forecasting-length hyperparameter. A more robust dynamic threshold method is also under investigation to improve recall.