1. Introduction
Blast Furnace (BF) ironmaking is a critical part of the steel manufacturing process, contributing over 60% of the production cost of steel products [1]. The hot blast stove generates and delivers high-temperature hot air to the BF to meet the thermal requirements of iron ore reduction [2]. Implementing intelligent combustion control of the hot blast stove brings many benefits, such as increasing the blast temperature and reducing energy consumption.
The hot blast stove’s combustion process is extremely complex due to its nonlinearity, slow time-varying behavior, and hysteresis [3,4]. Currently, three types of control methods are widely adopted:
- (i) Traditional control methods [4,5,6], such as air/gas cascade control and cross-limiting control. Such methods use proportional–integral–derivative (PID) strategies to regulate the air/gas adjustment and are easy to implement. However, classical, general-purpose PID strategies are ill-suited to the complex combustion control of the hot blast stove and suffer from control lag and excessive control intensity.
- (ii) Mathematical modeling methods [7,8,9]. Such methods establish a mathematical model based on the heat balance of the hot blast stove during the combustion process, so that the optimal control decision can be made adaptively according to this model. However, it is hard to design accurate mathematical models. Moreover, such models generally require a large number of thermal parameters, and the investment in collecting these parameters is substantial.
- (iii) Artificial intelligence methods [10,11,12], including expert systems and various meta-heuristic algorithms. Such methods use intelligent knowledge to adaptively design optimization controllers and have become one of the most promising control alternatives. However, all existing artificial intelligence control methods require air and gas flow meters to accurately calculate control parameters such as the air-to-gas ratio. If these instruments malfunction, the intelligent systems become paralyzed.
To address these challenges, this study proposes a novel RL-based intelligent combustion control method. This method allows autonomous learning of the implicit relationship between the combustion state and valve adjustment. Considering the difficulty of extracting features of the combustion state, we designed and evaluated five RL models, each implemented with a different deep embedding network. Finally, to validate the effectiveness of the method, the proposed intelligent combustion control system was deployed in the no. 1 hot blast stove of the no. 7 BF (1750 m³ in volume) at Tranvic Steel Co., Ltd. in China.
Compared with existing methods, the highlights of the proposed method are as follows:
- Our method requires neither air and gas flow meters nor explicit modeling between the valve positions and the measured values of such instruments. Instead, the RL model autonomously learns the implicit relationship between the combustion state and the valve position adjustments.
- The historical combustion data of a hot blast stove can be utilized for offline training to acquire expert knowledge. Avoiding interactive trial-and-error in the real environment greatly improves training efficiency and reduces the costs associated with trial-and-error.
2. Formal Model of the Combustion Control Problem
Reinforcement learning is a branch of machine learning that originated in optimal control theory and focuses on sequential decision problems. The ultimate goal of RL is to find an optimal policy that maximizes the cumulative expected return of the decision-making process. This is achieved by continuously interacting with the environment and adjusting the learning agent’s policy for action selection. RL can automatically acquire rules, which solves the challenge of rule acquisition in traditional intelligent optimization methods. Moreover, it offers advantages, such as rapid problem-solving speed and a strong generalization capability.
Before delving into our RL-based problem model, we need to formalize the combustion control process of the hot blast stove. As shown in Figure 1, the operator can adjust the gas/air valves every $\Delta T$ seconds, which aligns with the interval of state data collection from the hot blast stove. At each time step $t$, sensor data $x_t$ are collected and stored in the database. The operator then adjusts the valves (deemed as action $a_t$) based on the accumulated combustion information from the current combustion cycle, denoted as state $s_t$. After $\Delta T$ seconds, i.e., at time $t+1$, the operator receives data reflecting the new state $s_{t+1}$ of the hot blast stove, along with the reward $r_t$ that mirrors the control performance. In response to the new state $s_{t+1}$, the operator performs the new action $a_{t+1}$. The above process is repeated until the end of the combustion cycle.
It has been shown that the combustion control process of a hot blast stove conforms to a typical RL-based decision-making process [13]. In this process, the operator is replaced with a decision-making agent, and the agent outputs the actions of combustion control according to its policy in order to meet the combustion optimization needs. To complete the formulation of the problem we address, the states, actions, and rewards of the combustion control process are formalized as follows, so as to guide the agent in constructing its policy.
State: A state is a depiction of the current environment. In the combustion optimization scenario, the state should reflect the combustion circumstances of the hot blast stove at a particular time. So, in this study, we chose five informative and easy-to-obtain parameters: gas pressure, air pressure, position of the air valve, position of the gas valve, and dome temperature. Moreover, we adopted a moving time window strategy to handle the hot blast stove’s continuous and lagging combustion behavior. Instead of only taking the present data each time, our strategy selects the continuous data within a time window of size $w$ to construct a state. Accordingly, the state is formulated as $s_t = [x_{t-w+1}, x_{t-w+2}, \ldots, x_t]$, where $x_t$ denotes the data recorded at time $t$. Note that for $s_1$ to $s_{w-1}$, which lack data from the previous segment, we fill in with 0.
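As a concrete illustration, the sketch below shows one way the moving-window state could be assembled from raw sensor records; the window size, feature ordering, and function names are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def build_state(records: np.ndarray, t: int, window: int = 10) -> np.ndarray:
    """Stack the last `window` sensor records ending at time step t.

    records: array of shape (T, 5) holding [gas pressure, air pressure,
             air valve position, gas valve position, dome temperature]
             for each time step of a combustion cycle.
    Early steps without enough preceding data are zero-padded, as in the text.
    """
    state = np.zeros((window, records.shape[1]), dtype=np.float32)
    start = max(0, t - window + 1)
    chunk = records[start : t + 1]           # data actually available
    state[window - len(chunk):] = chunk      # right-align, zero-pad the front
    return state

# Example: a cycle of 100 steps with the 5 monitored variables
cycle = np.random.rand(100, 5).astype(np.float32)
s_early = build_state(cycle, t=3)    # partially zero-padded window
s_mid = build_state(cycle, t=50)     # fully populated window
```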
Figure 2 presents the procedure for state data generation. The figure’s upper section illustrates the raw data gathered from the sensors; each column matches the data collected at a particular time step. Take $x_t$, which denotes the combustion data collected at time $t$, as an example. These data comprise five distinct scalars that correspond to variables such as gas pressure, air pressure, and others. The lower half of the figure displays the reinforcement learning model’s state representation: the state at time $t$ comprises the records from time $t-w+1$ through time $t$. We can generate states at different moments within a combustion cycle using a similar approach.
Action: When the agent receives the state of the environment, it must choose an action to execute based on its policy $\pi$, which results in a reward signal. In the context of combustion control for a hot blast stove, the agent must decide the appropriate valve position and adjust it. In this paper, we adopt the classical strategy of maintaining the air valve’s position while regulating the gas valve’s position. It is challenging for the model to learn the precise valve position directly. Thus, we construct a discrete action space by deciding the adjustment direction of the gas valve. Consequently, the action space is represented by the discrete set $A$ of adjustment directions.
Reward: A reward signal is a scalar that defines the quality of an agent’s actions; it is sent to the agent immediately after each action is performed. By utilizing the reward, the agent can adjust its policy to maximize the total reward over the long run [14]. Combustion optimization control aims to rapidly increase the dome temperature and maintain a high level throughout the combustion cycle. Hence, the reward at time $t$ is calculated as the difference between the dome temperature values at times $t+1$ and $t$. This is expressed as $r_t = T_{t+1} - T_t$, where $T_{t+1}$ and $T_t$ represent the dome temperatures at time $t+1$ and time $t$, respectively.
Experience: Whenever an action ($a_t$) is chosen in state ($s_t$), a corresponding reward ($r_t$) is received as feedback, causing the entire system to transition from the state $s_t$ to the next state $s_{t+1}$. To keep track of the changes in each metric during the agent–environment interaction, the changes are recorded via a tuple $(s_t, a_t, r_t, s_{t+1})$, which we call an experience. As part of the training process, the historical combustion data are processed as described above, forming multiple experiences. These experiences are then stored in a prioritized experience replay buffer [15]. Subsequently, the training procedure retrieves experiences from this replay buffer as training samples, aiming to optimize the agent’s policy.
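To make the data pipeline concrete, the sketch below converts one logged combustion cycle into experience tuples and stores them in a minimal proportional prioritized buffer. It is a simplified stand-in for the structure of [15] (no sum-tree, no importance weights); the class name, the dome-temperature column index, and the reuse of `build_state` from the earlier sketch are our own illustrative assumptions.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay buffer (simplified for clarity)."""

    def __init__(self, capacity: int = 100_000, alpha: float = 0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, experience, priority: float = 1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(experience)
        self.priorities.append(priority)

    def sample(self, batch_size: int):
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priority(self, idx, td_errors, eps: float = 1e-3):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(float(err)) + eps

def cycle_to_experiences(records, actions, window: int = 10):
    """Turn one logged cycle into (s_t, a_t, r_t, s_{t+1}) experience tuples.

    records: (T, 5) array; the last column is assumed to be the dome temperature.
    actions: length-T sequence of the gas-valve adjustments actually taken.
    """
    experiences = []
    for t in range(len(records) - 1):
        s_t = build_state(records, t, window)        # from the earlier sketch
        s_next = build_state(records, t + 1, window)
        r_t = records[t + 1, -1] - records[t, -1]    # dome-temperature rise
        experiences.append((s_t, actions[t], r_t, s_next))
    return experiences
```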
Objective of the model: For each state, the agent must take an action according to its policy and receive a corresponding reward. Furthermore, due to the stochastic nature of combustion in the hot blast stove, the states and reward signals vary over different combustion cycles. We want the policy of the agent to be robust and sufficiently general. This means that the agent’s performance should be optimized and evaluated over multiple complete combustion cycles. In summary, the objective of the RL model is to construct a policy $\pi$ that maximizes the expected return, as shown in Equation (1):

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=1}^{N} \gamma^{t-1} r_t\right], \quad (1)$$

where $J(\pi)$ is the return averaged over many episodes, and each episode denotes a complete combustion cycle. $N$ is the total number of time steps in an episode. $\tau = (s_1, a_1, r_1, \ldots, s_N, a_N, r_N)$ represents a state, action, and reward sequence from time 1 to time $N$. $r_t$ represents the reward received when transitioning from state $s_t$ to the next state $s_{t+1}$. $\gamma$ is the discount factor, which measures the importance of future rewards; a larger discount factor results in a more far-sighted agent.
3. Deep Reinforcement Learning Models for Problem-Solving
In this section, we first introduce the DQN approach used to solve the problem. Next, we explain the structure of the DQN and the loss function employed to optimize its parameters. Lastly, the training procedure is presented.
3.1. DQN for Optimizing Hot Blast Stove Control
Optimizing the hot blast stove control necessitates consideration of the entire combustion cycle. To maximize the objective, the agent must perform the optimal action at each time step. However, greedily selecting the most favorable control action currently will not guarantee that the entire combustion cycle will be optimally controlled. To achieve the optimal control aim, we employ a quantitative measure to determine the effectiveness of control actions throughout the combustion cycle via the Q-value. The Q-value reflects the return that can be obtained by performing a specific action in the current state. It guides the intelligent agent toward selecting the most beneficial long-term action.
As shown in Equation (2), a Q-value is associated with a state–action pair $(s, a)$ and measures the expected discounted return when the action $a$ is taken in the state $s$:

$$Q^{\pi}(s, a) = \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots \mid s_t = s, a_t = a\right]. \quad (2)$$

Here, $r_t, r_{t+1}, \ldots$ is the sequence of rewards that can be obtained by interacting with the environment. However, in real-world industrial applications, the reward sequence is unknown because the subsequent actions have not yet been taken. Therefore, it is necessary to estimate the Q-value. To achieve this, we adopt a reinforcement learning approach based on a deep Q-network (DQN) [14,16]. By pre-training the DQN with experience samples, this method can approximate the Q-value without actually executing control actions, and by harnessing the representational capability of neural networks, the estimation can be made accurately.
Conventionally, the Q-function used to compute Q-values is approximated and optimized through extensive trial and error. However, interactive trial and error in real-world industrial environments is expensive and time-consuming [17]. In order to mitigate the risks of trial-and-error exploration by the agents in real-world environments, this paper uses historical combustion data to train the RL model offline, where both the temporal difference loss and a supervised learning loss are considered.
3.2. Structure of DQN
Our network structure is shown in
Figure 3. To estimate the Q-value accurately and efficiently, our network can be mainly divided into two parts: the embedding network
f and the multilayer perceptron network (MLP)
g. On the one hand, the embedding network
f focuses on capturing the dynamic changes in the hot blast stove, thereby obtaining the embedding vector that serves as an abstract representation of the state. On the other hand, the MLP
g maps the embedding vector into Q-values, which correspond to the actions for combustion control in the hot blast stove.
Note that, in our model, the embedding network is essential for addressing the challenge of learning state feature representation. Simply put, the state descriptions of hot blast stoves vary across different application settings due to discrepancies in monitoring data collected by various companies. These inconsistencies include variations in data collection time intervals, working parameters, and data types. We aim to improve the learning process for hot blast stove combustion control by structuring and modularizing it through embedding. This approach entails utilizing an embedding vector with predetermined dimensions to signify the state of the hot blast stove. In addition, the neural network can automatically learn the aforementioned embedding vector, resulting in improved learning of the state and a more accurate estimation of Q-values.
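For concreteness, the sketch below shows one plausible way to compose the two parts: an embedding network $f$ and an MLP head $g$ that maps the embedding vector to one Q-value per action. The layer sizes and the number of actions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network = embedding network f followed by an MLP head g."""

    def __init__(self, embedding: nn.Module, embed_dim: int, n_actions: int = 3):
        super().__init__()
        self.f = embedding                      # maps state -> embedding vector
        self.g = nn.Sequential(                 # maps embedding vector -> Q-values
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, window, n_features) -> Q-values: (batch, n_actions)
        return self.g(self.f(state))
```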
We developed three types of embedding networks for this purpose. Below, we offer a comprehensive explanation of these embedding networks.
3.2.1. RNN-Based Embedding Network
The data collected during combustion control of a hot blast stove exhibit typical sequential characteristics. To effectively process these sequential data, we initially designed an embedding network based on an RNN [18]. RNNs were originally designed for processing sequential data and are suitable for tasks such as time-series forecasting, natural language processing, and speech recognition. An RNN scans the data sequentially, preserving vital information in a context vector that is updated and propagated along the elements of the input sequence. Among the RNN variants, the gated recurrent unit (GRU) [19] is particularly suitable, as it effectively alleviates the vanishing/exploding gradient problem commonly encountered in basic RNNs.
Figure 4 shows our RNN-based embedding network, which utilizes GRU to extract the state features of the combustion control process.
For each state input sequence $s_t = [x_{t-w+1}, x_{t-w+2}, \ldots, x_t]$, we can use a gated recurrent unit with parameters $\theta$ to output a list of hidden states $[h_{t-w+1}, h_{t-w+2}, \ldots, h_t]$, $h_i \in \mathbb{R}^{d}$, where $d$ is the number of hidden features, each of which is calculated by Equation (3):

$$h_i = \mathrm{GRU}(x_i, h_{i-1}; \theta). \quad (3)$$

Here, GRU(·) is the standard operation of the gated recurrent unit. Each output hidden state $h_i$ contains the information from time $t-w+1$ to time $i$ of the combustion cycle. $\theta$ denotes the neural network parameters learned through model training. We can flatten all hidden states as the embedding of the input state, as shown in Equation (4):

$$e_t = \mathrm{Flatten}\left([h_{t-w+1}, h_{t-w+2}, \ldots, h_t]\right). \quad (4)$$

Alternatively, we can use the final context vector, $h_t$, directly as the embedding of the input state, as shown in Equation (5):

$$e_t = h_t. \quad (5)$$
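A minimal PyTorch sketch of the two GRU-based variants (flattening all hidden states versus using only the final context vector) might look as follows; the hidden size and window length are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GRUEmbedding(nn.Module):
    """RNN-based embedding network corresponding to Equations (3)-(5)."""

    def __init__(self, n_features: int = 5, hidden: int = 64, use_last_only: bool = False):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.use_last_only = use_last_only

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, window, n_features)
        hidden_states, last = self.gru(state)         # (batch, window, hidden), (1, batch, hidden)
        if self.use_last_only:                        # RNN2C-MLP style: final context vector only
            return last.squeeze(0)                    # (batch, hidden)
        return hidden_states.flatten(start_dim=1)     # RNN2-MLP style: flatten all hidden states

emb = GRUEmbedding()
e = emb(torch.zeros(4, 10, 5))    # -> shape (4, 640) for a window of 10 steps
```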
3.2.2. Attention-Based Embedding Network
Attention can be seen as a process of dynamically adjusting weights across all input tokens based on data features. This process has been extensively utilized in sequence processing models, especially in natural language processing models [20,21].
Figure 5 shows our attention-based embedding network. To begin, the state data $s_t$ are input into a linear layer, producing $H$ with dimension $d$, as shown in Equation (6):

$$H = s_t W_H. \quad (6)$$

Then, the learnable positional encoding $C$, inspired by [22], is added to $H$ to obtain $\tilde{H}$ in Equation (7):

$$\tilde{H} = H + C. \quad (7)$$

Following this, two linear operations allow for the mapping of the hidden state matrix $\tilde{H}$ into a key matrix, $K$, and a value matrix, $V$, respectively, as shown in Equations (8) and (9):

$$K = \tilde{H} W_K, \quad (8)$$
$$V = \tilde{H} W_V, \quad (9)$$

where $W_K$ and $W_V$ are network parameters learned through model training.
The attention score of each element in the sequence can be computed by Equation (10):

$$\alpha = \mathrm{Softmax}\left(\frac{q K^{\top}}{\sqrt{d}}\right), \quad (10)$$

where the query $q$ denotes network parameters learned through model training, and each element $\alpha_i$ of $\alpha$ is a scalar value obtained by the Softmax function. Finally, the output of the embedding network can be obtained by Equation (11):

$$e_t = \alpha V. \quad (11)$$
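The sketch below gives a single-head simplification of this embedding (the Attention-MLP model described later uses multi-head attention); the embedding dimension, window length, and parameter names are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class AttentionEmbedding(nn.Module):
    """Attention-based embedding with a learnable query and positional encoding."""

    def __init__(self, n_features: int = 5, d: int = 64, window: int = 10):
        super().__init__()
        self.proj = nn.Linear(n_features, d)              # Eq. (6): per-step linear projection
        self.pos = nn.Parameter(torch.zeros(window, d))   # Eq. (7): learnable positional encoding C
        self.key = nn.Linear(d, d, bias=False)            # Eq. (8)
        self.value = nn.Linear(d, d, bias=False)          # Eq. (9)
        self.query = nn.Parameter(torch.randn(d))         # learnable query q
        self.d = d

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, window, n_features)
        h = self.proj(state) + self.pos                   # (batch, window, d)
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(self.d)     # (batch, window)
        alpha = torch.softmax(scores, dim=-1)             # Eq. (10)
        return (alpha.unsqueeze(-1) * v).sum(dim=1)       # Eq. (11): (batch, d)

emb = AttentionEmbedding()
e = emb(torch.zeros(4, 10, 5))    # -> shape (4, 64)
```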
3.2.3. CNN-Based Embedding Network
CNNs have been fruitfully applied to a diverse array of tasks, such as image processing and time-series forecasting. The elements comprising the output of a CNN layer can be viewed as weighted aggregations of values from their spatial neighbors, and these weighting schemes are referred to as kernels or filters. By sequentially scanning across the temporal order of the sequence with such kernels, a CNN can capture the intricate temporal relationships within the data [23]. Leveraging this capability, we designed a CNN-based embedding network. As shown in Figure 6, the network consists of several one-dimensional convolutional layers and a flattening layer. An activation function (e.g., the rectified linear unit (ReLU)) is inserted between every two convolutional layers for nonlinear mapping. For the input state data $s_t$, the embedding of $s_t$ is obtained through several Conv-ReLU layers and a flattening layer, whose calculations are shown by Equations (12) and (13):

$$z^{(l)} = \mathrm{ReLU}\left(\mathrm{Conv}\left(z^{(l-1)}; \theta^{(l)}\right)\right), \quad l = 1, \ldots, L, \quad (12)$$
$$e_t = \mathrm{Flatten}\left(z^{(L)}\right), \quad (13)$$

where $z^{(l)}$ represents the output of the $l$-th layer of the convolutional network, initially $z^{(0)} = s_t$. $L$ denotes the total number of convolutional layers. $\theta^{(l)}$ denotes the network parameters of the $l$-th layer of the convolutional network learned through model training. $e_t$ denotes the embedding of the input state $s_t$. Conv(·), ReLU(·), and Flatten(·) represent the standard convolution, activation, and flatten operations, respectively.
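A possible three-layer realization (matching the Conv3-MLP variant named below) is sketched here; the channel count and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvEmbedding(nn.Module):
    """CNN-based embedding network: stacked Conv1d-ReLU layers plus flattening."""

    def __init__(self, n_features: int = 5, channels: int = 32, layers: int = 3):
        super().__init__()
        blocks, in_ch = [], n_features
        for _ in range(layers):   # Eq. (12): Conv -> ReLU, repeated L times
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*blocks)
        self.flatten = nn.Flatten()               # Eq. (13)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, window, n_features); Conv1d expects (batch, channels, length)
        z = self.conv(state.transpose(1, 2))
        return self.flatten(z)                    # (batch, channels * window)

emb = ConvEmbedding()
e = emb(torch.zeros(4, 10, 5))    # -> shape (4, 320)
```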
To summarize, in order to find the most suitable network for our problem, we propose five models implementing the RL agent with different embedding network structures. Specifically, the proposed models encompass (1) RNN2-MLP, which uses a GRU and takes all hidden states $[h_{t-w+1}, \ldots, h_t]$ as the embedding vector for the MLP; (2) RNN2C-MLP, which is similar to RNN2-MLP, the difference being that only the last hidden state $h_t$ is used as the embedding vector fed into the MLP; (3) Attention-MLP, which uses the multi-head attention mechanism as the embedding network; (4) Conv3-MLP, which has a three-layer convolutional neural network as the embedding network; and (5) MLP, a model without an embedding network, i.e., the state is input directly into the multilayer perceptron. In Section 4, we evaluate the performance of the different models.
3.3. Loss Function
The objective of training our RL-based model is twofold. First, we aim to ensure that the output Q-values of the deep Q-network approximate the expected return. Second, we aim to mimic the expert knowledge contained in the historical combustion data so as to learn reasonable network parameters for combustion decision-making. To meet these requirements, we design a hybrid loss function as follows.
3.3.1. Temporal Difference Loss
The Q-function can be approximated by learning from experiences, which indirectly shapes the policy. For the experience tuple $(s_t, a_t, r_t, s_{t+1})$, we can estimate the Q-value of the state–action pair $(s_t, a_t)$ using the Q-network with parameters $\theta$, which we denote as $Q(s_t, a_t; \theta)$. Alongside the estimated Q-value, we construct an update target $y_t$ using experience tuples spanning from $t$ to $t+n$. This is formulated according to Equation (14):

$$y_t = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} Q\left(s_{t+n}, \arg\max_{a \in A} Q(s_{t+n}, a; \theta); \theta^{-}\right). \quad (14)$$

This target reflects a more precise Q-value of the state–action pair $(s_t, a_t)$. Finally, the parameters $\theta$ are optimized to minimize the discrepancy between the estimated Q-values and the aforementioned target.
Here, $r_t, r_{t+1}, \ldots, r_{t+n-1}$ are the n-step rewards from state $s_t$ to state $s_{t+n}$. $\gamma$ is the discount factor. $Q(\,\cdot\,; \theta^{-})$ represents the Q-value of the state–action pair estimated by the target network. The target network shares the same architecture as the Q-network but has different parameters, denoted as $\theta^{-}$. These parameters are periodically synced with the parameters of the Q-network, aiding in stabilizing the training process. $\arg\max_{a \in A} Q(s_{t+n}, a; \theta)$ corresponds to the action with the highest Q-value, as estimated by the Q-network. $A$ is the set of actions.
To obtain a smoothed absolute error and avoid excessive parameter updates, the n-step temporal difference (TD) loss $L_{TD}(\theta)$ is given in Equation (16):

$$L_{TD}(\theta) = \mathcal{L}_{\mathrm{smooth}}\left(\delta_t\right), \qquad \delta_t = y_t - Q(s_t, a_t; \theta), \quad (16)$$

where $\delta_t$ is called the n-step TD error and $\mathcal{L}_{\mathrm{smooth}}(\cdot)$ denotes the smoothed absolute (Huber) error.
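The sketch below computes the n-step target and the smoothed TD loss described above. The Huber form, the discount factor, and n = 3 are illustrative assumptions consistent with the description, not the exact settings used in the paper.

```python
import torch
import torch.nn.functional as F

def n_step_td_loss(q_net, target_net, states, actions, rewards, next_states,
                   gamma: float = 0.99, n: int = 3):
    """n-step TD loss with a target-network bootstrap (Equations (14) and (16)).

    states      : (batch, window, n_features)  -- s_t
    actions     : (batch,) long                -- a_t taken in the historical data
    rewards     : (batch, n)                   -- r_t ... r_{t+n-1}
    next_states : (batch, window, n_features)  -- s_{t+n}
    """
    # Discounted sum of the n intermediate rewards
    discounts = gamma ** torch.arange(n, dtype=torch.float32)
    n_step_return = (rewards * discounts).sum(dim=1)

    with torch.no_grad():
        # Action selected by the online Q-network, evaluated by the target network
        best_a = q_net(next_states).argmax(dim=1, keepdim=True)
        bootstrap = target_net(next_states).gather(1, best_a).squeeze(1)
        target = n_step_return + (gamma ** n) * bootstrap        # Eq. (14)

    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    td_error = target - q_sa                                     # n-step TD error
    loss = F.smooth_l1_loss(q_sa, target)                        # smoothed absolute error, Eq. (16)
    return loss, td_error.detach()
```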
3.3.2. Large Margin Classification Loss
Note that historical combustion data are utilized to avoid direct learning in real industrial environments with high trial-and-error costs. However, relying solely on the TD loss is not sufficient, because the Q-values of actions absent from the training data may be inconsistent with the actual scenario. To alleviate the issue of counterfactual queries [24,25], a supervised learning loss (denoted as the large margin classification loss) is designed. This loss, as formulated in Equation (17), prompts the agent to approximate the policy that is implicitly derived from the historical data by lowering the Q-values of the actions not selected in the historical data:

$$L_{E}(\theta) = \max_{a \in A}\left[Q(s_t, a; \theta) + l(a_{E}, a)\right] - Q(s_t, a_{E}; \theta), \quad (17)$$

where $L_{E}(\theta)$ is the large margin classification loss, $a_{E}$ is the action performed for state $s_t$ in the historical combustion data, and $l(a_{E}, a)$ is a margin function that is 0 when $a = a_{E}$ and equals the classification margin threshold otherwise.
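In the same PyTorch setting, this loss could be sketched as follows; the margin value is an illustrative assumption.

```python
import torch

def large_margin_loss(q_net, states, expert_actions, margin: float = 0.8):
    """Large margin classification loss (Equation (17)).

    Encourages Q(s, a_E) to exceed the Q-values of all non-expert actions by at
    least `margin`, so the learned policy imitates the historical data.
    """
    q = q_net(states)                                  # (batch, n_actions)
    # l(a_E, a): 0 for the expert action, `margin` for every other action
    margins = torch.full_like(q, margin)
    margins.scatter_(1, expert_actions.unsqueeze(1), 0.0)
    q_expert = q.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q + margins).max(dim=1).values - q_expert).mean()
```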
3.3.3. L2 Regularization Loss
To avoid overfitting, an L2 regularization loss is also employed, as shown in Equation (19):

$$L_{2}(\theta) = \|\theta\|_{2}^{2}. \quad (19)$$

This regularization term restricts the magnitude of the model parameters, resulting in enhanced generalization capabilities of the model.
3.3.4. Combined Losses
Overall, our loss function is the weighted sum of the three losses mentioned above, as shown in Equation (20):

$$L(\theta) = \lambda_{1} L_{TD}(\theta) + \lambda_{2} L_{E}(\theta) + \lambda_{3} L_{2}(\theta), \quad (20)$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weights of the losses.
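Putting the pieces together, the overall training loss could be assembled as below, reusing the helper functions from the previous sketches; the loss weights shown are placeholders, not the tuned values used in the paper.

```python
def combined_loss(q_net, target_net, batch, weights=(1.0, 1.0, 1e-5)):
    """Weighted sum of TD, large-margin, and L2 losses (Equation (20))."""
    states, actions, rewards, next_states = batch
    lam_td, lam_e, lam_l2 = weights

    td, td_error = n_step_td_loss(q_net, target_net, states, actions, rewards, next_states)
    margin = large_margin_loss(q_net, states, actions)      # historical actions act as a_E
    l2 = sum((p ** 2).sum() for p in q_net.parameters())    # ||theta||_2^2

    return lam_td * td + lam_e * margin + lam_l2 * l2, td_error
```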
3.4. Training and Inference
Although we designed several RL models with different embedding networks, the ultimate objective remains the same—to minimize the loss of the neural network. This can be achieved through training using gradient descent methods. Since the training process is the same, we use a uniform algorithm to describe it.
Algorithm 1 details the training process of our RL models. After initialization (lines 1–5), a batch of experiences is sampled according to their priorities, so that vital experiences are preferred (lines 7–8). For each experience, we compute the gradient of the loss function and update its sampling priority using the temporal difference error (lines 10–12). The parameters of the Q-network are then updated through gradient descent to minimize the loss function (lines 14–15). Upon reaching the predetermined target network update frequency, the parameters of the Q-network are synchronized to the target network (lines 16–18).
Algorithm 1: Training the RL model for combustion optimization control of the hot blast stove.
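A condensed sketch of this training loop, using the buffer, network, and loss helpers from the earlier sketches, is given below; the batch size, learning rate, and target-update frequency are illustrative assumptions.

```python
import numpy as np
import torch

def collate(samples):
    """Stack (s_t, a_t, n-step rewards, s_{t+n}) tuples into batched tensors."""
    s, a, r, s2 = zip(*samples)
    return (torch.as_tensor(np.stack(s), dtype=torch.float32),
            torch.as_tensor(a, dtype=torch.long),
            torch.as_tensor(np.stack(r), dtype=torch.float32),
            torch.as_tensor(np.stack(s2), dtype=torch.float32))

def train(q_net, target_net, buffer, steps=10_000, batch_size=32,
          lr=1e-4, target_update_every=500):
    """Offline training loop following the structure of Algorithm 1 (simplified)."""
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    target_net.load_state_dict(q_net.state_dict())           # initialization (lines 1-5)

    for step in range(1, steps + 1):
        idx, samples = buffer.sample(batch_size)              # prioritized sampling (lines 7-8)
        batch = collate(samples)

        loss, td_error = combined_loss(q_net, target_net, batch)
        buffer.update_priority(idx, td_error)                 # refresh priorities (lines 10-12)

        optimizer.zero_grad()                                 # gradient descent (lines 14-15)
        loss.backward()
        optimizer.step()

        if step % target_update_every == 0:                   # sync target network (lines 16-18)
            target_net.load_state_dict(q_net.state_dict())
```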
After training, the model exhibits decision-making abilities comparable to the ones observed in historical combustion data. This achievement enables its deployment in a production environment. Finally, real-time state data during the online combustion control process are collected and fed into the model, which outputs the required adjustment direction for the gas valve. Based on the preset adjustment step size, the valve’s adjustment amount can be computed, enabling real-time control.
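For deployment, the inference step described above could look roughly like the following; the three-way action encoding and the step size are illustrative assumptions.

```python
import torch

def infer_valve_adjustment(q_net, state, step_size: float = 1.0):
    """Map the current state to a signed gas-valve adjustment amount.

    Assumes three discrete actions encoded as 0 = decrease, 1 = hold, 2 = increase.
    """
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    direction = int(q_values.argmax(dim=1)) - 1     # -> -1, 0, or +1
    return direction * step_size                    # amount to move the gas valve
```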
Note that our model optimizes the TD loss and the supervised learning loss simultaneously, which can improve the policy learning efficiency by absorbing expert experience in the historical combustion data. Moreover, by circumventing the direct interaction with the production environment, our model significantly reduces the trial-and-error cost and provides the possibility of engineering applications.
5. Conclusions
Combustion optimization of hot blast stoves is critical for reducing steel production costs and conserving energy. However, it can be challenging to apply existing artificial intelligence technologies when flow meters are absent or malfunctioning. To address this challenge, we propose an intelligent combustion control method for hot blast stoves based on reinforcement learning. The main contributions of this article are summarized as follows.
(1) The formal model of combustion optimization for a hot blast stove is established, where the states, actions, and rewards of the combustion control process are formalized to guide the reinforcement agents in formulating their policy.
(2) Five RL models (using different deep embedding networks) are implemented to address the difficulty of learning state feature representation. The Attention-MLP-based RL model is distinguished through experimental testing; its accuracy reaches 85.91% and the average inference time is only 4.85 ms.
(3) Our system runs in the no. 1 hot blast stove of the no. 7 BF (1750 m³ in volume) at Tranvic Steel Co., Ltd. The results demonstrate that the system can autonomously learn the implicit relationship between the combustion state and the valve adjustment actions. It performed well even when the flow meters malfunctioned.
Our study has a few potential areas for refinement. Firstly, this study adopted a strategy of maintaining the position of the air valve while regulating the gas valve, and the control action of our model only indicates the adjustment direction, with simple rules used to obtain the adjustment amount. In future research, we aim to improve control precision by refining the control granularity, such as predicting exact valve position adjustment values and integrating air valve position control. Secondly, the models we trained are limited to the hot blast stove from which the training data were collected, so we will explore the possibility of migrating the control policy across different hot blast stoves. This has the potential to increase the usability of the model and benefit new hot blast stoves that lack historical combustion data. Despite these limitations, our work is innovative and promising compared to existing research, and we will conduct further research to address these limitations.