Intelligent Combustion Control of the Hot Blast Stove: A Reinforcement Learning Approach

Yang, Taoran; Guo, Hongwei; Liang, Helan; Yan, Bingji

doi:10.3390/pr11113140

Open AccessArticle

Intelligent Combustion Control of the Hot Blast Stove: A Reinforcement Learning Approach

¹

School of Computer Science & Technology, Soochow University, Suzhou 215006, China

²

School of Iron and Steel, Soochow University, Suzhou 215021, China

^*

Authors to whom correspondence should be addressed.

Processes 2023, 11(11), 3140; https://doi.org/10.3390/pr11113140

Submission received: 7 October 2023 / Revised: 26 October 2023 / Accepted: 30 October 2023 / Published: 2 November 2023

Download

Browse Figures

Versions Notes

Abstract

:

Combustion optimization of hot blast stoves is a promising approach for cost savings and energy conservation of ironmaking. Existing artificial intelligence methods for this optimization rely on air and gas flow meters, which can malfunction under harsh working conditions. To meet this challenge, we propose an intelligent combustion control system based on reinforcement learning (RL). Considering the difficulty of learning state feature representation, five RL models using different deep embedding networks were implemented and evaluated. The Attention-MLP-based RL model is distinguished through experimental testing, achieving an accuracy of 85.91% and an average inference time of 4.85 ms. Finally, the intelligent combustion control system with the Attention-MLP-based RL model runs in the hot blast stove of the blast furnace (1750 m³ in volume) at Tranvic Steel Co., Ltd. in China (Chengdu, China). The results show that our system can achieve good control performance by autonomously learning the implicit relationship between the state of the hot blast stove and the valve control action in the absence of flow meters.

Keywords:

blast furnace; hot blast stove; combustion control; reinforcement learning; embedding network

1. Introduction

Blast Furnace (BF) ironmaking is a critical part of the steel manufacturing process, contributing to over 60% of the production cost of steel products [1]. The hot blast stove is equipped to generate and deliver high-temperature hot air to the BF to meet the thermal requirements of the iron ore reduction [2]. Implementing intelligent combustion of the hot blast stove can bring many benefits, such as increasing blast temperature and reducing energy consumption.

The hot blast stove’s combustion process is extremely complex due to its nonlinearity, sluggish time changing, and hysteresis properties [3,4]. Currently, three types of control methods are widely adopted:

(i): Traditional control methods [4,5,6], such as air/gas cascade control and crossing limit control. Such methods use proportional–integral–derivative (PID) strategies used to control the air/gas adjustment, which are easy to implement. However, it is hard to use classical and general PID strategies for the complex combustion control of the hot blast stove, which suffers from the drawbacks of control lag and excessive control intensity.
(ii): Mathematical modeling methods [7,8,9]. Such methods aim to establish the mathematical model based on the heat balance of the hot blast furnace during the combustion process. Thus, the optimal control decision is self-adaptively made according to this mathematical model. However, it is hard to design accurate mathematical models. Moreover, they generally require a large number of thermal parameters, and the investment in collecting these parameters is huge.
(iii): Artificial intelligence methods [10,11,12], including expert systems and various meta-heuristic algorithms. Such methods use intelligent knowledge to self-adaptively design optimization controllers, becoming one of the most promising control alternatives. However, all existing artificial intelligence control methods require air and gas flow meters to accurately calculate the control parameters, such as the air-to-gas ratio. If these instruments malfunction, the intelligent systems will become paralyzed.

To address these challenges, this study proposes a novel RL-based intelligent combustion control method. This method allows autonomous learning of the implicit relationship between the combustion state and valve adjustment. Considering the difficulty in extracting the features of the combustion state, we designed and evaluated five RL models, each implemented by a different deep embedding network. Finally, to validate the effectiveness of the method, the proposed intelligent combustion control system is deployed in the no. 1 hot blast stove of the no. 7 BF (1750 m³ in volume) at Tranvic Steel Co., Ltd. in China.

By comparing with existing methods, the highlights of the proposed method are as follows:

Our method neither requires air and gas flow meters nor explicit modeling between the valve positions and the measured values of the above instruments. It can use the RL model to autonomously learn the implicit relationship between the combustion state and valve position adjustment parameters.
The historical combustion data of a hot blast stove can be utilized for offline training to obtain expert knowledge. Avoiding interactive trial-and-error in real environments can greatly improve training efficiency and reduce the costs associated with trial-and-error.

2. Formal Model of the Combustion Control Problem

Reinforcement learning is a branch of machine learning that originated in optimal control theory and focuses on sequential decision problems. The ultimate goal of RL is to find an optimal policy that maximizes the cumulative expected return of the decision-making process. This is achieved by continuously interacting with the environment and adjusting the learning agent’s policy for action selection. RL can automatically acquire rules, which solves the challenge of rule acquisition in traditional intelligent optimization methods. Moreover, it offers advantages, such as rapid problem-solving speed and a strong generalization capability.

Before delving into our RL-based problem model, we need to formalize the combustion control process of the hot blast stove. As shown in Figure 1, the operator can adjust the gas/air valves every

μ

seconds, which aligns with the interval of state data collection from the hot blast stove. At each time step t, sensor data

x_{t}

are collected and stored in the database. The operator then adjusts the valves (deemed as action

a_{t}

) based on the accumulated combustion information from the current combustion cycle, denoted as state

s_{t} = {x_{1}, x_{2} \dots, x_{t}}

. After

μ

seconds, i.e., at time

t + 1

, the operator receives data reflecting the new state

s_{t + 1}

of the hot blast stove, along with the reward

r_{t}

that mirrors the control performance. In response to the new state

s_{t + 1}

, the operator performs the new action

a_{t + 1}

. The above process is repeated until the end of the combustion cycle.

It has been shown that the combustion control process of a hot blast stove conforms to a typical RL-based decision-making process [13]. In this process, the operator is replaced with a decision-making agent, and the agent outputs the actions of combustion controls according to its policy in order to meet the combustion optimization needs. To complete the formulation of the problem we address, the states, actions, and rewards of the combustion control process are formalized as follows, so as to guide the agent in constructing its policy.

State: A state is a depiction of the current environment. In the combustion optimization scenario, the state should reflect the combustion circumstances of the hot blast stove at a particular time. So, in this study, we chose five informative and easy-to-obtain parameters: gas pressure, air pressure, position of the air valve, position of the gas valve, and dome temperature. Moreover, we adopted a moving time window strategy to handle the hot blast stove’s continuous and lagging combustion issue. Instead of only taking the present data each time, our strategy entails selecting

ω

continuous data within a time window of size

ω

to construct a state. Accordingly, the state is formulated as follows:

s_{t} = X_{t - ω + 1 : t} = {x_{t - ω + 1}, x_{t - ω + 2}, \dots, x_{t}} \in R^{m \times ω}

, where

x_{t} \in R^{m}

(

m = 5

in this study) denotes the data recorded at time t. Note that for

s_{1}

to

s_{ω - 1}

, which lack data from the previous segment, we fill in with 0.

Figure 2 presents the procedure for state data generation. The figure’s upper section illustrates the raw data gathered from sensors. Each column matches data collected at a particular time step. Take

x_{t} = {(x_{t}^{1}, x_{t}^{2}, \dots, x_{t}^{5})}^{T}

, which denotes the combustion data collected at time t, as an example. These data are comprised of five distinct scalars that correspond to variables such as gas pressure, air pressure, and others. The lower half of the figure displays the reinforcement learning model’s state representation. The state at time t comprises records from time

t - ω + 1

through time t. We can generate states at different moments within a combustion cycle using a similar approach.

Action: When the agent receives the state of the environment, it must choose an action to execute based on its policy

π

, which results in a reward signal. In the context of combustion control for a hot blast stove, the agent must decide the appropriate valve position and adjust it. In this paper, we adopted a classical strategy of maintaining the air valve’s position while regulating the gas valve’s position. It is challenging to learn the precise position by the model directly. Thus, we construct a discrete action space by deciding the adjustment direction of the valve. Consequently, the action space is represented by

A = {Decrease the gas valve position : 0, Do nothing : 1, Increase the gas valve position : 2}

.

Reward: A reward signal is a scalar number that defines the quality of an agent’s actions; it is sent to the agent immediately after each action is performed. By utilizing the reward, the agent can adjust its policy to maximize the total reward over the long run [14]. Consider that combustion optimization control aims to rapidly increase the dome temperature and maintain a high level throughout the combustion cycle. Hence, the reward at time t is calculated as the difference between the dome temperature values at times

t + 1

and t. This is expressed as

r_{t} = T_{t + 1} - T_{t}

, where

T_{t + 1}

and

T_{t}

represent the dome temperatures at time

t + 1

and time t, respectively.

Experience: Whenever an action (

a_{t}

) is chosen in state (

s_{t}

), a corresponding reward (

r_{t}

) is received as feedback, causing the entire system to transition from the state

s_{t}

to the next state

s_{t + 1}

. To keep track of the changes in each metric during the agent–environment interaction, the changes are recorded via a tuple

e_{t} = (s_{t}, a_{t}, r_{t}, s_{t + 1})

, which we call experience. As part of the training process, historical combustion data are processed as described above, forming multiple experiences. These experiences are then stored in a prioritized experience replay buffer [15]. Subsequently, the training procedure retrieves experiences from this replay buffer as training samples, aiming to optimize the agent’s policy.

Objective of the model: For each state, the agent must take an action according to its policy and receive a corresponding reward. Furthermore, due to the stochastic nature of combustion in the hot blast stove, the states and reward signals vary over different combustion cycles. We want the policy of the agent to be robust and generalized enough. This means that the agent’s performance should be optimized and evaluated through multiple complete combination cycles. In summary, the objective of the RL model is to construct a policy that maximizes the expected return, as shown in Equation (1):

\max J (τ) = E_{τ} [\sum_{t = 1}^{N} γ^{t} r_{t}]

(1)

where

J (τ)

is the return averaged over many episodes, and each episode denotes a complete combination cycle. N is the total number of time steps in an episode.

τ = (s_{1}, a_{1}, r_{1}), (s_{2}, a_{2}, r_{2}), \dots, (s_{N}, a_{N}, r_{N})

represents a state, action, and reward sequence from time 1 to time N.

r_{t}

represents the reward received when transitioning from state

s_{t}

to the next state

s_{t + 1}

.

γ \in (0, 1)

is the discount factor, which measures the importance of future rewards. A larger discount factor will result in a more far-sighted agent.

3. Deep Reinforcement Learning Models for Problem-Solving

In this section, we first provide an introduction to the DQN that applies to solving the problem. Next, we explain the structure of the DQN and the loss function employed to optimize its parameters. Lastly, the training procedure is presented.

3.1. DQN for Optimizing Hot Blast Stove Control

Optimizing the hot blast stove control necessitates consideration of the entire combustion cycle. To maximize the objective, the agent must perform the optimal action at each time step. However, greedily selecting the most favorable control action currently will not guarantee that the entire combustion cycle will be optimally controlled. To achieve the optimal control aim, we employ a quantitative measure to determine the effectiveness of control actions throughout the combustion cycle via the Q-value. The Q-value reflects the return that can be obtained by performing a specific action in the current state. It guides the intelligent agent toward selecting the most beneficial long-term action.

As shown in Equation (2), a Q-value is associated with a state–action pair

(s, a)

, which measures the expected discounted return when the action a is taken in the state s.

Q_{(s, a)} = E [\sum_{k = 0}^{\infty} γ^{k} r_{t + k}| s_{t} = s, a_{t} = a]

(2)

Here,

r_{t}, r_{t + 1}, r_{t + 2}

, … is the sequence of rewards that can be obtained by interacting with the environment. However, in real-world industrial applications, the reward sequence is unknown since the subsequent actions do not arise. Therefore, it is necessary to estimate the Q-value. To achieve this, we adopt a reinforcement learning approach based on a deep Q-network (DQN) [14,16]. By pre-training DQN with experience samples, this method can accurately approximate the Q-value without actually executing control actions. Also, by harnessing the capabilities of neural networks, it can accurately derive the estimation of Q-values.

Conventionally, the Q-function used to compute Q-values is approximated and optimized through extensive trial and error. However, the interactive trial and error in real-world industrial environments is expensive and time-consuming [17]. In order to mitigate the risks of trial-and-error exploration of the agents in real-world environments, this paper uses historical combustion data to train the RL model offline, where both temporal difference loss and supervised learning loss are considered.

3.2. Structure of DQN

Our network structure is shown in Figure 3. To estimate the Q-value accurately and efficiently, our network can be mainly divided into two parts: the embedding network f and the multilayer perceptron network (MLP) g. On the one hand, the embedding network f focuses on capturing the dynamic changes in the hot blast stove, thereby obtaining the embedding vector that serves as an abstract representation of the state. On the other hand, the MLP g maps the embedding vector into Q-values, which correspond to the actions for combustion control in the hot blast stove.

Note that, in our model, the embedding network is essential for addressing the challenge of learning state feature representation. Simply put, the state descriptions of hot blast stoves vary across different application settings due to discrepancies in monitoring data collected by various companies. These inconsistencies include variations in data collection time intervals, working parameters, and data types. We aim to improve the learning process for hot blast stove combustion control by structuring and modularizing it through embedding. This approach entails utilizing an embedding vector with predetermined dimensions to signify the state of the hot blast stove. In addition, the neural network can automatically learn the aforementioned embedding vector, resulting in improved learning of the state and a more accurate estimation of Q-values.

We developed three types of embedding networks for this purpose. Below, we offer a comprehensive explanation of these embedding networks.

3.2.1. RNN-Based Embedding Network

The data collected during the control of combustion within a hot blast stove exhibits typical sequential characteristics. To effectively process this sequential data, we initially designed an embedding network based on RNN [18]. The RNN was originally designed for processing sequential data and is suitable for tasks such as time-series forecasting, natural language processing, and speech recognition. It can scan data sequentially, preserving vital information through the use of a context vector. The context vector is updated and propagated along with the elements within the input sequence. Among the RNN variants, the gated recurrent unit (GRU) [19] is particularly suitable as it effectively addresses the vanishing–exploding gradient problem commonly encountered in basic RNNs. Figure 4 shows our RNN-based embedding network, which utilizes GRU to extract the state features of the combustion control process.

For each state input sequence

s_{t} = X_{t - ω + 1 : t} = {x_{t - ω + 1}, x_{t - ω + 2}, \dots, x_{t}}

, we can use a gated recurrent unit with parameters

θ_{g}

to output a list of hidden states

H_{t} = {h_{t - ω + 1}, h_{t - ω + 2}, \dots, h_{t}} \in R^{d \times ω}

, where d is the number of hidden features, each of which is calculated by Equation (3).

h_{t - ω + i} = \{\begin{matrix} GRU (h_{t - ω + i - 1}, x_{t - ω + i}; θ_{g}), & i > 1 \\ GRU (0, x_{t - ω + i}; θ_{g}), & i = 1 \end{matrix}

(3)

Here, GRU(·) is the standard operation of the gated recurrent unit. Each output hidden state

h_{t - ω + i} (i = 1, \dots, ω)

contains the information from time

t - ω + 1

to

t - ω + i

of the combustion cycle.

θ_{g}

denotes the neural network parameters learned through model training. We can flatten all hidden states as the embedding of the input state, as shown in Equation (4).

e_{t} = Flatten (H_{t})

(4)

Alternatively, we can use the final context vector,

h_{t}

, directly as an embedding of the input state, as shown in Equation (5).

e_{t} = h_{t} = GRU (h_{t - 1}, x_{t}; θ_{g})

(5)

3.2.2. Attention-Based Embedding Network

Attention can be seen as a process of adjusting weights dynamically across all input tokens based on data features. This process has been extensively utilized in sequence process models, especially in natural language processing models [20,21]. Figure 5 shows our attention-based embedding network. To begin, the state data

s_{t} = X_{t - ω + 1 : t}

are input into a linear layer, producing

Z_{t} \in R^{d \times ω}

with dimension d in Equation (6). Then, the learnable positional encoding C, inspired by [22], is added to

Z_{t}

to obtain

H_{t}

in Equation (7). Following this, two linear operations allow for the mapping of the hidden state matrix

H_{t} \in R^{d \times ω}

into a key matrix,

K_{t} \in R^{d \times ω}

, and a value matrix,

V_{t} \in R^{d \times ω}

, respectively, as shown in Equations (8) and (9).

Z_{t} = {z_{1}, z_{2}, \dots, z_{ω}} = W_{z} X_{t - ω + 1 : t} + b_{z}

(6)

H_{t} = {h_{1}, h_{2}, \dots, h_{ω}} = Z_{t} + C

(7)

K_{t} = {k_{1}, k_{2}, \dots, k_{ω}} = W_{k} H_{t} + b_{k}

(8)

V_{t} = {v_{1}, v_{2}, \dots, v_{ω}} = W_{v} H_{t} + b_{v}

(9)

where

W_{z} \in R^{d \times m}, C \in R^{d \times ω}, W_{k} \in R^{d \times d}, W_{v} \in R^{d \times d}, b_{z} \in R^{d}, b_{k} \in R^{d}

and

b_{v} \in R^{d}

are network parameters learned through model training.

The attention score of each element in the sequence can be computed in Equation (10).

A_{t} = (α_{1}, α_{2}, \dots, α_{ω}) = Softmax (q^{T} K_{t})

(10)

where the query

q \in R^{d}

denotes network parameters learned through model training, and

α_{i}

is a scalar value obtained by the Softmax function. Finally, the output of the embedding network can be obtained in Equation (11).

e_{t} = \sum_{i = 1}^{ω} α_{i} v_{i}

(11)

3.2.3. CNN-Based Embedding Network

CNN has fruitfully been applied to a diverse array of tasks, such as image processing and time-series forecasting. The elements comprising the output of a CNN layer can be viewed as weighted aggregations of values from their spatial neighbors. These weighting schemes are referred to as kernels or filters. By sequentially scanning across the temporal order of the sequence via such kernels, CNN possesses the ability to capture the intricate temporal relationships within the data [23]. Leveraging this capability, we designed a CNN-based embedding network. As shown in Figure 6, the network consists of several one-dimensional convolutional layers and a flattening layer. An activation function (e.g., rectified linear unit (ReLU)) is inserted between every two convolutional layers for nonlinear mapping. As for the input state data

s_{t} = X_{t - ω + 1 : t}

, the embedding of

s_{t}

is obtained through several Conv-ReLU layers and a flattening layer, whose calculations are shown by Equations (12) and (13).

H_{t}^{l} = ReLU (Conv (H_{t}^{l - 1}; θ_{l}))

(12)

e_{t} = Flatten (H_{t}^{L})

(13)

where

H_{t}^{l} = {h_{t - ω + 1 + l}^{l}, h_{t - ω + 2 + l}^{l}, \dots, h_{t}^{l}}

represents the output of the l-th layer of the convolutional network, initially

H_{t}^{0} = s_{t}

. L denotes the total number of convolutional layers.

θ_{l}

denotes the network parameters of the l-th layer of the convolutional network learned through model training.

e_{t}

denotes the embedding of the input state

s_{t}

. Conv(·), ReLU(·), and Flatten(·) represent the standard convolution, activation, and flatten operations, respectively.

To summarize, to find the most suitable network for our problem, we propose five models for implementing the RL with different embedding network structures. Specifically, the proposed models encompass (1) RNN2-MLP, which uses GRU to obtain all hidden states

(h_{1}, h_{2}, \dots, h_{ω})

as embedding vectors for the MLP; (2) RNN2C-MLP, which is similar to RNN2-MLP, with the difference being that the last hidden state

h_{ω}

is used as the embedding vector and fed into the MLP; (3) Attention-MLP, which uses the multi-head attention mechanism as the embedding network; (4) Conv3-MLP, which has a three-layer convolutional neural network as the embedding network; and (5) MLP, a model that does not consider the embedding network, i.e., it inputs the state directly into the multilayer perceptron. In Section 4, we will evaluate the performances of the different models.

3.3. Loss Function

The object of training our RL-based model is twofold. First, we aim to ensure that the output Q-values of the deep Q-network approximate the expected return. Secondly, we aim to mimic the expert knowledge from the historical combustion data to learn reasonable network parameters for combustion decision-making. To meet these requirements, we design a hybrid loss function as follows.

3.3.1. Temporal Difference Loss

The Q-function can be approximated by learning from experiences, which indirectly shapes the policy. For the experience tuple

(s_{t}, a_{t}, r_{t}, s_{t + 1})

, we can estimate the Q-value of the state–action pair

(s_{t}, a_{t})

using the Q-network with parameters

θ

, which we denote as

Q (s_{t}, a_{t}; θ)

. Alongside the estimated Q-value, we construct an update target

Y_{t}^{n}

using experience tuples spanning from

(s_{t}, a_{t}, r_{t}, s_{t + 1})

to

(s_{t + n}, a_{t + n}, r_{t + n}, s_{t + n + 1})

. This is formulated according to Equation (14). This specific target reflects a more precise Q-value of state–action pair

(s_{t}, a_{t})

. Finally, the parameters

θ

are optimized. This optimization aims to minimize the discrepancy between the estimated Q-values and the previously mentioned target.

Y_{t}^{n} = r_{t} + γ r_{t + 1} + \dots + γ^{n - 1} r_{t + n - 1} + γ^{n} Q (s_{t + n}, a_{t + n}^{m a x}; θ^{'})

(14)

a_{t + n}^{m a x} = \underset{a \in A}{argmax} Q (s_{t + n}, a; θ)

(15)

Here,

r_{t}, \dots, r_{t + n - 1}

are the n-step rewards from state

s_{t}

to state

s_{t + n - 1}

.

γ \in (0, 1)

is the discount factor.

Q (s_{t + n}, a_{t + n}^{m a x}; θ^{'})

represents the Q-value of the state–action pair

(s_{t + n}, a_{t + n}^{m a x})

, estimated by the target network. The target network shares the same architecture as the Q-network but has different parameters, denoted as

θ^{'}

. These parameters

θ^{'}

are periodically synced with the parameters

θ

of the Q-network, aiding in stabilizing the training process.

a_{t + n}^{m a x}

corresponds to the action with the highest Q-value, as estimated by the Q-network. A is the set of actions.

To make a more smoothed absolute error and avoid excessive parameter updates, the n-step temporal difference (TD) loss

L_{T D}

is given in Equation (16). Here,

Y_{t}^{n} - Q (s_{t}, a_{t}; θ)

is called the n-step TD error.

L_{T D} = \{\begin{matrix} 0.5 {(Y_{t}^{n} - Q (s_{t}, a_{t}; θ))}^{2}, & | Y_{t}^{n} - Q (s_{t}, a_{t}; θ) | < 1 \\ | Y_{t}^{n} - Q (s_{t}, a_{t}; θ) | - 0.5, & otherwise \end{matrix}

(16)

3.3.2. Large Margin Classification Loss

Note that historical combustion data are utilized to avoid direct learning from real industrial environments with high trial-and-error costs. However, relying solely on TD loss is not sufficient. This is because the Q-values of actions absent from the training data may be inconsistent with the actual scenario. To alleviate the issue of counterfactual queries [24,25], a supervised learning loss (denoted as the large margin classification loss) is designed. This loss, as formulated in Equation (17), prompts the agent to approximate the policy that is implicitly derived from the historical data. It achieves this by lowering the Q-values of the actions not selected in the historical data.

L_{E} = max_{a \in A} [Q (s_{t}, a; θ) + d (a_{E_{t}}, a)] - Q (s_{t}, a_{E_{t}}; θ)

(17)

d (a_{E_{t}}, a) = \{\begin{matrix} 0, & a_{E_{t}} = a \\ b, & otherwise \end{matrix}

(18)

Here,

L_{E}

is the large margin classification loss.

a_{E_{t}}

is the action performed for state

s_{t}

in the historical combustion data.

b > 0

is the classification margin threshold.

3.3.3. L2 Regularization Loss

To avoid overfitting, the regularization loss is also implemented, as shown in Equation (19).

L_{L_{2}} = {| | θ | |}_{2}^{2}

(19)

This regularization term restricts the diversity of model parameters, resulting in enhanced generalization capabilities of the model.

3.3.4. Combined Losses

Overall, our loss function is the weighted sum of the three mentioned above, as shown in Equation (20).

L = λ_{1} L_{T D} + λ_{2} L_{E} + λ_{3} L_{L_{2}}

(20)

where

λ_{1}

,

λ_{2}

, and

λ_{3}

are the weights of the losses.

3.4. Training and Inference

Although we designed several RL models with different embedding networks, the ultimate objective remains the same—to minimize the loss of the neural network. This can be achieved through training using gradient descent methods. Since the training process is the same, we use a uniform algorithm to describe it.

Algorithm 1 details the training process of our RL models. After initialization (lines 1–5), a batch of experiences is sampled based on their probability to prioritize vital experiences (lines 7–8). For each experience, we compute the gradient of the loss function and update its sampling priority using the temporal difference error (lines 10–12). The parameters of the Q-network undergo an update through gradient descent to minimize the loss function (lines 14–15). Upon reaching the predetermined frequency of updating the target network parameters,

ψ

, the parameters of the Q-network are synchronized with the target network (lines 16–18).

Algorithm 1: Training the RL model for combustion optimization control of the hot blast stove.

After training, the model exhibits decision-making abilities comparable to the ones observed in historical combustion data. This achievement enables its deployment in a production environment. Finally, real-time state data during the online combustion control process are collected and fed into the model, which outputs the required adjustment direction for the gas valve. Based on the preset adjustment step size, the valve’s adjustment amount can be computed, enabling real-time control.

Note that our model optimizes the TD loss and the supervised learning loss simultaneously, which can improve the policy learning efficiency by absorbing expert experience in the historical combustion data. Moreover, by circumventing the direct interaction with the production environment, our model significantly reduces the trial-and-error cost and provides the possibility of engineering applications.

4. Performance Evaluation

4.1. Experimental Settings

4.1.1. Dataset

To train our proposed RL models, we acquired a dataset of historical production data from the no. 1 hot blast stove of the no. 7 1750 m³ BF at Tranvic Steel Co., Ltd. This dataset consisted of records collected every 3 s, and each record consisted of five variables: gas pressure, air pressure, position of the air valve, position of the gas valve, and dome temperature. After excluding abnormal cycles where combustion records during a combustion period were incomplete, we obtained 27 combustion cycles. We then randomly divided these combustion cycles into training and test sets at a 7:3 ratio. Following our state generation principle outlined in Section 2, we generated 47,739 experiences for training our models.

4.1.2. Models Description and Parameters Setting

Five RL models with different network structures are proposed: RNN2-MLP, RNN2C-MLP, Attention-MLP, Conv3-MLP, and MLP. Continuing with the setup described at the end of Section 3.2, we detail the network structures of these models as follows.

Both RNN2-MLP and RNN2C-MLP are two-layer RL models based on RNN and MLP, in which RNN2C-MLP uses a context vector (C is short for the context vector) as the embedding vector, while RNN2-MLP uses flattened hidden states as the embedding vector. The hidden state size of RNN2-MLP is 16, and that of RNN2C-MLP is 32. Both MLPs in the RNN-based network have two hidden layers, each including 64 neurons. Attention-MLP utilizes an RL model based on Attention and MLP. To implement Attention-MLP, we first map the input records to size 64 using a fully connected layer. Following this, we set the positional encoding, query, key, and value (all 64-dimensional). The MLP for Attention-MLP has two hidden layers with 64 and 32 neurons, respectively. The embedding network of Conv3-MLP, which is based on CNN and MLP, comprises 3 convolutional layers, each with a kernel size of 2 and 32 output channels. In MLP, the hidden layer neuron numbers are 128, 64, and 64, respectively.

Table 1 displays the hyperparameter settings. To ensure more stable training, we implement a learning rate annealing strategy, where the learning rate of the i-th training step is updated by

η_{i} = η_{θ} ζ^{i / K}

.

All these models were implemented using Python 3.10, JAX 0.3.5, and Flax 0.4.0, carried out on a computer running a GNU/Linux operating system equipped with an Intel(R) Core(TM) i7-10700 CPU (Intel, Santa Clara, CA, USA) and 16 GB of RAM (Samsung, Yeongtong-gu, Suwon, Republic of Korea).

4.2. Experimental Results

4.2.1. Accuracy of the Models

Considering the stochastic nature of model training, each model in this paper is trained with four different random seeds, and the mean value of metrics is reported. By calculating the proportion of the output action that matches the direction of adjustment of the gas valve in the test set, the accuracy can be determined. Table 2 shows the accuracies of different models on the test set.

It is shown that the accuracy of each model is above 80%. Among them, Attention-MLP has the highest accuracy of 85.91%, while the MLP model has the lowest accuracy of 80.96%. This indicates that the designed embedding network is effective at extracting the combustion state information of the hot blast stove, which can guide the valve adjustment actions.

Figure 7 demonstrates the changes in the statistical metrics during the training process, where the horizontal coordinate represents the training step while the vertical coordinate records the value of the corresponding statistical indicator. It is shown that the loss decreases, and the accuracy of all the models increases as the training progresses. This demonstrates that the models are capable of effectively learning the data patterns.

4.2.2. Efficiency of the Models

Since our model is intended for use in real-time combustion control, it is imperative that training and inference times be short. To ensure this, we recorded the inference times of different models, including the data preprocessing and warm-up times, as presented in Table 3. The results indicate that MLP has the shortest training time due to its simple network structure; this is followed by the Attention-MLP model and the Conv3-MLP model. These two models based on RNN are the most time-consuming. On average, our method completes one inference in less than 7.61 ms, demonstrating its ability to fulfill real-time control requirements.

4.2.3. Discussion

Hot blast stove combustion is a dynamic process that necessitates analyzing present as well as past data to attain optimal control. To balance computational efficiency with control accuracy, we use a sliding time window strategy to create a temporal sequence as our state representation. To investigate the performances of different embedding networks in this context, we measured the performances of five different RL-based combustion optimization control models for hot blast stoves. All models achieve an accuracy rate of more than 80% on the test set, and the average inference time of the models is less than 7.61 ms, which satisfies the realistic combustion control requirements of the hot blast stove in terms of both computational accuracy and efficiency. Among them, the Attention-MLP-based RL model performs the best, achieving the highest accuracy rate (85.91%), with only 4.85 ms for one inference.

In terms of effectiveness, the Attention mechanism allows the neural network to automatically learn the impact of each data item in the combustion temporal sequence. It is highly effective in concentrating on key parts of the input sequences and consolidating them dynamically, which results in better-informed decision-making. Whereas, the information aggregation of other models is less flexible, in which CNNs only aggregate information about their spatial neighbors and RNNs only aggregate information about the front of the input sequence. Therefore, the Attention-MLP-based RL model can have better accuracy.

Regarding efficiency, the positional encoding of the Attention model helps address permutation symmetry, which makes it possible to perform computations on each item in the sequence simultaneously. These computations involve calculating relevance weights for queries and keys and ultimately obtaining output values by weighing input values according to relevance weights. All of these operations can be carried out using matrix multiplication. The above features of attention models can eliminate the need for loops in the time dimension. In contrast, RNNs have to preserve the sequential information of a sequence by scanning it chronologically, which is more time-consuming.

In the field of control optimization, it is widely accepted that ongoing control of systems based on environmental feedback is necessary throughout entire operational cycles. To guide the operation, it is important to consider the overall effect of the control cycle. By combining reasonable problem modeling, we can still incorporate the benefits of the Attention-MLP-based RL model. Thus, we believe that our method can be applied to more control optimization cases.

4.3. Application

Based on the experimental results, the Attention-MLP-based RL model is chosen for the intelligent combustion control system for the hot blast stove. Our system runs in the no. 1 hot blast stove of the no. 7 BF at Tranvic Steel Co., Ltd. The blast furnace has a capacity of 1750 m³ and is equipped with three top combustion hot blast stoves. The hot blast stoves operate in a two-burn and one-delivery mode. Each stove’s combustion cycle lasts approximately 120 min, while the air delivery time is roughly 60 min. Additionally, the dome temperature averages between 1100 °C and 1370 °C. To fulfill operating demands, the hot blast stove has a rudimentary automation system that can promptly gather real-time data on gas pressure and air pressure, among other parameters. The position of the gas valve can be adjusted from 0% to 100%, and it is controllable by manual or computerized means. Our system is designed to determine the optimal gas valve position at the next time step based on real-time combustion data retrieved from the rudimentary automation system. Once a combustion cycle starts, the system will collect data every 3 s to determine the optimal gas valve position and execute it through the rudimentary automation system.

To validate the performance of our system and avoid any incorrect operation, it works with the previously developed expert system for the hot blast stove. When the flow meters work, both systems make control decisions, and the decision made by the expert system is ultimately adopted. Once the flow meters malfunction, which means the expert system does not work due to the lack of parameters, the decision of our RL-based system is adopted. Figure 8 shows the control results of the system when the gas flow meter malfunctions.

As depicted in Figure 8, the gas valve position advocated by our model, indicated by the resplendent yellow line, and that recommended by the expert system, illustrated by the striking cyan line, exhibit a remarkable similarity in their overall trajectory. Notably, within zone A, the gas valve position proposed by our system aligns harmoniously with the expert system’s suggestion. However, in Zone B, due to the failure of the gas flow meter, the dome temperature of the hot blast stove represented by the green line was challenging to maintain under the control of the expert system. In zone C, the system was switched from the expert system to our RL system, and the dome temperature climbed back to the optimal range. It is shown that our RL-based method can autonomously learn the hidden relationship between the combustion state and the valve adjustment action, thus achieving combustion control optimization for hot blast stoves.

5. Conclusions

Combustion optimization of hot blast stoves is critical for reducing steel production costs and conserving energy. However, it can be challenging to apply existing artificial intelligence technologies when flow meters are absent or malfunctioning. To address this challenge, we propose an intelligent combustion control method for hot blast stoves based on reinforcement learning. The main categories of this article are summarized as follows.

(1) The formal model of combustion optimization for a hot blast stove is established, where the states, actions, and rewards of the combustion control process are formalized to guide the reinforcement agents in formulating their policy.

(2) Five RL models (using different deep embedding networks) are implemented to address the difficulty of learning state feature representation. The Attention-MLP-based RL model is distinguished through experimental testing; its accuracy reaches 85.91% and the average inference time is only 4.85 ms.

(3) Our system runs in the no. 1 hot blast stove of the no. 7 BF (1750 m³ in volume) at Tranvic Steel Co., Ltd. The results demonstrate that the system can autonomously learn the implicit relationship between the combustion state and the valve adjustment actions. It performed well even when the flow meters malfunctioned.

Our study may have a few potential areas for refinement. Firstly, this study adopted a strategy of maintaining the position of the air valve while regulating the gas valve. The control action of our model indicates the adjustment direction with simple rules for obtaining the adjustment amount. In future research, we will aim to improve control precision by refining control granularity, such as establishing the exact valve position adjustment values and integrating air valve position control. Secondly, the models we trained are only limited to the hot blast stove where the training data are collected. So, we will explore the possibility of migrating the control policy across different hot blast stoves. This measure has the potential to increase the usability of the model and benefit new hot blast stoves that lack historical combustion data. Despite these limitations, our work is innovative and promising compared to existing research. We will conduct further research to address these limitations.

Author Contributions

Conceptualization, T.Y. and H.L.; methodology, T.Y., H.L., B.Y. and H.G.; software, T.Y.; validation, T.Y.; formal analysis, T.Y. and H.L.; investigation, T.Y. and H.L.; resources, T.Y.; data curation, T.Y.; writing—original draft preparation, T.Y. and H.L.; writing—review and editing, T.Y., H.L., B.Y. and H.G.; visualization, T.Y.; supervision, H.L.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grant nos. 52074185, 61902269, and 51774209; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University (grant number KJS1938).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RL	reinforcement learning
BF	blast furnace
TD	temporal difference
PID	proportional–integral–derivative
DQN	deep Q-network
MLP	multilayer perceptron
RNN	recurrent neural network
GRU	gated recurrent unit
CNN	convolutional neural network
ReLU	rectified linear unit

References

Xu, T.; Ning, X.J.; Wang, G.W.; Liang, W.; Zhang, J.L.; Li, Y.J.; Wang, H.Y.; Jiang, C.H. Combustion characteristics and kinetic analysis of co-combustion between bag dust and pulverized coal. Int. J. Miner. Metall. Mater. 2018, 25, 1412–1422. [Google Scholar] [CrossRef]
Fan, Z.; Friedmann, S.J. Low-carbon production of iron and steel: Technology options, economic assessment, and policy. Joule 2021, 5, 829–862. [Google Scholar] [CrossRef]
Aksyushin, M.; Kalugin, M.; Malikov, G.; Yaroshenko, Y. Improvement of the energy efficiency of hot blast stove performance. In Advanced Methods and Technologies in Metallurgy in Russia; Springer: Cham, Switzerland, 2018; pp. 169–176. [Google Scholar]
Tang, W.; Tang, X.; Wang, W.; Chen, W. Application of fuzzy PID control in the hot stove combustion system. China Instrum. 2013, 1, 51–57. [Google Scholar]
Wang, X.; Yu, X.; Sun, Y. Application of advanced control technology on combustion optimization control system in hot blast stove. In Proceedings of the 2018 Chinese Control and Decision Conference (CCDC), Shenyang, China, 9–11 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5990–5992. [Google Scholar]
Wang, H.; Cao, S.; Dong, Q.; Cui, Y.; Cao, Z.; Xu, S.; Song, X.; Shen, H. Optimization and control of working parameters of hot blast furnace. In Proceedings of the MATEC Web of Conferences, Guangzhou, China, 11 May 2018; EDP Sciences: Les Ulis, France, 2018; Volume 175, p. 02030. [Google Scholar]
Zetterholm, J.; Ji, X.; Sundelin, B.; Martin, P.M.; Wang, C. Model development of a blast furnace stove. Energy Procedia 2015, 75, 1758–1765. [Google Scholar] [CrossRef]
Zetterholm, J.; Ji, X.; Sundelin, B.; Martin, P.M.; Wang, C. Dynamic modelling for the hot blast stove. Appl. Energy 2017, 185, 2142–2150. [Google Scholar] [CrossRef]
Gan, Y.; Yang, F.; Pan, C.; Gao, X.; Lee, C.; Liao, S. Research on operating characteristics of a hot blast stove. In Proceedings of the 2021 IEEE International Conference on Networking, Sensing and Control (ICNSC), Xiamen, China, 3–5 December 2021; IEEE: Piscataway, NJ, USA, 2021; Volume 1, pp. 1–6. [Google Scholar]
Guo, H.W.; Yan, B.J.; Zhang, J.L.; Chen, S.S. Fuzzy control expert system of hot blast stove based on simulation and thermal balance. In Proceedings of the 8th Pacific Rim International Congress on Advanced Materials and Processing, Waikoloa, HI, USA, 4–9 August 2013; Springer: Cham, Switzerland, 2013; pp. 3099–3108. [Google Scholar]
Zhang, C.; Li, Y. A combustion control strategy of hot blast stove based on kernel fuzzy c-means (FCM). Metalurgija 2019, 58, 179–182. [Google Scholar]
Zhai, X.; Chen, M.; Lu, W. Fuel ratio optimization of blast furnace based on data mining. ISIJ Int. 2020, 60, 2471–2476. [Google Scholar] [CrossRef]
Zhang, C.; Li, Y. Markov hot blast stove prediction based on hybrid intelligent algorithm. Metalurgija 2020, 59, 451–454. [Google Scholar]
Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. In Proceedings of the International Conference on Learning Representations, San Juan, PR, USA, 2–4 May 2016; Bengio, Y., LeCun, Y., Eds.; OpenReview.net: Amherst, NY, USA, 2016. [Google Scholar]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Nian, R.; Liu, J.; Huang, B. A review on reinforcement learning: Introduction and applications in industrial process control. Comput. Chem. Eng. 2020, 139, 106886. [Google Scholar] [CrossRef]
Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef] [PubMed]
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1243–1252. [Google Scholar]
Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; et al. Deep q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv 2020, arXiv:2005.01643. [Google Scholar]

Figure 1. The control process of a hot blast stove.

Figure 2. States generated by the combustion data.

Figure 3. Network structure for estimating Q-values.

Figure 4. RNN-based embedding network.

Figure 5. Attention-based embedding network.

Figure 6. CNN-based embedding network.

Figure 7. Statistical results during the training process.

Figure 8. An example of the online operation of our system (A: expert system working as expected; B: gas flow meter failure; C: only our RL system works).

Table 1. Hyperparameter settings.

Hyperparameter	Value	Hyperparameter	Value
Initial learning rate $η_{0}$	0.0001	TD loss weight $λ_{1}$	1
Decay rate of learning rate $ζ$	0.2	Large margin classification loss weight $λ_{2}$	20
Number of iterations K	400,000	L2 loss weight $λ_{3}$	0.02
Batch size B	64	Target network update interval $ψ$	1000
Discount factor $γ$	0.99	Classification margin b	0.8
Time window size $ω$	15

Table 2. Accuracies of different models (mean ± std).

Model	Accuracy (%)	Model	Accuracy (%)
RNN2-MLP	84.76 ± 0.44	RNN2C-MLP	83.65 ± 0.34
Attention-MLP	85.91 ± 0.78	Conv3-MLP	84.33 ± 0.55
MLP	80.96 ± 0.98

Table 3. Time for inference of different models (mean ± std).

Model	Time for Inference (ms)	Model	Time for Inference (ms)
RNN2-MLP	5.45 ± 0.06	RNN2C-MLP	7.61 ± 0.06
Attention-MLP	4.85 ± 0.09	Conv3-MLP	5.28 ± 0.04
MLP	2.66 ± 0.03

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, T.; Guo, H.; Liang, H.; Yan, B. Intelligent Combustion Control of the Hot Blast Stove: A Reinforcement Learning Approach. Processes 2023, 11, 3140. https://doi.org/10.3390/pr11113140

AMA Style

Yang T, Guo H, Liang H, Yan B. Intelligent Combustion Control of the Hot Blast Stove: A Reinforcement Learning Approach. Processes. 2023; 11(11):3140. https://doi.org/10.3390/pr11113140

Chicago/Turabian Style

Yang, Taoran, Hongwei Guo, Helan Liang, and Bingji Yan. 2023. "Intelligent Combustion Control of the Hot Blast Stove: A Reinforcement Learning Approach" Processes 11, no. 11: 3140. https://doi.org/10.3390/pr11113140

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Intelligent Combustion Control of the Hot Blast Stove: A Reinforcement Learning Approach

Abstract

1. Introduction

2. Formal Model of the Combustion Control Problem

3. Deep Reinforcement Learning Models for Problem-Solving

3.1. DQN for Optimizing Hot Blast Stove Control

3.2. Structure of DQN

3.2.1. RNN-Based Embedding Network

3.2.2. Attention-Based Embedding Network

3.2.3. CNN-Based Embedding Network

3.3. Loss Function

3.3.1. Temporal Difference Loss

3.3.2. Large Margin Classification Loss

3.3.3. L2 Regularization Loss

3.3.4. Combined Losses

3.4. Training and Inference

4. Performance Evaluation

4.1. Experimental Settings

4.1.1. Dataset

4.1.2. Models Description and Parameters Setting

4.2. Experimental Results

4.2.1. Accuracy of the Models

4.2.2. Efficiency of the Models

4.2.3. Discussion

4.3. Application

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI