Article

State-Space Compression for Efficient Policy Learning in Crude Oil Scheduling

Nan Ma, Hongqi Li and Hualin Liu
1 School of Information Science and Engineering, China University of Petroleum, Beijing 102249, China
2 Petrochina Planning and Engineering Institute, Beijing 100083, China
3 Key Laboratory of Oil & Gas Business Chain Optimization, CNPC, Beijing 100083, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(3), 393; https://doi.org/10.3390/math12030393
Submission received: 17 December 2023 / Revised: 14 January 2024 / Accepted: 19 January 2024 / Published: 25 January 2024

Abstract
The imperative for swift and intelligent decision making in production scheduling has intensified in recent years. Deep reinforcement learning, akin to human cognitive processes, has heralded advancements in complex decision making and has found applicability in the production scheduling domain. Yet, its deployment in industrial settings is marred by large state spaces, protracted training times, and challenging convergence, necessitating a more efficacious approach. Addressing these concerns, this paper introduces an innovative, accelerated deep reinforcement learning framework—VSCS (Variational Autoencoder for State Compression in Soft Actor–Critic). The framework adeptly employs a variational autoencoder (VAE) to condense the expansive high-dimensional state space into a tractable low-dimensional feature space, subsequently leveraging these features to refine policy learning and augment the policy network’s performance and training efficacy. Furthermore, a novel methodology to ascertain the optimal dimensionality of these low-dimensional features is presented, integrating feature reconstruction similarity with visual analysis to facilitate informed dimensionality selection. This approach, rigorously validated within the realm of crude oil scheduling, demonstrates significant improvements over traditional methods. Notably, the convergence rate of the proposed VSCS method shows a remarkable increase of 77.5%, coupled with an 89.3% enhancement in the reward and punishment values. Furthermore, this method substantiates the robustness and appropriateness of the chosen feature dimensions.

1. Introduction

The orchestration of crude oil storage and transportation scheduling is pivotal at the forefront of refinery operations, underpinning the safety of oil storage and transit, the stability of production, and the operational efficiency of the refinery [1]. This complex process encompasses the unloading of tankers, the coordination of terminal and factory tank storage, and the seamless transfer of resources to the processing apparatus. Effective scheduling requires intricate decision making across various operational phases, including the timely and precise movement of crude oil to designated units [2]. Objectives focus on maintaining uninterrupted processing, minimizing tanker delays, and optimizing resource allocation across storage and processing units. Operational dispatch must also navigate a myriad of practical considerations, from the punctuality of tanker arrivals to the preparedness of storage facilities and the interconnectivity of various systems. Addressing this large-scale, multiconstraint scheduling challenge is pivotal, representing a dynamic research frontier demanding innovative and efficient solutions.
Contemporary research methodologies addressing refinery crude oil scheduling predominantly draw upon operations research theory [3,4]. These approaches typically entail the formulation of the problem into a mathematical model amenable to solution [5,6,7]. The strength of this strategy lies in its capacity for the precise mathematical articulation of the scheduling process and production objectives, as well as in its ability to identify provably optimal solutions. However, the timeliness of these solutions poses a significant challenge. Presently, refinery crude oil scheduling is often represented and tackled as a large-scale mixed integer programming model, characterized as an NP-hard problem. Absent simplification, such models defy resolution within a practical timeframe.
Recent advancements in deep reinforcement learning have led to notable successes in tackling complex planning problems [8], prompting numerous research initiatives and applications in the realm of production resource scheduling with promising outcomes [9,10,11,12]. This methodology models business challenges as Markov decision processes and learns policies that maximize cumulative rewards through sustained interaction with the environment. Its core strengths lie in its neural-network-based approximation capabilities, rapid sequential decision making, and a degree of adaptability in addressing dynamic programming challenges [13]. Yet, when applied to actual industrial problems, these methods often grapple with expansive state spaces, extended training durations, and convergence difficulties [14], signaling the need for more efficient methods.
This study introduces a novel approach, termed Variational Autoencoder for State Compression in Soft Actor–Critic (VSCS), to model and expedite the training of deep reinforcement learning for refinery scheduling tasks. Initially, this research delineates the Markov decision process for refinery scheduling to lay the groundwork for subsequent optimization. The VSCS methodology employs a variational autoencoder to transmute the extensive, high-dimensional state space into a condensed, low-dimensional representation. Utilizing these distilled features, the VSCS algorithm learns the optimal policies in the reduced feature space, substantially enhancing both the learning efficiency and the efficacy of the derived policies. The paper’s principal contributions are multifaceted, encompassing the following key dimensions:
  • A novel deep reinforcement learning framework, VSCS, is presented, employing a variational autoencoder to distill the complex, high-dimensional state space of refinery crude oil scheduling into a compact, low-dimensional feature space for optimal policy identification.
  • To address the challenge of selecting the dimensionality for low-dimensional features, we devised a method that rigorously evaluates the similarity of feature reconstructions. This approach, integrated with visual analytics, enables the precise determination of the optimal dimensionality for low-dimensional features.
  • The VSCS approach delineated herein underwent comprehensive experiments within the crude oil scheduling problem, conclusively affirming the framework’s efficacy. Experimental validation confirmed the appropriateness of the chosen low-dimensional feature dimensions, establishing a robust empirical foundation for the methodology.
The remainder of this paper is organized as follows. A brief review of related work is presented in Section 2. Section 3 shows the problem formulation. Section 4 presents the details of the VSCS method. Section 5 delineates and deliberates upon the principal experimental outcomes. Finally, some concluding remarks are given in Section 6.

2. Related Work

Crude oil storage and transportation scheduling are critically important to refinery production. This sequential decision-making process encompasses oil tanker arrival and unloading at the port, the conveyance of crude oil from terminal storage to in-plant tanks, and the subsequent delivery of crude materials to processing units. The overarching objective of scheduling is to minimize the cumulative costs, such as operational expenses, while adhering to the operational capabilities of each segment and maintaining the continuous, planned operation of processing units [15].
Production scheduling presents a multifaceted challenge extensively explored within the mathematical programming sphere, with research bifurcating into modeling methodologies and algorithmic solutions. Shah et al. pioneered a discrete-time Mixed-Integer Linear Programming (MILP) framework to navigate the intricacies of crude oil scheduling [16]. Advancing this groundwork, J.M. Pinto et al. crafted mixed-integer optimization models that capture the dichotomy of continuous and discrete temporal dynamics for refinery scheduling [17]. Jialin Xu’s team leveraged continuous-time models for the simulation optimization of refinery operations, showcasing efficacy in scheduling and economic performance [1]. Further refining these approaches, Bernardo Zimberg et al. employed continuous-time models with intricate multioperation sequencing, achieving hourly resolution in their analyses [18]. Lijie Su introduced an innovative continuous–discrete-time hybrid model that stratifies refinery planning and scheduling into hierarchical levels, focusing on multiperiod crude oil scheduling with the aim of maximizing net profits, achieving solution times that range from minutes to hours [19]. Algorithmically, solutions span from MILP-NLP decomposition to solver-integrated responses [20,21,22] and rolling horizon strategies for time-segmented problem-solving [23]. Additionally, intelligent search mechanisms like genetic algorithms have been adopted to bolster solution throughput [24,25,26,27]. Traditional algorithms have thus concentrated on the meticulous detail of model construction and improving efficiency in confronting the complexities of refinery oil storage and transportation. Modeling has progressed from linear representations to intricate nonlinear continuous-time frameworks to mirror operational realities more closely. Nevertheless, the elevated complexity of such models demands the decomposition of problems into tractable subproblems suitable for solver optimization or the application of heuristics and genetic algorithms for more rapid approximate solutions. Consequently, advancing the performance of solutions in this domain remains an ongoing and formidable research challenge. Table 1 shows the different scales and corresponding performances of the calculation examples in the traditional method research of the crude oil scheduling problem.
Deep reinforcement learning (DRL) has emerged as a potent tool for complex decision-making challenges, with its application broadening significantly in recent years [28]. The method distinguishes itself through formidable learning and sequential decision-making capabilities, facilitating swift, dynamic scheduling decisions in diverse real-world scenarios. In the realm of manufacturing, Christian D. et al. employed DRL in the scheduling of chemical production, adeptly managing uncertainties and facilitating on-the-fly processing decisions, thereby surpassing the performance of MILP models [29]. Yong et al. pioneered a DRL-based methodology for dynamic flexible job-shop scheduling (DFJSP), focused on curtailing average delays through policy network training via the DDPG algorithm, thereby eclipsing rule-based and DQN techniques [30]. Che et al. aimed to curtail total operational expenditures to minimize energy usage and reduce the frequency of operational mode transitions, enhancing stability. For this, they utilized the PPO algorithm to train decision networks, yielding quantifiable improvements in cost-efficiency and mode-switching [31]. Lee et al. harnessed DRL to orchestrate semiconductor production line scheduling to align with production agendas, selecting DQN as the algorithm of choice and establishing strategies apt for dynamic manufacturing environments [32]. In the transportation field, Yan et al. addressed the intricacies of single-track railway scheduling, which encompasses train timetabling and station track allocation, via a sophisticated deep learning framework, securing superior results in large-scale scenarios in comparison with the commercial solver IBM CPLEX [33]. Furthermore, Pan et al. implemented hierarchical reinforcement pricing predicated on DDPG to solve the intricate distribution puzzles presented by shared bicycle resources, consequently achieving enhancements in service quality and bicycle distribution [34].
The extant research reveals that prevailing reinforcement learning methodologies face constraints in their deployment for large-scale industrial applications. These constraints arise from the considerable scale and intricacy of the scenarios, which give rise to extensive state–action spaces, thus hindering the efficiency of learning processes [14,35,36]. Within the domain of refinery crude oil scheduling, analogous challenges are encountered. To mitigate these challenges, the present study proposes the VSCS framework, which transposes the original, high-dimensional state space into a more compact, lower-dimensional feature space, thereby improving the learning process for the complexities of crude oil scheduling tasks.

3. Problem Formulation

3.1. Description of the Refinery Scheduling Problem

The refinery scheduling problem presented in this paper can be depicted as an operational process, as illustrated in Figure 1. It encompasses the arrival of crude oil tanker $V_a$ at the port for unloading into designated port storage tanks. These tanks include owned storage vessels $V_d$ and commercial storage vessels $V_b$. Following the desalting and settling operations of crude oil, the port storage tanks can transfer the oil to the in-plant tanks $V_f$ as required via the long-distance pipeline $V_p$. Terrestrial crude oil $V_l$ enters the in-plant storage tanks through the pipeline. The in-plant tank area is tasked with blending different types $m \in M$ of crude oil according to the processing schemes of the processing units $V_u$ and transporting them to the processing units for refining.
The initial conditions for the scheduling decision process include the anticipated arrival time of oil tankers and the storage tanks projected for unloading, the type of crude oil and the liquid level heights $(L_{V_i}^{m,t_0})$ stored in each tank at the outset, the upper and lower limits of tank liquid levels $(CU_b, CL_b)$, the upper limit of the long-distance pipeline $C_p$, and the topology of the scheduling network. The operational constraints considered are as follows:
  • Within a single cycle, each tank must contain only one type of oil product.
  • Communal storage tanks and dock tanks can only commence oil transfer operations after completing static desalting.
  • The liquid levels in all storage tanks must be maintained within the specified upper and lower capacity limits.
  • The transfer rates must remain within the safe transfer speed range.
  • Crude oil transported via overland pipelines enters the factory tanks at a predetermined rate.
  • The processing units must operate continuously in accordance with the specified processing schemes and plans.

3.2. Markov Modeling

The scheduling objective of this study is to devise a decision-making scheme that minimizes scheduling costs within a short cycle of seven days (with each time step being four hours) while considering the operational constraints of refinery scheduling and the continuity of processing units. The decision scheme includes the oil transfer rates and target tanks for each storage unit. The refinery crude oil scheduling problem can be viewed as a sequential decision-making problem: the state of each node in the refinery’s crude oil storage and transportation operation in the next period depends on the current state and the decisions made in the current period. Hence, the scheduling problem can be modeled as a Markov decision process.
In the refinery scheduling Markov decision process, the type and level of materials in each storage tank are closely related to the scheduling objectives following operational execution. Moreover, the refinery’s processing units require continuous feeding according to the processing plan; thus, the remaining processing volumes of various materials in the units must also be considered. Based on these considerations, the state is defined as follows, as illustrated in Equation (1).
$$S = \left\{ S_{V_a}^t, S_{V_b}^t, S_{V_d}^t, S_{V_p}^t, S_{V_f}^t, S_{V_u}^t \right\} \tag{1}$$
where $S_{V_a}^t$ includes $L_{V_a}^{m,t}$, the remaining unloading time of the tanker, and other attribute information (such as the node name). $S_{V_b}^t$, $S_{V_d}^t$, and $S_{V_f}^t$ represent the corresponding tank-level information $L_{V_b}^{m,t}$, $L_{V_d}^{m,t}$, and $L_{V_f}^{m,t}$, respectively, along with other attribute information (such as the node name). $S_{V_p}^t$ represents the oil-head information of the pipeline connecting the terminal tank area and the commercial storage tank area, the pipeline transportation volume $L_p^t$, and related attributes; $S_{V_u}^t$ includes the processing plan and the remaining processing volume of the processing unit.
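As an illustration of how the state of Equation (1) might be materialized for a learning algorithm, the sketch below flattens the per-node quantities into a single vector. The dictionary keys, node attributes, and helper name are hypothetical placeholders and are not taken from the paper; only the idea of concatenating node-level information into one state vector (61-dimensional in the later experiments) comes from the text.

```python
import numpy as np

# Illustrative sketch only: field names below are hypothetical, not the authors' encoding.
def build_state(tanker, port_tanks, plant_tanks, pipeline, units):
    """Flatten the per-node information S_Va, S_Vb/S_Vd, S_Vf, S_Vp, S_Vu into one vector."""
    parts = [
        [tanker["remaining_unloading_time"]] + list(tanker["levels_by_type"]),      # S_Va
        np.concatenate([t["levels_by_type"] for t in port_tanks]),                  # S_Vb, S_Vd
        np.concatenate([t["levels_by_type"] for t in plant_tanks]),                 # S_Vf
        [pipeline["oil_head"], pipeline["volume"]],                                 # S_Vp
        np.concatenate([[u["remaining_volume"]] + list(u["plan"]) for u in units]), # S_Vu
    ]
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])
```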
For the refinery’s crude oil storage and transportation scheduling problem, the decision-making network is required to determine the appropriate scheduling actions in response to the varying states at each time period t. The action space is defined by the operational requirements of each node, with the specific action definitions provided in Equation (2).
$$A = \left\{ A_{V_a}^t, A_{V_b}^t, A_{V_d}^t, A_{V_p}^t, A_{V_u}^t \right\} \tag{2}$$
where $A_{V_a}^t$ represents the joint decision-making action of node $V_a$, including the oil unloading speed. $A_{V_b}^t$ and $A_{V_d}^t$ are the joint decision-making actions of $V_b$ and $V_d$, respectively, including the oil delivery speed of the commercial storage tanks and terminal tanks and the target node for oil delivery. $A_{V_p}^t$ is the pipeline transportation speed, and $A_{V_u}^t$ includes the processing speed.
In the proposed refinery crude oil scheduling model, each action executed during a scheduling step is assessed by the system through corresponding rewards, which serve to evaluate the efficacy of the action strategy. The objective of this model is to concurrently minimize operational events and maximize adherence to production constraints, according to the stipulated full-cycle processing plan. To facilitate the agent’s strategy enhancement in alignment with this objective during training, the reward function is crafted to precisely guide action decisions. This function is expressed through Equations (3)–(6). Given that the algorithm aims to optimize the long-term average reward, the reward function is structured with negative values that are proportional to associated costs.
$$R = -\omega_0 R_0 - \omega_1 R_1 - \omega_2 R_2 \tag{3}$$
$$R_0 = \sum_{t \in T} \sum_{i \in \{b,d,f\}} O_d \times \left( \left| CU_{V_i} - L_{V_i}^{m,t} \right| + \left| L_{V_i}^{m,t} - CL_{V_i} \right| \right) + \sum_{t \in T} O_d \times \left| L_p^t - C_p \right| + \sum_{t > T_a} O_a \times L_{V_a}^{m,t} \tag{4}$$
$$R_1 = O_p \times \left( N_{P_a} + N_{P_b} + N_{P_d} + N_{P_f} \right) + O_b \times \left( N_{b_d} + N_{b_f} + \sum_{i \in V_u} N_{b_i} \right) \tag{5}$$
$$R_2 = \sum_{t \in T} O_c \, L_{V_i}^{m,t} \tag{6}$$
As shown in the above equations, $R$ consists of three parts, where $\omega$ is the weight factor of each part. $R_0$ is the reward and punishment term for exceeding the operation constraints, composed of the storage tanks and the pipeline exceeding their operation constraints and the oil tanker overdue constraint. $R_1$ is the reward and punishment term for speed fluctuations, that is, operation switching; it is composed of the tanker unloading-speed switching, the oil delivery switching of each storage tank, the processing-speed switching of the processing units, and the reward and punishment for oil type switching. $R_2$ is the reward and punishment term for the inventory cost.
In our model, $O_a$ denotes the cost coefficient associated with the delay in oil tanker unloading, with $T_a$ representing the corresponding delay time. $O_p$ is defined as the cost coefficient for speed fluctuations, while $N_{P_v}$ indicates the number of such fluctuations at each node. The term $O_b$ refers to the cost incurred due to switching between different types of oil, with $N_{b_v}$ quantifying the frequency of these oil species switches at each node. Lastly, $O_d$ represents the cost coefficient for instances when the liquid level exceeds the predetermined upper and lower limits, and $O_c$ signifies the cost coefficient related to inventory management.
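A compact numerical sketch of how Equations (3)–(6) combine the three cost terms is given below. The weight values, cost coefficients, and the exact bookkeeping of violations and switches are illustrative assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of Equations (3)-(6); coefficient values and aggregation are assumptions.
def reward(level_violations, pipeline_overruns, tanker_delay_volumes,
           n_rate_switches, n_oil_type_switches, inventory_levels,
           w=(1.0, 1.0, 1.0), O_d=1.0, O_a=1.0, O_p=1.0, O_b=1.0, O_c=1.0):
    # R0: penalties for violating tank/pipeline limits and for overdue tanker volume
    R0 = O_d * (np.sum(level_violations) + np.sum(pipeline_overruns)) \
         + O_a * np.sum(tanker_delay_volumes)
    # R1: penalties for speed (operation) switching and oil-type switching
    R1 = O_p * np.sum(n_rate_switches) + O_b * np.sum(n_oil_type_switches)
    # R2: inventory cost
    R2 = O_c * np.sum(inventory_levels)
    # Negative, cost-proportional reward, as in Equation (3)
    return -(w[0] * R0 + w[1] * R1 + w[2] * R2)
```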

4. The Proposed VSCS Algorithm

4.1. The Framework of VSCS

The VSCS framework introduced in this study comprises two primary modules: the low-dimensional feature generation module and the policy learning module. The former autonomously extracts a condensed, low-dimensional feature representation, while the latter module leverages these features to facilitate efficient policy learning. Figure 2 delineates the structural organization and operational sequence of the VSCS framework within the context of refinery crude oil resource scheduling.
As depicted in Figure 2, the policy learning module, rooted in deep reinforcement learning, principally employs the Soft Actor–Critic (SAC) framework. This framework encompasses a policy network, a state value network, and an action value network. The objective is to deduce the appropriate reward feedback following state transitions within the refinery’s crude oil storage and transportation scheduling environment. This is achieved by reconstructing the state into a lower-dimensional representation for efficient network training and subsequent action strategy formulation. The state low-dimensional feature generation module functions as a pretraining mechanism, utilizing an encoder network trained via the VAE architecture to transform the state space into a reduced feature space. This transformation is instrumental in facilitating the strategic training of the main framework. Each module is expounded upon in the subsequent sections.

4.2. Low-Dimensional Feature Generation Module

The objective of the low-dimensional feature generation module is to transmute the original, high-dimensional state space into a more tractable, low-dimensional state space while preserving the integrity of the state information to the greatest extent possible. This study employs a VAE to produce low-dimensional state features through unsupervised learning [37]. The VAE operates as a probabilistic model grounded in variational inference, comprising two primary components. The first is the encoder, which is tasked with condensing the high-dimensional state $X$ into a compact, low-dimensional representation $Z$, which obeys a Gaussian distribution parameterized by the mean $\mu$ and variance $\sigma$ generated by the encoder. The complementary component of the VAE is the decoder, which regenerates the original features by reconstructing the latent variable $Z$ back into the state vector $X'$, as illustrated in Figure 3. More computational details are shown in Algorithm 1.
In accordance with Bayesian principles, the joint probability distribution of the observed state vector X and the latent variable Z can be represented as depicted in Equation (7).
$$p(Z \mid X) = p(X \mid Z) \, p(Z) / p(X) \tag{7}$$
However, due to the intractability of $p(X)$, this study introduces an alternative distribution to approximate $p(Z \mid X)$. This approximative distribution, denoted as $q_\beta(Z \mid X)$, serves as an estimation of the posterior model (encoder), whereby $Z$ is derived from $X$. The distribution $p_\eta(X \mid Z) \, p_\eta(Z)$ corresponds to the generative model (decoder). The encoder and decoder training process involves the concurrent learning of the parameters $\beta$ and $\eta$.
A central aspect of this work is the simultaneous training of the approximate posterior model and the generative model by maximizing the variational lower bound, which is articulated in Equation (8).
$$\zeta = -D_{KL}\left( q_\beta(Z \mid X) \,\|\, p_\eta(Z) \right) + \mathbb{E}_{q_\beta(Z \mid X^{(i)})}\left[ \log p_\eta\left( X^{(i)} \mid Z \right) \right] \tag{8}$$
The framework presumes that $p_\eta(Z)$ adheres to a Gaussian distribution, delineated in Equation (9), with $Z$ derived through Gaussian sampling as per Equation (10). Herein, $\mu$ represents the mean, $\sigma$ denotes the variance, and $i$ is the index of the sample.
$$p_\eta(Z) \sim \mathcal{N}(0, 1) \tag{9}$$
$$q_\beta\left( Z \mid X^{(i)} \right) \sim \mathcal{N}\left( \mu^{(i)}, \sigma^{2(i)} \right) \tag{10}$$
The loss function of this model comprises two components: the Kullback–Leibler (KL) divergence and the reconstruction loss, with the inferable outcomes delineated in Equation (11). Here, $X_i$ signifies the encoder network’s input, and $X_i'$ denotes the output of the decoder network.
$$\zeta = \frac{1}{2n} \sum_{j=1}^{n} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right) + \frac{1}{2n} \sum_{i=1}^{n} \left( X_i - X_i' \right)^2 \tag{11}$$
In the foregoing equation, the term $D_{KL}\left( q_\beta(Z \mid X) \,\|\, p_\eta(Z) \right)$ represents the approximation capability of the approximate posterior model, while $\mathbb{E}_{q_\beta(Z \mid X^{(i)})}\left[ \log p_\eta\left( X^{(i)} \mid Z \right) \right]$ signifies the reconstructive ability of the generative model to regenerate $X$ from $Z$. Consequently, this methodology can be employed to derive low-dimensional features from the initial state of crude oil storage and transportation dispatch, thereby attaining a reconstructed state that mirrors the description of the original state information to the greatest extent feasible.
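For readers who prefer code, a minimal PyTorch sketch of the encoder–decoder pair and the loss of Equation (11) is shown below. The layer sizes mirror the setup reported in Section 5 (61-dimensional states, one 40-unit hidden layer, a 30-dimensional latent space); the activations, class names, and the batch-mean form of the loss are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the low-dimensional feature generation module (assumed architecture).
class StateVAE(nn.Module):
    def __init__(self, state_dim=61, hidden_dim=40, latent_dim=30):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, state_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.log_var(h)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterized sample of Z
        return self.dec(z), mu, log_var

def vae_loss(x, x_rec, mu, log_var):
    # KL divergence to N(0, 1) plus squared reconstruction error, as in Equation (11)
    kl = 0.5 * torch.mean(torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var, dim=1))
    rec = 0.5 * torch.mean(torch.sum((x - x_rec).pow(2), dim=1))
    return kl + rec
```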
Algorithm 1 Steps of computation in low-dimensional feature generation module
1: Initialize: $D$, $q_\beta(Z \mid X)$, $p_\eta(X \mid Z)$, $\beta$, $\eta$
2: while $(\beta, \eta)$ not converged do
3:     $M \leftarrow$ minibatch sampled from $D$
4:     $Z \leftarrow$ random sample from the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$
5:     Compute $\zeta$ and its gradients
6:     Update $(\beta, \eta)$
7: end while
8: return $\beta$, $\eta$
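A possible realization of Algorithm 1, reusing the StateVAE and vae_loss sketched above, is given below. Here D is assumed to be a tensor of sampled states (e.g., 2048 x 61); the fixed epoch count stands in for the convergence test on $(\beta, \eta)$, and the optimizer settings are assumptions.

```python
import torch

# Sketch of the VAE pretraining loop of Algorithm 1 (assumed hyperparameters).
def pretrain_vae(vae, D, epochs=200, batch_size=128, lr=1e-3):
    opt = torch.optim.Adam(vae.parameters(), lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(D.shape[0])
        for i in range(0, D.shape[0], batch_size):
            x = D[perm[i:i + batch_size]]           # minibatch M drawn from D
            x_rec, mu, log_var = vae(x)             # Gaussian sampling of Z happens in forward()
            loss = vae_loss(x, x_rec, mu, log_var)  # compute zeta
            opt.zero_grad()
            loss.backward()
            opt.step()                              # update (beta, eta)
    return vae
```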

4.3. Policy Learning Module

Leveraging the low-dimensional feature generation module, it is possible to produce a low-dimensional feature vector of the environment’s original state, which facilitates the ensuing policy learning process. To guarantee the efficiency of policy training, the policy generation module in this study adopts the SAC framework as the principal structure for policy learning. This framework, predicated on the theory of entropy maximization, ensures that network updates equilibrate the maximization of expected returns with entropy, thereby enhancing the network’s exploration capabilities and expediting the learning process. The objective function is articulated in Equation (12).
$$\pi^* = \arg\max_\pi \, \mathbb{E}_{(s_t, a_t) \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) + \alpha H\left( \pi(\cdot \mid s_t) \right) \right] \tag{12}$$
$$H\left( \pi(\cdot \mid s_t) \right) = \mathbb{E}\left[ -\log \pi(\cdot \mid s_t) \right] \tag{13}$$
In Equation (12), r denotes the reward function, and γ is the discount factor, while α signifies the entropy regularization coefficient, employed to modulate the significance of entropy in the learning process. In Equation (13), H represents the entropy value. A greater entropy value corresponds to a heightened level of exploration by the agent, promoting a more thorough investigation of the action space.
The training network within this framework comprises a policy network $\pi_\phi$, an action value network $Q_{\theta_1, \theta_2}(a_t, s_t)$, and a target network, which are parameterized by $\phi$, $\theta_1$, and $\theta_2$, respectively. The action value network $Q_{\theta_1, \theta_2}(a_t, s_t)$ incorporates a dual Q-network structure. The soft Q-value is determined by taking the minimum of the two Q-value functions parameterized by $\theta_1$ and $\theta_2$. This approach is designed to mitigate the overestimation of inappropriate Q-values and to enhance the speed of training. The soft Q-value function is refined by minimizing the Bellman error, as detailed in Equation (14).
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D} \left[ \frac{1}{2} \left( Q_{\theta_{i=1,2}}(s_t, a_t) - \left( r(s_t, a_t) + \gamma V_{\bar{\varphi}}(s_{t+1}) \right) \right)^2 \right] \tag{14}$$
$$V_{\bar{\varphi}}(s_{t+1}) = Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \tag{15}$$
where $V_{\bar{\varphi}}(s_{t+1})$ represents the state value of the agent at time $t+1$, and $Q_{\bar{\theta}}(s_{t+1}, a_{t+1})$ can be estimated using the target network.
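The sketch below shows one way Equations (14) and (15) translate into a critic update. The two Q-networks, their target copies, and the policy.sample helper (returning an action and its log-probability) are assumed interfaces of this illustration, not the paper's implementation.

```python
import torch

# Hedged sketch of the dual-Q critic loss of Equations (14)-(15); interfaces are assumed.
def critic_loss(q1, q2, q1_target, q2_target, policy, batch, gamma, alpha):
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next, log_pi_next = policy.sample(s_next)
        q_next = torch.min(q1_target(s_next, a_next), q2_target(s_next, a_next))
        v_next = q_next - alpha * log_pi_next          # soft state value, Equation (15)
        target = r + gamma * v_next
    loss1 = 0.5 * ((q1(s, a) - target) ** 2).mean()    # Bellman error, Equation (14)
    loss2 = 0.5 * ((q2(s, a) - target) ** 2).mean()
    return loss1 + loss2
```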
The policy network $\pi_\phi$ is updated by minimizing the KL divergence, as shown in Equation (16).
$$J_\pi(\phi) = \mathbb{E}_{a_t \sim \pi, \, s_t \sim D} \left[ \log \pi_\phi(a_t \mid s_t) - \min_{i=1,2} Q_{\theta_i}(s_t, a_t) \right] \tag{16}$$
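Correspondingly, a hedged sketch of the actor update implied by Equation (16) is shown below; the entropy coefficient is written explicitly, as in standard SAC implementations, and policy.sample is again an assumed helper.

```python
import torch

# Hedged sketch of the policy objective of Equation (16); interfaces are assumed.
def policy_loss(q1, q2, policy, states, alpha):
    a, log_pi = policy.sample(states)                # reparameterized actions and log-probs
    q_min = torch.min(q1(states, a), q2(states, a))  # minimum over the dual Q-networks
    return (alpha * log_pi - q_min).mean()
```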
The proposed VSCS method is outlined in Algorithm 2.
Algorithm 2 The proposed VSCS Algorithm
1: Initialize: the VAE encoder $N_{enc}$; $\theta_1$, $\theta_2$, $\phi$ in the Q networks and the policy network.
2: $\bar{\theta}_1 = \theta_1$, $\bar{\theta}_2 = \theta_2$. Initialize experience buffer $D$
3: for each iteration do
4:     for each environment step do
5:         $a_t = \pi_\phi(a_t \mid s_t)$
6:         $s_{t+1} = p(s_{t+1} \mid s_t, a_t)$
7:         $s_t' = N_{enc}(s_t)$
8:         $s_{t+1}' = N_{enc}(s_{t+1})$
9:         $D = D \cup \{ s_t', a_t, r_t, s_{t+1}' \}$
10:     end for
11:     for each gradient step do
12:         Sample from $D$
13:         Calculate the loss and update the action value network according to Equations (14) and (15)
14:         Calculate the loss and update the policy network according to Equation (16)
15:         Update the entropy regularization coefficient $\alpha$
16:         Update the parameters of the target Q-network
17:     end for
18: end for
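To make the environment-interaction steps of Algorithm 2 (lines 4–10) concrete, the sketch below passes raw simulator states through the pretrained VAE encoder (mean head) before storing transitions in the replay buffer, so the SAC networks only ever see the compressed features. The env, policy, and buffer interfaces used here are assumptions for illustration.

```python
import torch

# Sketch of one VSCS interaction step; env.step/policy.sample/buffer.append are assumed APIs.
def collect_step(env, vae, policy, buffer, s_raw):
    with torch.no_grad():
        z = vae.encode(s_raw)[0]                   # low-dimensional feature of s_t
        a, _ = policy.sample(z.unsqueeze(0))
    s_next_raw, r, done = env.step(a.squeeze(0))
    with torch.no_grad():
        z_next = vae.encode(s_next_raw)[0]         # feature of s_{t+1}
    buffer.append((z, a.squeeze(0), r, z_next, done))
    return s_next_raw, done
```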

5. Experiment

To validate the efficacy of the proposed approach, this study conducts comprehensive experiments on the crude oil scheduling problem. The experiments include the following:
  • Comparing the VSCS method introduced in this study with baseline algorithms using a dataset of refinery crude oil storage and transportation scheduling from an actual scenario.
  • Analyzing the performance of the algorithm at various compression scales to determine the optimal low-dimensional feature dimensionality.
  • Conducting a similarity analysis between low-dimensional reconstructed state features and original state samples and proposing a state reconstruction threshold for refinery crude oil scheduling problems based on reconstruction similarity.
  • Evaluating the performance of the proposed algorithm by visualizing the low-dimensional features.
The goal of these experiments is to thoroughly assess the advantages and practical applicability of the proposed VSCS method in real-world crude oil scheduling tasks.

5.1. Data for Simulator

This investigation employs a dataset from a bona fide operational context within an oil company, encompassing various node types and their attributes, such as oil tankers, terminal tanks, commercial storage tanks, in-plant tanks, and processing devices, as delineated in Section 3. The dataset details encompass tanker oil load by type and volume, the initial liquid levels in storage tanks, the types of oil they house, storage capacities, transfer capabilities, and their processing apparatus’ schemes and capacities. Integral to this study’s reinforcement learning framework, the simulator accurately emulates the intricate and dynamic processes of crude oil storage and transportation within a refinery. The experimental setup utilizes a single oil tanker, 14 terminal storage tanks, 9 in-plant storage tanks, and 2 processing devices. This simulator facilitates an interactive learning milieu for the proposed algorithm, enabling adaptive training against the evolving dynamics of the refinery environment, providing continual feedback throughout the training phase, and assessing the algorithm’s efficacy. The data input for the low-dimensional feature generation module is derived from sampling the experience pool within the aforementioned simulation environment, with a sampling scale consisting of 2048 random state samples, each with 61 dimensions.
In this study, the benchmark comparison is conducted against the SAC algorithm, a model premised on entropy maximization theory [38,39]. This approach ensures that updates to the training network balance the maximization of expected returns with entropy, thereby enhancing the algorithm’s capacity for exploration and expediting the learning process.

5.2. Comparison with Baseline Algorithm

This section evaluates the enhanced performance of the proposed VSCS algorithm with respect to training convergence speed and the value of the final reward obtained after learning. To assess the stability of the algorithm following state reconstruction via the VAE, the SAC algorithm is employed as the baseline for comparison. The experimental procedure involved repeated runs with ten different random seeds to determine the average learning efficacy of both the proposed algorithm and the baseline algorithm. The learning performance is depicted through an average learning curve for clarity. Furthermore, in the experimental results, the rewards are logarithmically transformed for a more coherent representation, as depicted in Figure 4. The principal parameters for the proposed VSCS algorithm are summarized in Table 2.
Table 3 demonstrates that the VSCS algorithm proposed in this study markedly outperforms the baseline algorithm regarding the final reward value attained, showcasing an 89.3% enhancement in the final average reward value. In terms of training efficiency, the VSCS algorithm achieves the maximum reward in just 47 iterations. This represents a 77.5% increase in the rate at which training attains a stable state compared with the baseline algorithm. Additionally, the VSCS algorithm exhibits superior training stability relative to the baseline.
The reconstruction and compression of the state dimension prior to training the SAC network results in a significant reduction in the required sample size during the training process. This efficiency gain in sample size directly translates to enhanced network training efficiency, as the model can achieve comparable or superior learning outcomes with fewer data points.

5.3. Impact of Reconstruction with Different Compression Sizes

To assess the impact of the proposed VSCS algorithm on convergence speed and stability across varying compression scales, we conducted tests with dimensionalities set at 10, 15, 20, 25, 30, 35, 40, 45, 50, and 55. For each dimensionality, three sets of randomized trials were performed, with the average learning curves serving as the evaluative metric. The resulting learning curves are presented in Figure 5, and the algorithmic improvement rates are detailed in Table 4.
Figure 5 reveals that the training process experiences increased instability when the algorithm is compressed to scales of 10 and 20, which is attributable to excessive compression that results in the loss of substantial state information. Conversely, compression scales of 30, 40, and 50 demonstrate relative stability, with the scale of 30 yielding the most effective learning strategy.
Table 4 shows that the VSCS algorithm improves training efficiency over the baseline algorithm at all compression scales except scale 15. Notably, at scale 40, the VSCS algorithm required only 47 rounds to achieve the cumulative maximum reward for the first time—a 77.51% increase in the rate of reaching a steady training state compared with the baseline. Furthermore, the learning performance of the VSCS algorithm was enhanced across all scales, showing an improvement rate exceeding 82%. The scales of 30 and 45 demonstrated the most significant enhancements, with an improvement rate of 92.95% in learning performance compared with the baseline.

5.4. Reconstructed State Vector Similarity Analysis

In this analysis, we investigate the fidelity of state reconstruction by examining the similarity between the compressed and original states. We use the reconstruction distance to elucidate the reasons behind the enhanced training performance observed with reconstructed state vectors and introduce a threshold for reconstruction error tailored to the challenges of refinery crude oil storage and transportation scheduling. The experiment evaluates the encoder network of the VAE at compression scales of 10, 15, 20, 25, 30, 35, 40, 45, 50, and 55. We assess the congruence between 2048 original state samples and their reconstructed counterparts, which are produced by the decoder network, using Euclidean distance. The results, reflecting the similarity of output samples, are detailed in Table 5.
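The per-sample similarity computation used here is straightforward; a short sketch is given below, assuming the original and reconstructed samples are stacked as NumPy arrays (e.g., 2048 x 61). The function name and output keys are illustrative choices matching the statistics reported in Table 5.

```python
import numpy as np

# Sketch of the reconstruction-distance analysis of Section 5.4 (assumed array layout).
def reconstruction_stats(X, X_rec):
    d = np.linalg.norm(X - X_rec, axis=1)   # per-sample Euclidean distance
    return {"mean": d.mean(), "max": d.max(), "min": d.min(),
            "variance": d.var(), "std": d.std(), "median": np.median(d)}
```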
The encoder network with a compression scale of 30 demonstrates notable performance, yielding the highest mean similarity for reconstructed states. As detailed in Table 5, the arithmetic mean of similarity scores stands at 12.47, with a variance of 606.8.
After rigorous experimental analysis, it was determined that the reconstruction error threshold for refinery crude oil storage and transportation scheduling problems should be set at 12.47. This threshold implies that when the similarity distance falls below 12.47, the network is deemed to have achieved the standard of reconstruction.

5.5. Visual Analysis of Low-Dimensional Features

In this section, we delve into the characteristics of reconstructed states via low-dimensional visualization to elucidate the optimal effect achieved by compressing to 30 dimensions. The experiment involved reducing the dimensionality of 500 reconstructed state samples, across 10, 20, 30, 40, and 50 dimensions, down to a 2-dimensional plane using the UMAP technique [41]. We then observed the distribution of samples within this plane, employing cumulative average intracluster distance and intracluster density as metrics for quantitative analysis of the low-dimensional spatial formation. For the UMAP method, the approximate nearest-neighbor number parameter was set to 5, with the minimum interpoint distance parameter fixed at 0.3. The outcomes, displayed in Figure 6, reveal that in the two-dimensional space, the reconstructed states form clusters. Notably, the clusters at 30, 40, and 50 dimensions are more densely packed, whereas those at 10 and 20 dimensions exhibit greater dispersion.
The quantitative analysis, utilizing the cumulative average intracluster distance (outlined in Equation (17)) and the intracluster density (specified in Equation (18)), is detailed in Table 6. Throughout the analysis, five distinct parameter configurations were employed for the assessment of means. As evidenced by Table 6, within the 30-dimensional reconstruction, the cumulative average intracluster distance is recorded at 64.10, with an average intracluster density of 0.0968—both metrics represent the most favorable values among the five parameter sets examined. These findings indicate that the 30-dimensional reconstruction yields the most cohesive cluster structure within the sample distribution.
$$d_{\mathrm{avgdist}} = \sum_{i \in D} \sum_{j \in D} \mathrm{dist}\left( x_i, x_j \right) / n_D \tag{17}$$
$$d_{\mathrm{avgcent}} = \sum_{i \in D} \mathrm{dist}\left( x_i, x_{\mathrm{center}} \right) / n_D \tag{18}$$
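As a sketch of how the projection and the two metrics of Equations (17) and (18) can be computed, the following assumes the umap-learn package and a NumPy array of reconstructed states; the function name, the pairwise-distance implementation, and the treatment of all projected samples as one cluster are simplifying assumptions of this illustration.

```python
import numpy as np
import umap  # the umap-learn package is assumed

# Sketch of the visual analysis: UMAP projection plus the metrics of Equations (17)-(18).
def visual_metrics(X_rec, n_neighbors=5, min_dist=0.3):
    emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit_transform(X_rec)
    n = emb.shape[0]
    pairwise = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    d_avgdist = pairwise.sum() / n                                         # Equation (17)
    d_avgcent = np.linalg.norm(emb - emb.mean(axis=0), axis=1).sum() / n   # Equation (18)
    return d_avgdist, d_avgcent
```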

6. Conclusions

This study introduces the VSCS algorithm to expedite the training process of deep reinforcement learning models. The VSCS framework incorporates two key components: a low-dimensional feature generation module and a policy learning module. The former serves as a pretraining phase, leveraging a VAE to faithfully encapsulate the original state information within a reduced feature space. Upon completion of the training, the low-dimensional feature generation module integrates into the primary framework, furnishing the policy learning module with compact feature representations for policy network training. This synergistic approach facilitates end-to-end learning across both modules. A novel methodology was also developed to ascertain the optimal dimensionality for these low-dimensional features, accounting for reconstruction fidelity and visual analysis outcomes. A comprehensive experiment with the proposed method on the crude oil scheduling problem not only confirmed the efficacy of the framework but also empirically validated the optimal selection of low-dimensional feature dimensions.
The methodology presented herein primarily addresses the enhancement of performance in deep reinforcement learning when confronted with large-scale state representations. While it has yielded promising results, the prospect of its application within the industrial sector necessitates additional thorough investigation. Future research directives could include conducting generalizability studies on scheduling decisions across various refineries to solidify the method’s applicability and robustness in diverse industrial contexts.

Author Contributions

N.M., conceptualization, methodology, software, formal analysis, visualization, writing—original draft, and writing—review and editing; H.L. (Hongqi Li), supervision, writing—original draft, and writing—review and editing; H.L. (Hualin Liu), formal analysis, writing—original draft, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are not publicly available due to restrictions, as they contain information that could compromise the privacy and interests of the company.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, J.; Zhang, S.; Zhang, J.; Wang, S.; Xu, Q. Simultaneous scheduling of front-end crude transfer and refinery processing. Comput. Chem. Eng. 2017, 96, 212–236. [Google Scholar] [CrossRef]
  2. Jia, Z.; Ierapetritou, M.; Kelly, J.D. Refinery short-term scheduling using continuous time formulation: Crude-oil operations. Ind. Eng. Chem. Res. 2003, 42, 3085–3097. [Google Scholar] [CrossRef]
  3. Zheng, W.; Gao, X.; Zhu, G.; Zuo, X. Research progress on crude oil operation optimization. CIESC J. 2021, 72, 5481. [Google Scholar]
  4. Hamisu, A.A.; Kabantiok, S.; Wang, M. An Improved MILP model for scheduling crude oil unloading, storage and processing. In Computer Aided Chemical Engineering; Elsevier: Lappeenranta, Finland, 2013; Volume 32, pp. 631–636. [Google Scholar]
  5. Zhang, H.; Liang, Y.; Liao, Q.; Gao, J.; Yan, X.; Zhang, W. Mixed-time mixed-integer linear programming for optimal detailed scheduling of a crude oil port depot. Chem. Eng. Res. Des. 2018, 137, 434–451. [Google Scholar] [CrossRef]
  6. Furman, K.C.; Jia, Z.; Ierapetritou, M.G. A robust event-based continuous time formulation for tank transfer scheduling. Ind. Eng. Chem. Res. 2007, 46, 9126–9136. [Google Scholar] [CrossRef]
  7. Li, F.; Qian, F.; Du, W.; Yang, M.; Long, J.; Mahalec, V. Refinery production planning optimization under crude oil quality uncertainty. Comput. Chem. Eng. 2021, 151, 107361. [Google Scholar] [CrossRef]
  8. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef] [PubMed]
  9. Esteso, A.; Peidro, D.; Mula, J.; Díaz-Madroñero, M. Reinforcement learning applied to production planning and control. Int. J. Prod. Res. 2023, 61, 5772–5789. [Google Scholar] [CrossRef]
  10. Dong, Y.; Zhang, H.; Wang, C.; Zhou, X. Soft actor-critic DRL algorithm for interval optimal dispatch of integrated energy systems with uncertainty in demand response and renewable energy. Eng. Appl. Artif. Intell. 2024, 127, 107230. [Google Scholar] [CrossRef]
  11. Kuhnle, A.; Kaiser, J.P.; Theiß, F.; Stricker, N.; Lanza, G. Designing an adaptive production control system using reinforcement learning. J. Intell. Manuf. 2021, 32, 855–876. [Google Scholar] [CrossRef]
  12. Park, J.; Chun, J.; Kim, S.H.; Kim, Y.; Park, J. Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning. Int. J. Prod. Res. 2021, 59, 3360–3377. [Google Scholar] [CrossRef]
  13. Yang, X.; Wang, Z.; Zhang, H.; Ma, N.; Yang, N.; Liu, H.; Zhang, H.; Yang, L. A review: Machine learning for combinatorial optimization problems in energy areas. Algorithms 2022, 15, 205. [Google Scholar] [CrossRef]
  14. Ogunfowora, O.; Najjaran, H. Reinforcement and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization. J. Manuf. Syst. 2023, 70, 244–263. [Google Scholar] [CrossRef]
  15. Hamisu, A.A.; Kabantiok, S.; Wang, M. Refinery scheduling of crude oil unloading with tank inventory management. Comput. Chem. Eng. 2013, 55, 134–147. [Google Scholar] [CrossRef]
  16. Shah, N. Mathematical programming techniques for crude oil scheduling. Comput. Chem. Eng. 1996, 20, S1227–S1232. [Google Scholar] [CrossRef]
  17. Pinto, J.M.; Joly, M.; Moro, L.F.L. Planning and scheduling models for refinery operations. Comput. Chem. Eng. 2000, 24, 2259–2276. [Google Scholar] [CrossRef]
  18. Zimberg, B.; Ferreira, E.; Camponogara, E. A continuous-time formulation for scheduling crude oil operations in a terminal with a refinery pipeline. Comput. Chem. Eng. 2023, 178, 108354. [Google Scholar] [CrossRef]
  19. Su, L.; Bernal, D.E.; Grossmann, I.E.; Tang, L. Modeling for integrated refinery planning with crude-oil scheduling. Chem. Eng. Res. Des. 2023, 192, 141–157. [Google Scholar] [CrossRef]
  20. Castro, P.M.; Grossmann, I.E. Global optimal scheduling of crude oil blending operations with RTN continuous-time and multiparametric disaggregation. Ind. Eng. Chem. Res. 2014, 53, 15127–15145. [Google Scholar] [CrossRef]
  21. Assis, L.S.; Camponogara, E.; Menezes, B.C.; Grossmann, I.E. An MINLP formulation for integrating the operational management of crude oil supply. Comput. Chem. Eng. 2019, 123, 110–125. [Google Scholar] [CrossRef]
  22. Assis, L.S.; Camponogara, E.; Grossmann, I.E. A MILP-based clustering strategy for integrating the operational management of crude oil supply. Comput. Chem. Eng. 2021, 145, 107161. [Google Scholar] [CrossRef]
  23. Zimberg, B.; Camponogara, E.; Ferreira, E. Reception, mixture, and transfer in a crude oil terminal. Comput. Chem. Eng. 2015, 82, 293–302. [Google Scholar] [CrossRef]
  24. Ramteke, M.; Srinivasan, R. Large-scale refinery crude oil scheduling by integrating graph representation and genetic algorithm. Ind. Eng. Chem. Res. 2012, 51, 5256–5272. [Google Scholar] [CrossRef]
  25. Hou, Y.; Wu, N.; Zhou, M.; Li, Z. Pareto-optimization for scheduling of crude oil operations in refinery via genetic algorithm. IEEE Trans. Syst. Man Cybern. Syst. 2015, 47, 517–530. [Google Scholar] [CrossRef]
  26. Hou, Y.; Wu, N.; Li, Z.; Zhang, Y.; Qu, T.; Zhu, Q. Many-objective optimization for scheduling of crude oil operations based on NSGA-III with consideration of energy efficiency. Swarm Evol. Comput. 2020, 57, 100714. [Google Scholar] [CrossRef]
  27. Ramteke, M.; Srinivasan, R. Integrating graph-based representation and genetic algorithm for large-scale optimization: Refinery crude oil scheduling. In Computer Aided Chemical Engineering; Elsevier: Amsterdam, The Netherlands, 2011; Volume 29, pp. 567–571. [Google Scholar]
  28. Badia, A.P.; Piot, B.; Kapturowski, S.; Sprechmann, P.; Vitvitskyi, A.; Guo, Z.D.; Blundell, C. Agent57: Outperforming the atari human benchmark. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 507–517. [Google Scholar]
  29. Hubbs, C.D.; Li, C.; Sahinidis, N.V.; Grossmann, I.E.; Wassick, J.M. A deep reinforcement learning approach for chemical production scheduling. Comput. Chem. Eng. 2020, 141, 106982. [Google Scholar] [CrossRef]
  30. Gui, Y.; Tang, D.; Zhu, H.; Zhang, Y.; Zhang, Z. Dynamic scheduling for flexible job shop using a deep reinforcement learning approach. Comput. Ind. Eng. 2023, 180, 109255. [Google Scholar] [CrossRef]
  31. Che, G.; Zhang, Y.; Tang, L.; Zhao, S. A deep reinforcement learning based multi-objective optimization for the scheduling of oxygen production system in integrated iron and steel plants. Appl. Energy 2023, 345, 121332. [Google Scholar] [CrossRef]
  32. Lee, Y.H.; Lee, S. Deep reinforcement learning based scheduling within production plan in semiconductor fabrication. Expert Syst. Appl. 2022, 191, 116222. [Google Scholar] [CrossRef]
  33. Yang, F.; Yang, Y.; Ni, S.; Liu, S.; Xu, C.; Chen, D.; Zhang, Q. Single-track railway scheduling with a novel gridworld model and scalable deep reinforcement learning. Transp. Res. Part Emerg. Technol. 2023, 154, 104237. [Google Scholar] [CrossRef]
  34. Pan, L.; Cai, Q.; Fang, Z.; Tang, P.; Huang, L. A deep reinforcement learning framework for rebalancing dockless bike sharing systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Hilton, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1393–1400. [Google Scholar]
  35. Yan, Q.; Wang, H.; Wu, F. Digital twin-enabled dynamic scheduling with preventive maintenance using a double-layer Q-learning algorithm. Comput. Oper. Res. 2022, 144, 105823. [Google Scholar] [CrossRef]
  36. Chen, Y.; Liu, Y.; Xiahou, T. A deep reinforcement learning approach to dynamic loading strategy of repairable multistate systems. IEEE Trans. Reliab. 2021, 71, 484–499. [Google Scholar] [CrossRef]
  37. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  38. Zang, W.; Song, D. Energy-saving profile optimization for underwater glider sampling: The soft actor critic method. Measurement 2023, 217, 113008. [Google Scholar] [CrossRef]
  39. Hussain, A.; Bui, V.H.; Musilek, P. Local demand management of charging stations using vehicle-to-vehicle service: A welfare maximization-based soft actor-critic model. eTransportation 2023, 18, 100280. [Google Scholar] [CrossRef]
  40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  41. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
Figure 1. Schematic diagram of refinery crude oil scheduling scenario.
Figure 2. Framework diagram of the proposed VSCS algorithm.
Figure 3. Framework diagram of the low-dimensional feature generation module.
Figure 4. Learning curves of comparison methods. The solid lines show the means of 10 trials, and lighter shading shows standard errors.
Figure 5. Comparison of low-dimensional feature reconstruction performance in different dimensions. The solid lines show the means of 3 trials, and lighter shading shows standard errors.
Figure 6. Results of sample dimensionality reduction visualization: (a) 10-dimensional sample dimensionality reduction visualization, (b) 20-dimensional sample dimensionality reduction visualization, (c) 30-dimensional sample dimensionality reduction visualization, (d) 40-dimensional sample dimensionality reduction visualization, (e) 50-dimensional sample dimensionality reduction visualization.
Table 1. The scale and performance of traditional research methods in crude oil scheduling.
Technique | Scale | Performance
discrete-time MILP framework [16] | Four crude types, two CDUs, seven refinery tanks, and eight portside tanks; the time horizon of operation is one month, and a discretization interval of one day is used | in a few minutes
continuous and discrete temporal MILP [17] | Three CDUs, six storage tanks, and three oil pipelines; the time horizon of operation is one day, at every hour | in reasonable time
continuous-time MINLP [1] | One single docking berth, four storage tanks, four charging tanks, and two CDUs; the time horizon of operation is 15 days | 25.94 s
Many-objective optimization for scheduling of crude oil operations based on NSGA-III [26] | There are three distillers with nine charging tanks and a long-distance pipeline; every time, it needs to produce a 10-day schedule | about 100 s–150 s
MILP framework with rolling horizon strategy [23] | Eight tanks, where one tank is assumed in maintenance, five crude qualities; the time horizon is 31 or 61 days (periods) | less than 5 min
Table 2. Main experimental parameters.
Model | Number of Neurons | Number of Hidden Layers | Optimizer | Discount Factor | Learning Rate | Soft Update Coefficient | Batch Size | Entropy Threshold | Experience Buffer Size
Policy learning module | 512 | 5 | Adam [40] | 0.99 | 0.03 | 0.005 | 128 | 0.9 | 100,000
Low-dimensional feature generation module | 40 | 1 | Adam [40] | - | - | - | - | - | -
Table 3. Results of Comparison Methods.
Method | Iterations for Maximum Reward | Final Reward | Training Time to Steady State
SAC | 209 | −27,540,217 | 305
VSCS | 47 | −2,942,594 | 78
Improvement Rate (%) | 77.5 | 89.3 | 74.4
Table 4. The VSCS algorithm improvement rate analysis.
Feature Dimension | Iterations for Steady State | Convergence Speed Improvement Rate | Final Reward | Reward Improvement Rate
VSCS (10) | 148 | 29.19% | −4,040,694 | 85.33%
VSCS (15) | 215 | −2.87% | −1,980,772 | 92.81%
VSCS (20) | 158 | 24.40% | −2,143,448 | 92.22%
VSCS (25) | 134 | 35.89% | −3,493,724 | 87.31%
VSCS (30) | 147 | 29.67% | −1,940,762 | 92.95%
VSCS (35) | 170 | 18.66% | −2,942,594 | 89.32%
VSCS (40) | 47 | 77.51% | −2,991,348 | 89.14%
VSCS (45) | 87 | 58.37% | −1,941,364 | 92.95%
VSCS (50) | 105 | 49.76% | −4,876,383 | 82.29%
Table 5. Reconstruction distance analysis.
Dimensionality | 55 | 50 | 45 | 40 | 35 | 30 | 25 | 20 | 15 | 10
Arithmetic Mean | 12.66 | 12.72 | 12.66 | 12.61 | 12.67 | 12.47 | 12.52 | 12.55 | 12.53 | 12.54
Maximum | 146.93 | 149.13 | 141.87 | 145.71 | 149.14 | 146.29 | 141.77 | 136.13 | 134.52 | 139.56
Minimum | 0.23 | 0.27 | 0.33 | 0.44 | 0.38 | 0.56 | 0.61 | 1.17 | 1.36 | 0.59
Variance | 608.91 | 620.99 | 617.28 | 608.06 | 619.23 | 606.81 | 611.70 | 614.83 | 614.93 | 611.88
Standard Deviation | 24.67 | 24.92 | 24.85 | 24.66 | 24.88 | 24.63 | 24.73 | 24.80 | 24.80 | 24.74
Median | 4.19 | 4.17 | 4.04 | 4.06 | 3.99 | 3.88 | 3.90 | 3.79 | 3.73 | 3.86
Table 6. Quantitative analysis of visualization.
Feature Dimension | 50 | 40 | 30 | 20 | 10
Intracluster Cumulative Distance (5, 0.3) | 75.07 | 73.04 | 71.88 | 79.53 | 90.4
Intracluster Cumulative Distance (5, 0.15) | 66.32 | 69.14 | 65.7 | 72.67 | 83.5
Intracluster Cumulative Distance (10, 0.15) | 57.81 | 57.77 | 57.55 | 61.17 | 70.1
Intracluster Cumulative Distance (10, 0.10) | 56.38 | 55.4 | 55.14 | 59.02 | 67.3
Intracluster Cumulative Distance (10, 0.50) | 71.25 | 71.13 | 70.22 | 75.74 | 86.35
Average Intracluster Cumulative Distance | 65.37 | 65.30 | 64.10 | 69.63 | 79.53
Intracluster Density (10, 0.50) | 0.104 | 0.104 | 0.102 | 0.109 | 0.124
Intracluster Density (10, 0.15) | 0.083 | 0.082 | 0.082 | 0.086 | 0.101
Intracluster Density (5, 0.3) | 0.108 | 0.104 | 0.104 | 0.113 | 0.131
Intracluster Density (5, 0.15) | 0.094 | 0.099 | 0.093 | 0.105 | 0.124
Average Intracluster Density | 0.0984 | 0.0984 | 0.0968 | 0.1042 | 0.1208