1. Introduction
Control systems play a pivotal role in the process industry by ensuring the smooth operation of complex production processes. These systems typically consist of several key components, each serving a specific purpose. Monitors, for instance, provide real-time data visualization and process status information, giving operators critical insights into the ongoing processes. Another crucial component is the controller, which oversees and regulates individual devices and process variables, ensuring that they function optimally. Additionally, network switchers are instrumental in facilitating seamless communication and data exchange among the various components of the control system. The significance of control systems extends across various application domains, including the chemical industry [
1], oil and gas production [
2], and the power industry [
3]. Consequently, the security of control systems has gained increased and more extensive attention.
Existing work on security insurance of control systems has focused on protecting them from cyberattacks. Jarmo Alnen et al. [
4] introduced a cybersecurity risk analysis method that relies on the proposed ontology. This method provides a structured data repository aimed at bolstering the assessment process and establishing traceability between the generated artifacts. Angelo Corallo et al. [
5] delved into the cybersecurity challenges encountered in the realm of the Internet of Everything during the Industry 4.0 era. Certainly, the increased efforts in terms of cybersecurity to protect control systems are indeed noteworthy; however, the broad issue of security throughout the entire production process often remains inadequately addressed. For instance, in 2018, Tesla faced significant challenges with its highly automated production line for the Model 3. The company employed an extensive fleet of robots equipped with sensors to aid in component assembly. Unfortunately, a few robots experienced sensor failures or misalignment issues, leading to production delays. These disruptions stemming from sensor-related issues caused production bottlenecks, resulting in a slower output of Model 3 vehicles—a pivotal product for Tesla. Consequently, these setbacks led to delivery delays for customers and carried financial implications for the company. The reality is that sensor failures in industrial processes can result in substantial loss of both human life and property.
Therefore, a more comprehensive strategy should not only focus on strengthening control system defenses, but also prioritize the implementation of robust security measures throughout the entire process industry.
In order to aptly represent the comprehensive protection of critical infrastructure, we have coined the term ’dual-security’ to encompass both the security of cyberspace and the industrial process. Since a substantial portion of prior research has concentrated on cybersecurity, the challenge of simultaneously addressing dual-security risks within the process industry has remained an uncharted territory.
Recently, knowledge graphs (KGs) have emerged as a fundamental technique for incorporating knowledge from diverse fields. It provides a structured representation of knowledge, i.e. factual triples in the form of
, denoting the head and tail entities and the relationship between them [
6]. For instance, in a KG, we may encounter entities like “Albert Einstein” and “Theory of Relativity” connected by the relationship “developed”. These KGs are constructed by combining information from various sources, including structured databases, unstructured text, and linked open data [
7]. The structured nature of KGs ensures semantic consistency, enabling machines to not only retrieve information, but also understand the context and perform sophisticated reasoning tasks. KGs have found applications in a wide range of domains, from natural language processing [
8,
9,
10] and information retrieval to recommendation systems [
11,
12] and medical healthcare [
13,
14,
15].
Link prediction is a fundamental subfield within knowledge representation and reasoning, dedicated to the task of inferring missing facts and relationships within the graph. Addressing the link prediction challenges has led to the development of various distinct approaches, among which the primary category is embedding-based methods. This kind of method is designed to acquire distributed low-dimension vector representations, commonly referred to as embeddings, for entities and relations within the KG. Most representative approaches within this research paradigm encompass TransE [
16], DistMult [
17], ConvE [
18], TuckER [
19], etc.
Link prediction on knowledge graphs serves as an effective and transparent approach for revealing hidden knowledge within knowledge graphs. In this paper, we study the problem of protecting the process industry in terms of dual-security. Specifically, we collect knowledge from both security aspects and build a dual-security knowledge graph. Then, we propose to discover potential security risks through inference on the KG with an embedding-based link prediction method. Empirical experiments substantiate the efficacy of our method, as it achieves good performance on the real-world dataset and mines meaningful dual-security risks. In addition to conducting comprehensive dataset experiments, we also performed an analysis of the risk prediction capability of existing models using experiments conducted on crack hydrogenation industrial processes. Our analysis showed that the employed approach is more accurate in predicting the potential risks of unitary industrial processes compared to other methods.
This paper contributes to two main aspects. Firstly, it innovatively integrates cybersecurity and industrial process security, extracting knowledge from both dimensions, into a knowledge graph. Secondly, it employs a relatively efficient method for link prediction, enabling it to mine undiscovered knowledge from the knowledge graph and effectively mitigate potential risks in industrial processes.
2. Related Work
Cybersecurity knowledge graphs (CSKG) [
20] represent a specialized category of knowledge graphs tailored for the cybersecurity domain. They comprise entities and relations extracted from a myriad of attack and defense scenarios within the cybersecurity landscape [
21,
22]. Furthermore, abstract concepts such as ‘attacker’, ‘attack pattern’, and ‘vulnerability’ can be seamlessly represented as entities, while the paths of attacks and other connections are aptly captured as relations. CSKG adeptly employs knowledge graph construction techniques to extract and effectively integrate pre-existing knowledge from various security data sources, ensuring a comprehensive and unified understanding of the cybersecurity landscape. The link prediction of CSKG has three major applications in cybersecurity, namely the attack prediction [
23], threat hunting [
24] and intrusion detection [
25,
26,
27]. However, there is an inadequate focus on process safety.
The application of knowledge graphs in the industry can be partitioned into two distinct phases, namely the construction and deduction periods, as indicated by Li et al. [
28]. In the construction phase, the primary objective revolves around the integration of multiple text mining and machine learning tools to process raw data, yielding triples of the form
for the knowledge graph. As a result, Natural Language Processing (NLP) techniques [
29,
30,
31] and associated toolkits are frequently harnessed to automatically extract entities from various unstructured knowledge resources. Notably, the utilization of knowledge graphs in industry transcends the conventional dissemination of existing knowledge items; instead, it caters to the elevated demands for synthesis and innovation within the industrial domain. Consequently, the deduction phase of industrial knowledge graphs strives to meet these requirements. Knowledge deduction can be further subcategorized into attribution prediction [
32] and link prediction [
33,
34,
35]. More specifically, the deduction process can be effectively reformulated into a series of matrix manipulations [
36,
37,
38]. This transformation is achieved by vectorizing the entities and relations, aligning them according to their semantic and topological features within the KG. Such an approach not only enhances the computational efficiency, but also leverages the inherent structural properties of the KG for knowledge deduction. Nonetheless, the majority of the domain-specific knowledge graphs primarily focus on industrial products and services, and they often fall short in addressing the crucial aspect of industrial security.
Our work is related to the task of link prediction in dual-security KG; previous efforts in link prediction can be divided into two main aspects, embedding-based prediction and rule-based prediction. Former approaches focused on learning low-dimension representation for entities and relations. Related methods include TransE [
16], TransH [
39], ComplEx [
40], etc., and detect facts by projecting entities and relations into a semantic space and conduct algebraic operations on the space. Specifically, TransE [
16] represents the triples into d-dimensional space,
and generates embeddings following the translational principle
. TransH [
39] assigns a hyperplane for each relation to satisfy the N-to-N relationship scenarios. ComplEx [
40] firstly uses complex vector space to generate embeddings. Through the space, we can capture both symmetric and antisymmetric relations,
,
h can be formulated as
where
and
are real and imaginary parts of
h, respectively. On the other hand, the latter(rule-based) approaches do not encounter these issues, including NeuralLP [
41], DRUM [
42]. They leverage first-order logic and rule-based approaches to infer missing relationships and facts. NeuralLP [
41] combines neural networks with logical rules and utilizes a differentiable framework to learn the rules from data, allowing for the incorporation of prior domain knowledge. The key idea behind NeuralLP is to bridge the gap between symbolic reasoning and neural network learning, making it an attractive approach for rule-based knowledge graph completion. DRUM [
42] integrates rules into an end-to-end differentiable neural network. This approach facilitates rule-based reasoning while benefiting from the learning capabilities of neural networks and extends the knowledge graph with rule-instantiated triples, enabling the incorporation of logical rules in the knowledge graph completion process.
3. Methodology
In this section, we will commence by presenting essential background knowledge, encompassing key concepts such as knowledge graphs, link prediction, and the fundamental steps involved in embedding-based methods for link prediction. Subsequently, we will delve into an extensive exposition of the best-performing model, namely PairRE.
3.1. Preliminaries
Knowledge graph is a structured representation of knowledge, where and denote the set of entities and relations, respectively. It functions as a semantic network that organizes information into triples, denoted as , where and . These triples take the form of , presenting a graph-like structure.
Link prediction task refers to a key task in knowledge graph completion. It involves predicting missing triples by identifying the most plausible tail entity from the set of entities for a given incomplete triple . During inference, when provided with h and r as inputs, the model computes a ranked list of all entities, with higher scores indicating higher rankings. In essence, link prediction is a supervised learning technique that aims to minimize the dissimilarity between the predicted entity and the ground truth provided in the available data.
Embedding-based methods for link prediction generally consist of three key steps. In the first step, we initialize embeddings for all entities and relations. To achieve this, we establish an index table, wherein the indexes are associated with entity and relation IDs, and the table’s entries comprise stochastic vectors that represent the embeddings. The second step involves the formulation of an objective function designed to assess the score of a triple within a low-dimensional space. This objective function can be realized through various methods, such as TransE, ComplEX, TuckER, and others. The final step is training and inference. During training, the model is trained, often using negative samples to ensure its robustness and accuracy. Negative samples are essentially fictitious triples constructed at random. These fictive triples share the same h and r with the factual triple, but the t is substituted with a different entity. We anticipate that the scores of true triples will be higher than those of the negative samples. Adhering to this principle, the model iteratively optimizes the embedding tables until convergence is achieved. During inference, the process typically entails taking the embeddings of h and r as inputs, followed by the computation of the score for across all potential entities . Higher scores indicate a higher probability of a connection between entities h and through the relation r. This approach enables effective predictions and assessments.
3.2. The Link Prediction Model
We now advocate the utilization of PairRE [
43] for augmenting the dual-security knowledge graph. This choice is motivated by two primary factors: Firstly, the data in the graph encompass diverse complex relations, including N-to-1, 1-to-N, and N-to-N relationships. Secondly, the data exhibit various relation patterns, including symmetry and antisymmetry. These intricacies pose a unique challenge that other existing methods are ill-equipped to address simultaneously. However, PairRE stands out by employing paired embedding vectors for each relation representation, allowing for adaptable margin adjustments in the loss function. This adaptability is pivotal in effectively accommodating complex relations and addressing the challenges presented by the dual-security knowledge graph.
PairRE adopts a unique approach by acquiring paired representations for each relation. For a given training triple
, the model learns vector embeddings for entities and relations within real space. Specifically, it treats relation embedding as a pair of vectors, denoted as
. These paired vectors are responsible for projecting the head entity
h and the tail entity
t into the Euclidean space, with the projection operation being an element-wise product between the two vectors. Furthermore, the method calculates the distance between these two embeddings to gauge the plausibility of the triple. The desired outcome is that
when the triple
holds true. According to the calculated measure, the scoring function is defined as follows:
where
with
h and
t subject to specific constraints, such as
.
To further optimize the model, we employ a contrastive learning measure, specifically adopting the self-adversarial negative sampling loss [
44] as the training objective:
In this equation,
represents a fixed margin, and
denotes the sigmoid function.
corresponds to the
ith negative triple, while
signifies the weight assigned to this negative sample. The weight,
, is calculated using the softmax function:
An illustration depicting the structure of this paper is presented in
Figure 1. First, we extract entities and relations from the extensive documentation of industrial processes through manual annotation. Next, we employ this information to construct a dual-security knowledge graph, incorporating data from a vulnerability database. Finally, we utilize the PairRE method for the link prediction task, enabling the prediction of potential risks in the dual-security domain.
4. Experiments
This section is organized into three subheadings. Firstly, we provide a detailed overview of our experimental setup, which includes a description of the dataset, evaluation metrics, and other relevant details. Secondly, we present a concise and accurate analysis of the experimental results. Finally, we draw insightful interpretations from these results, offering our conclusions based on the experiments.
4.1. Experiment Setting
Datasets, we culminated in the construction of our knowledge graph, which is founded on real-world resources and encompasses impressive 37,000 triples. We can find detailed statistics about the dataset in
Table 1. Additionally, to showcase the model’s proficiency in predicting potential risks within specific industrial processes, such as crack hydrogenation, we generated a focused sub-KG. This sub-KG encapsulates the knowledge related to the crack hydrogenation process alongside all relevant cybersecurity information. Refer to
Table 2 for detailed statistics about this sub-KG. To optimize the data for our link prediction task, we meticulously partitioned it into three distinct subsets, ensuring a balanced division. Specifically, we allocated the data into a test set, a training set, and a validation set, maintaining a proportional split ratio of 1:8:1.
Evaluation metric, in the case of each query of the form or , we calculate a score for every entity while simultaneously assessing the rank of the correct answer. To evaluate the performance of our models, we employ two widely-used metrics, namely the Mean Reciprocal Rank (MRR) and Hit@k. MRR is computed as the mean of the reciprocal ranks for the answer entities. Hit@k, on the other hand, measures the percentage of the desired entities correctly ranked within the top k positions. Furthermore, it is important to note that these metrics exclude the scores of all known true triples in the training, validation, and testing sets. This exclusion allows for a more robust evaluation of our models, providing valuable insights into their effectiveness.
Comparison of models, in our experiments, we applied our dataset to various models, primarily focusing on comparing their performance. Given the substantial scale of our dataset, we employed embedding-based methods to conduct our experiments. These methods employ various embedding techniques to represent entities and relations in low-dimensional spaces, which include methods such as TransE [
16], TransR [
45], TransH [
39], HAKE [
46], PairRE [
43]. We finally compared their performance with each other.
Hyperparameters, we maintained consistent hyperparameter settings during training. These settings entailed a dimension of embedding vectors set to 512 and a maximum of 1000 training epochs. Concurrently, other methods like TuckER [
19], which decomposes a tensor into a core tensor with three-factor matrices, and ConvE [
18], founded on the graph neural networks framework, we carefully selected the most suitable hyperparameters through multiple experiments. In this scenario, we set the number of graph convolutional network layers to 1, fixed the dimension of embedding vectors to 200, and established the maximum number of training epochs as 2000, among other key parameters.
4.2. Results on Link Prediction
All methods were evaluated across evaluation metrics on the dual-security dataset, and the results are presented in
Table 3. Notably, these evaluations yield clear observations on the performance of these methods. It is apparent that the graph convolutional network method (ConvE) excels and outperforms some of the traditional embedding-based methods, including TransE, TransR, TransH, and TruckER. This observed superiority can be attributed to the incorporation of graph structures, allowing these models to harness a more comprehensive understanding of the training data—a critical aspect for effective link prediction tasks. However, it is worth mentioning that recently proposed methods, such as HAKE and PairRE, manage to outperform even the graph convolutional network methods. This superior performance may be attributed to the adoption of more intricate embedding techniques and measures, underscoring the benefits of sophisticated embedding approaches in addressing dual-security link prediction challenges.
While PairRE demonstrates outstanding overall performance on the dual-security dataset, achieving a commendable level of accuracy, its direct hit rate remains relatively modest at just . This observation prompts a vital question regarding the ability of existing models to effectively absorb and adapt to the security data within the dual-security dataset. The dual nature of the KG, incorporating both cybersecurity and industrial process security knowledge, results in a semantic space that is notably vast, rendering it challenging to learn comprehensively. It is imperative to acknowledge that this dataset inherently represents the confluence of two distinct dimensions. Therefore, we deduce that the existing models’ deficiencies arise from their failure to adequately learn the knowledge encompassed by both aspects in the dual-KG.
The link prediction results for the sub-KG are presented in
Table 4. Notably, PairRE demonstrates efficient prediction of potential risks, with over half of the test data producing accurate predictions. In addition,
of the test data achieves a top-three ranking, and nearly all test data predictions secure a spot within the top ten. Consequently, focusing on the crack hydrogenation process, obtaining the top three predicted results for all devices would thwart three quarters of potential attacks. Moreover, achieving the top ten predicted results would provide comprehensive protection for almost all devices.
After comparing the sub-KG performance with the whole dataset, we found that PairRE is excellent at predicting potential risks in a single scenario. This approach enables full absorption of knowledge within the sub-KG. So, we could generate suitable sub-KG for various specific industrial processes that can be safeguarded against potential security risks by applying this approach.
4.3. Mined Knowledge
By utilizing the best-performing model, PairRE, we have achieved successful predictions of various cybersecurity attacks and industrial process anomalies, as evidenced in
Table 5. This approach allows us not only to identify diverse cyber virus attack patterns, but also to access various methods for mitigating cyber risks in the field of cybersecurity. Additionally, we can extend our capabilities to complete the industrial process knowledge graph, enabling us to uncover potential risks within this domain.
To demonstrate the risk prediction ability of our approach, we generate a sub-KG representing the crack hydrogenation process and all cybersecurity knowledge. Furthermore, we present the mined knowledge regarding the crack hydrogenation process and Industroyer attacks, as depicted in
Figure 2 and summarized in
Table 6. The approach successfully identified vulnerable devices, specifically
D-2106,
D-2104, and
K-2102
A/
B, under Industroyer attacks. These devices are classified as power supply equipment. Given that Industroyer primarily targets power systems, our approach can precisely identify potentially vulnerable devices. In addition, Industroyer primarily employs remote system discovery to detect a list of devices. Hence, remote system discovery, which includes malicious emails, the office network system, and external O&M terminals, has been predicted for all devices. Subsequently, it identifies target devices and executes various attacks. Depending on the function of different devices, the focus of Industroyer attacks varies. Therefore, the potential risks predicted by the approach differ, as shown in
Table 6. However, these identified risks strongly correlate with the devices’ functionality. For example, the
K-2102
A/
B serves as the backbone node of the power network. Consequently, attacks against it primarily concentrate on the control aspect, encompassing actions like blocking serial COM, stopping services, and obstructing command messages. In summary, the approach excels in not only identifying vulnerable devices in the industrial process, but also mining potential risks based on the functions of the devices and the adversaries’ targets.
5. Conclusions
In this paper, we begin with constructing a dual-security knowledge graph, which innovatively integrates industrial process security information, a domain often overlooked by researchers, into the cybersecurity knowledge graph, a domain that has garnered widespread attention. Subsequently, we employ an embedding-based method (RairRE) capable of learning the underlying semantics in the knowledge graph, allowing us to predict the occurrence of novel cybersecurity attack patterns or potential security issues in the industrial process. The experimental results suggest that this method can effectively mine potential risks in some industrial process. However, the overall potential risk prediction accuracy on the dual-KG is not very satisfactory, as existing models struggle to comprehensively grasp the knowledge embedded in both aspects of the whole dataset.
In light of these insights, our proposed model currently used has the potential for improved performance in fully tapping into the rich, interrelated data contained within the knowledge graph.
Our future endeavors will be dedicated to the development of a more tailored dual-security mining model, specifically designed to extract and integrate all knowledge with greater precision from both the realms of cyber and industrial process security. The ultimate goal is to enhance the accuracy of predicting potential risks in industrial processes.