Next Article in Journal
Model Averaging for Accelerated Failure Time Models with Missing Censoring Indicators
Next Article in Special Issue
Intrusion Detection Based on Adaptive Sample Distribution Dual-Experience Replay Reinforcement Learning
Previous Article in Journal
A New Chaos-Based Image Encryption Algorithm Based on Discrete Fourier Transform and Improved Joseph Traversal
Previous Article in Special Issue
Privacy-Enhanced Federated Learning for Non-IID Data
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

AGCN-Domain: Detecting Malicious Domains with Graph Convolutional Network and Attention Mechanism

1
Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
2
Big Data Center of State Grid Corporation of China, Xicheng District, Beijing 100052, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(5), 640; https://doi.org/10.3390/math12050640
Submission received: 3 February 2024 / Revised: 16 February 2024 / Accepted: 19 February 2024 / Published: 22 February 2024
(This article belongs to the Special Issue Advanced Research on Information System Security and Privacy)

Abstract

:
Domain Name System (DNS) plays an infrastructure role in providing the directory service for mapping domains to IPs on the Internet. Considering the foundation and openness of DNS, it is not surprising that adversaries register massive domains to enable multiple malicious activities, such as spam, command and control (C&C), malware distribution, click fraud, etc. Therefore, detecting malicious domains is a significant topic in security research. Although a substantial quantity of research has been conducted, previous work has failed to fuse multiple relationship features to uncover the deep underlying relationships between domains, thus largely limiting their level of performance. In this paper, we proposed AGCN-Domain to detect malicious domains by combining various relations. The core concept behind our work is to analyze relations between domains according to their behaviors in multiple perspectives and fuse them intelligently. The AGCN-Domain model utilizes three relationships (client relation, resolution relation, and cname relation) to construct three relationship feature graphs to extract features and intelligently fuse the features extracted from the graphs through an attention mechanism. After the relationship features are extracted from the domain names, they are put into the trained classifier to be processed. Through our experiments, we have demonstrated the performance of our proposed AGCN-Domain model. With 10% initialized labels in the dataset, our AGCN-Domain model achieved an accuracy of 94.27% and the F1 score of 87.93%, significantly outperforming other methods in the comparative experiments.

1. Introduction

The Domain Name Resolution System (DNS) is one of the key infrastructures of today’s Internet, which provides domain name resolution services for Internet users and enables the mapping of domain names to IPs. However, due to the openness of DNS services, cyber attackers also rely on DNS services to carry out attacks by registering a large number of domain names. These malicious domains can lead to web pages that the attacker has carefully designed to conduct crimes such as stealing accounts, fraud, and infecting computers with viruses (ransomware, Trojans, etc.). These types of web pages are almost identical to normal web pages, thus reducing the vigilance of ordinary users and providing attackers with opportunity. In addition, some malicious software, such as Trojans and backdoor viruses, connect to C&C servers through domain names (C&C servers act as bridges for communication between attackers and infected hosts. Attackers send instructions to compromised hosts through C&C servers, enabling lateral movement to infect other internal hosts or steal sensitive data). These forms of malicious software use fast-flux technology to evade detection, making it increasingly difficult to detect malicious domains. Therefore, the detection of malicious domains has attracted the attention of many researchers.
Researchers have proposed a large number of works to discover malicious domains. Traditional approaches can be classified as feature-based methods [1,2,3,4,5,6,7,8,9,10,11,12]. They usually extract features provided by security experts for each domain from DNS data [1,2,3,4,5,6,7]. Typical features include lexical features, whois features, resolution features, etc. They recognize malicious domains with trained classifiers. However, such methods can be evaded easily since most of them are based on features of an individual domain. In addition, they are limited in expert observation and knowledge with manual prejudice.
In the face of these problems, some researchers have turned to the use of relational features between domain names for malicious domain name detection. They built various graphs to describe relations like client relations [13,14], resolution relations [15], whois relations [16], etc. And then they discovered malicious domains based on the intuition that domains strongly related to known malicious domains are more likely to be malicious. However, in some studies they focused on only one type of relationship between domains, which left a great deal of information unexplored and failed to reveal the richer potential relationships between domains, which allowed experienced attackers to evade detection by this type of system. Some studies utilized multiple relationships [17,18,19] to calculate domain name similarity through relationships [17] or to discover domain names strongly associated with malicious domains through multiple mete-paths [18]. But these works ignored the fine-grained distinctions between relationship features and ceased to address the different effects of different relationship features on domain name judgements with implications for the accuracy of the system.
To address the problems with the above two main ways of detecting malicious domain names, it is necessary to develop a more comprehensive malicious domain name detection system. Graph Convolutional Neural Network is a good solution in solving graph-based correlation problems, and it also has good performance in collaborative classification, machine account detection, etc.
In this paper, we made a basic observation about the relational characteristics of domain names that domain names with strong associations are more likely to belong to the same category [15]. This is because the attacker’s attack resources are limited due to cost, making the adversary reuse their attack resources, leading to a strong correlation of malicious domains in the relationship [18]. Based on the above basic observation, we propose AGCN-Domain, a system that detects malicious domains by leveraging various relations and balancing their influence with the attention mechanism to solve the above limitation. In the AGCN-Domain model, firstly, we used three homogeneous subgraphs: client relation graph, resolution relation graph, and cname relation graph to represent the relationship features between domains and pre-process the subgraphs to optimize the subgraph structure. Graph convolution neural networks were implemented to aggregate features in each of the three subgraphs. Second, to balance the impact of different relationship features on domain features, we introduced an attention mechanism to make the detection results more accurate. We utilized a soft attention mechanism to intelligently assign weight coefficients to the relational features of the domain converged on the three subgraphs, thus obtaining a comprehensive relational feature vector of the domain. Finally, a fully connected layer was used to output the malicious/benign probability for the domain.
The contributions can be summarized as follows:
  • We proposed AGCN-Domain, a system for detecting malicious domain names using multiple relationship patterns: Client relation, resolution relation, and cname relation, which could extract and fuse the relationship features of domain names. The malicious domain names detection problem was transformed into a node binary prediction problem.
  • We designed a mechanism to detect malicious domains with high accuracy. We introduced a graph convolutional neural network, which performed well in graph correlation tasks to detect malicious domains. We integrated the graph convolutional neural network with an attention mechanism to intelligently blend the effects of different relationship features on the classification results of domains for different types of domains.
  • We made a comprehensive evaluation of our work with real-world data collected from an educational network. The results demonstrate that it has good performance in detecting malicious domains.
Organization. The rest of this paper is organized as follows. Section 2 reviews related work and its limitations. Section 3 illustrates preliminaries of our system, and Section 4 introduces the system structure and major components. Then, we make a comprehensive evaluation in Section 5. Finally, Section 6 concludes this paper.

2. Related Work

With many important breakthroughs in the application of deep neural network models to graphs, graph-based methods have attracted more and more attention from scholars in many research areas. For example, to better analyze large-volume and fast-changing streaming graphs, Gu and Jiang, et al., proposed a method to efficiently find the graph stream summary, PGSS-MDC [20], to address the problem of providing highly accurate and fast answers to graph queries within a certain time interval. Meanwhile, the use of graph-based methods to detect network intrusions have also become an important research direction; for example, to to detect the existence of multi-step attacks in a large number of security alarms in smart city application scenarios, Jia, et al., proposed the ACAM network security analysis framework [21], which is a network security analysis framework based on the Knowledge Representation Model (Multi-dimensional Data Association and Intelligent Analysis Model [22]). The model constructs a large-scale multidimensional association knowledge graph (MDATA graph) through existing security analysis reports and databases, then converts the detected network attack rules into MDATA subgraphs and generates real-time alarm MDATA graphs by extracting information from each alarm in the alarm flow, and finally detects multi-step attacks through a subgraph matching algorithm.
Detecting malicious domain names is one of the important research directions in network security, and researchers have proposed many approaches to recognize malicious domains. These works can be divided into two categories: feature-based methods and relation-based methods.

2.1. Feature-Based Methods

The typical approach taken by this research is to observe the characteristics of domains from multiple perspectives and extract related features according to security experts’ observations and knowledge. Then, a classifier, usually a supervised one, is used to recognize malicious domains with these features.
Earlier works usually used manual features from many angles. Antonakakis, et al. [1] proposed a system named Notos for dynamically assigning reputation scores to unknown domains, which extracts relevant statistical features to calculate the reputation scores of unknown domains, including Network-based features(The number of IPs associated with domain x, the number of autonomous systems in which the IPs to which domain x refers, etc.), Zone-based features(Average length of domains associated with domain x, number of different TLDs, etc.) and Evidence-based features(The number of malware linked to domain x, etc.) Similar to previous works, Bilge et al. [2,3] presented a system named Exposure to identify domains that are involved in malicious network campaigns like phishing and malware spam, etc. They extracted 15 time-based, answer-based, TTL-based, and name-based features from passive DNS data and classified malicious domains with the decision tree algorithm. Based on the work of Antonakakis, et al. [4], Chinba, et al. [5] proposed the concept of “temporal change patterns” (TVPs) from the perspective of domain name behavior and proposed the DomainProfiler model. The “time-varying model” defines the changes in the black/white list of a domain name in a certain time window as four pattern features (Null, Fall, Rise, Stable), which are combined with the Network-based features and Zone-based features proposed in the Notos system to form feature vector of the target domain name. Finally, the decision tree algorithm was used to predict the classification of the target domain name.
Most of the aforementioned malicious domain detection systems are passive in detection, and abnormal domains are only detected after malicious activity has begun, therefore delaying the detection of anomalies. By the time the malicious domain is detected, it has already conducted malicious activities for a long time. To solve this problem, Hao et al. [6] proposed the PREDATOR active reputation system, which identifies malicious domains at registration. In addition to utilising basic domain information features, the PREDATOR system also incorporates domain registration history features and Batch Correlation features, verifying the registering agent for re-registered domains, the dormant time of re-registered domains, the different lifecycle ratios of domains registered in the same batch, and the possibility of registering a large number of domains in batches. Finally, the Convex Polytope Machine supervised learning algorithm was used for classification.
In addition to using basic information about the domain name and the IPs associated with the domain name as features, there are several studies that have focused on detecting the lexical distribution of algorithm-generated domains. They generally analyzed domain strings and extracted lexical features from multiple perspectives such as structure, linguistics, pronunciation, and statistic [7,8,9] to detect DGA domains. Recently, some researchers have leveraged a deep learning model to extract lexical features automatically. Woodbridge, et al. [10] proposed a malicious domain detection system with LSTM (long short-term memory) network. The system takes domain name string sequence as input and extracts character features with the LSTM model intelligently. On the basis of this, Tran, et al. [11] presented the LSTM-MI model, which combines both binary and multiclass classification models to deal with the multiclass imbalance problem in LSTM. Xu, et al. [12] combined an N-gram and a convolutional neural network to extract lexical features. The model they proposed takes the domain itself as input and automatically estimates whether the domain is generated by DGA malware.

2.2. Graph-Based Methods

Approaches in this category usually leverage graphs to describe diverse relations among domains and then infer malicious domains with labeled nodes and relationships.
Early research discovered malicious domains with only one relation. Manadhata, et al. [14] depicted client relations among domains with a client-domain bipartite graph. Then, malicious domains were detected using the belief propagation algorithm. Khalil, et al. [15] came up with resolution relations between domains with the domain–domain graph, and then calculated the malicious score for each unknown node according to their relations with known malicious domains. Building on the work of Khalil, et al. [15], Lee, et al. [23] analyzed the sequential correlation of domains and further generated a domain–travel graph to detect malicious domains. Unlike the above work focusing on the resolution relationships between domain names, Peng, et al. [24] focused on cname records. They constructed a domain–domain graph based on their cname relations and inferred whether an unknown node was malicious or not with the belief propagation algorithm.
As research progressed, researchers found that combining multiple relationships revealed more underlying semantics and deeper relationships, making the detection of malicious domain names more effective. Zou, et al. [19] constructed a client–domain–IP graph and leveraged a propagation algorithm to discover malicious nodes. Najafi, et al., instead proposed the MalRank system [25], which uses information collected by an enterprise’s internal SIEM, derived from, for example, proxy, DNS, and DNCP logs, to construct a knowledge graph that identifies malicious entities by calculating and iterating a malicious score over the entities in the knowledge graph. Lei, et al. [17] built three domain relation graphs and leveraged the graph embedding technique to obtain features. For Sun, et al., to derive more information and discover deeper semantics, a DNS scenario was recreated by constructing a heterogeneous information network, and six meta-paths were used to combine the heterogeneous information network to identify malicious domain nodes [16,18].

2.3. Discussion

Feature-based works have mainly depended on hand-crafted features that are manually extracted by experts’ observation. Researchers have collected DNS data (e.g., DNS traffic, DNS logs, passive DNS information, etc.) and other supplemental data (e.g., whois data, geographic information for IP, etc.) and analyzed malicious network activities. Summarized features can effectively distinguish malicious domains. However, these works are inevitably limited by experts’ observation and knowledge. In addition, they can be evaded easily by attackers since features are usually for one individual domain. For instance, nearly all lexical features may lose efficacy if attackers use new technology like GAN to avoid detection [26,27,28]. Graph-based works leverage multiple relations among domains based on the basic observation that attackers reuse attack resources when limited. Recent works have tended to combine multiple relations to obtain a more accurate result. However, these works treat various relations equally and ignore their different influences. For example, resolution relations are usually more important than client resolution in malicious domain detection.
In this paper, we solved the above problems with a graph convolution network (GCN) and attention mechanism. Specifically, we derived domain features intelligently with the GCN model to avoid manual prejudice and combined multiple relations with the attention mechanism to balance the influence of different relations. Our work can generate a more comprehensive analysis using this method.

3. Preliminaries

In this section, we first introduce the relations we used in this work and the reason why they are effective in detecting malicious domains. Then, we summarize the techniques used in our work.

3.1. Domain Relations

We investigated previous works detecting malicious domains utilizing domain relations and summarized three typical relations that are widely used: client relation, resolution relation, and cname relation.
Client Relation [13,14]. The client relation indicates that a domain queried by a client that has requested malicious domains is more likely to be malicious. There are two basic observations of the client relation. (1) Compromised clients are unlikely to query only one malicious domain since they commonly query a set of domains to achieve one malicious activity, and they are usually involved in more than one attack. (2) Malicious domains have no reason to be accessed by benign clients since they are dedicated to malicious services. As shown in Figure 1, domain d6 can be inferred to be malicious because it is requested by suspicious clients c1, c3, c5 and clearly far away from the benign network: no benign clients have queried it.
Resolution Relation [15]. The resolution relation means that domains with the same resolutions (hosted on the same IP) tend to have the same labels. This relation is strongly based on the rarity and expensiveness of malicious servers (IPs) where attackers can host domains. In fact, attackers widely reuse resources (especially malicious IPs) in malicious activities, such as domain-flux and fast-flux. As shown in Figure 2, domain d5 is considered to be malicious, because it is hosted on suspicious servers IP1, IP2, which host three known malicious domains d1, d2, d3 and no benign domains.
Cname Relation [24]. The cname relation in this paper indicates that domains that appear in one cname record have a homophilic state. In the DNS system, the cname record is used to set the alias name of domains. We illustrate an example of a cname record in Figure 3. When setting a cname record, the domain and its alias name finally point to the same destination: the client first queries www.example.com and obtains its alias name www.exmaple-alias.com, then the client queries www.exmaple-alias.com and obtains the final destination. It is obvious that domains connected to malicious domains through a cname relation are more likely to be malicious since they map to the equal resource.

3.2. Graph Convolutional Network (GCN)

Nowadays, graph convolutional networks (GCN) have yielded unusually brilliant results in multiple areas, such as document classification, recommender systems, traffic prediction, etc. GCN was first proposed by Kipf and Welling [29] in 2017. The GCN model takes both node attributes and graph structures as input and combines them to obtain the deep semantics of nodes. The propagation rule of multi-layer GCN can be expressed as:
H ( l + 1 ) = σ ( D ˜ 1 2 ( A + I ) D ˜ 1 2 H ( l ) W ( l ) )
where A is the adjacency matrix of graph, σ is a nonlinear active function, W ( l ) is the trainable weight matrix of lth layer, and H ( l ) is the activations matrix in lth hidden layer.

4. Method

4.1. Overview

As shown in Figure 4, AGCN-Domain is an intelligent malicious domain detection system with three main components: data preprocessor, relation graph constructor, and attention-GCN classifier. First, in the data preprocessor, we extracted four basic fields from raw DNS logs: domain, IP, client, and timestamp to represent domain behaviors in subsequent steps. Then, in the relation graph constructor, we constructed three types of relation graph of each domain, respectively: client relation graph, resolution relation graph and cname relation graph. Among them, client relation graph and resolution relation graph are mapped from two bipartite graphs while cname relation graph is generated from type cname records. Finally, we designed an attention-GCN classifier to mine deep features of domains by fusing information from multiple relation graphs to recognize malicious domains.
It should be noted that AGCN-Domain is a universal framework that is not limited to the three types of relations mentioned in this paper and can be extended to far more relations among domains.

4.2. Data Structure

The DNS record dataset we collected includes the following fields: timestamp string, source IP, domain name server IP, protocol name, requested domain name, request record type, answers, and TTLs rejected.

4.3. Data Preprocessor

The main purpose of this step is to formalize dirty raw data and improve system efficacy. We took DNS logs as input and focused only on cname and A records. We filtered out records that meet filtering rules and extracted four fundamental fields (requested domain name, domain-resolve-IP, source IP, timestamp string) from the remaining data. Specific filtering rules are as follows:
  • Corrupted records. There are some corrupted records from transmission errors in collected raw data, such as an incomplete record missing some fields.
  • Irregular domains. There are some irregular domains in the original data, which can be divided into two categories. One is that domains do not comply with domain naming rules, which is probably due to mistyping or misconfiguration, for example, containing commas in strings like youtube,com. The other is that the domains whose TLD (Top Level Domain) are not registered in IANA, which means that they are invalid Internet domains.

4.4. Relation Graph Constructor

In this step, we constructed three relation graphs, client relation graph, resolution relation graph, and cname relation graph, to describe relationships among domains.

4.4.1. Client Relation Graph

Client relation actually indicates that two domains share the same clients within a time window. In order to model such a relation, we first built a bipartite graph among clients and domains, and further generated the client relation graph.
The client–query–domain bipartite graph can be expressed as G c d = ( C , D t , E ) , where C and D t represent all clients in the network and all domains (with timestamps) they queried, respectively, E is a set of edges between domain and client. For instance, if client c i requests domain d j at time t, then an edge < c i , d j ( t ) > E exists to describe this behavior in the client–query–domain graph.
For depicting client relation among domains, we transformed the above client–query–domain graph into a client–relation graph G c r = ( D , E , W ) , where D and E are the set of domains and edges, and W represents weights of edges. If domain d i and d j share at least one client within one time window, there is an edge < d i , d j > E . The weight of < d i , d j > can be calculated based on Jaccard Similarity:
W ( d i , d j ) = | t = 1 n ( C i t C j t ) | | C i C j |
In the above formula, C i t and C j t are sets of clients that have queried domain d i and d j within time window t, respectively. C i and C j are sets of clients that have queried domain d i and d j in the network data, respectively.

4.4.2. Resolution Relation Graph

Resolution relation here indicates that two domains share the same IPs in the whole network data. To model such a relation, we first built a bipartite graph among domains and resolved IPs and further generated the resolution relation.
The domain–resolve–IP bipartite graph can be expressed as G d i = ( D , P , E ) , where D and P represent all domains in the network and all IPs where they once hosted respectively, and E is a set of edges between domain and IP. For instance, if domain d i once hosts on IP P j , then there is an edge < d i , p j ) > E to describe this behavior in domain–resolve–IP graph.
For depicting resolution relation among domains, we transformed the above domain-resolve-IP graph into a resolution relation graph G r r = ( D , E , W ) , where D and E are the set of domains and edges, W represents weights of edges. If domain d i and d j share at least one resolved IP, then there is an edge < d i , d j > E . The weight of < d i , d j > can be calculated based on Jaccard Similarity as follows:
W ( d i , d j ) = | P i P j | | P i P j |
In the above formula, P i and P j are sets of IPs hosting domain d i and d j , respectively.

4.4.3. Cname Relation Graph

Cname relation represents that domains belong to one cname record. Unlike the above two relation graphs, the Cname Relation Graph can be generated from raw DNS data directly.
The cname relation graph G n r = ( D , E ) is an unweighted graph, where D and E are the set of domains and edges. Edge < d i , d j > represents that domain d i and d j were once shown in one cname record.

4.4.4. Graph Pruner

Since local network data are noisy with massive domains, IPs, and clients, we deleted some nodes that cannot provide useful information from client–query–domain graph, domain–resolve–ip graph, and cname relation graph to increase AGCN-Domain’s performance and efficacy. We investigated former works [13,18] and set pruning rules as follows:
  • Popular domains. The basic intuition is that domains that have been queried by more clients are more likely to be legitimate. The typical example is that famous domains, such as google.com, can be queried by nearly all clients in the monitored local network. Processing such famous popular domains will take a lot of resources; thus, we pruned them to increase the efficacy of the system. A popular domain was defined as requested by more than 25% of clients.
  • Hyperactive clients. In our data, there are some very active clients that can query domains even 1,000,000 times one day. We analyzed them and found that they are proxies or forwarders: there may be hundreds of clients behind source IP. Such clients cannot provide valid client relation for domains, so we deleted them. We set the top 0.1% clients as hyperactive clients and removed them.
  • Inactive clients. There are some clients that query only a few domains. Such clients also cannot offer much information; thus, we set a threshold of N i c and removed clients querying fewer domains than this. The  N i c was set to 2 in our experiment.
  • Inactive IPs. The same as inactive clients, we erased IPs that host only one domain in our network data.
  • Exceptions. Similar to previous work [18], we kept malicious domains and their related information even when they complied with the above rules, considering that malicious domains usually are inactive to avoid detection.

4.5. Attention-GCN Classifier

Model. In this step, we took three graphs that described relations between domains from different views, as input and output are deep features for each domain. Considering that diverse relations have different influences for detecting malicious domains, we applied GCN with an attention mechanism in our model. The hidden layer is designed as follows:
H c r l + 1 = σ ( L c r H c r l W c r )
H r r l + 1 = σ ( L c r H r r l W r r )
H n r l + 1 = σ ( L c r H n r l W n r )
H L = i ( c r , r r , n r ) s o f t m a x ( α i ) H i L
L = D 1 2 ( A + I ) D 1 2
Formulas (4)–(6) extract features of relation graphs, and final hidden features can be calculated by combining them with attention mechanism as shown in Formula (7), where H i l denotes the feature convergence process of the lth convolutional layer on the relation graph i, H i l R n v i × d v , n v i denotes the number of nodes in relation graph i, i { c r , r r , n r } , H i 0 = ( X ) , and X denotes the initial embedding vector of the node, X R d v , L c r , L r r , L n r denote the normalized Laplacian matrices of the three graphs, respectively, A and D represent the adjacency matrix and degree matrix of the subgraph, respectively, I is the identity matrix, L c r , L r r , L n r R n v i × n v i , and  W r r , W c r , W n r denote the trainable weight matrices on three relational graphs, W r r , W c r , W n r R d v × d v , and σ ( · ) represents ReLU activation function.
α i is used as the attention score of the H i L , i { c r , r r , n r } , which is determined by the similarity between the hidden features of domain nodes and the h ¯ i vector, which is the mean of the H i L of the domains. α i is calculated as follows:
α i = α i 1 α i 2 α i n
α i k = S i m i l a r i t y h i k L , h ¯ i = h i k L · h ¯ i t h ¯ i · h i k L
h ¯ i = v S i h i L ( v ) S i
where S i is the set of nodes of relation graph i.
Isolated points. In each relation graph, it is inevitable to have isolated nodes that have no edges with any other nodes, because it is impossible for all domains to have three relations with others in network data. Such nodes will be fairly negative to our model. Thus, we set one node’s related hidden vectors to zero if it was an isolated point in a graph to minimize the negative. For instance, the hidden vector h c r l ( d ) of domain d in layer l, which was an isolated node in client relation, was set to zero (Algorithm 1).
Algorithm 1 Attention-GCN Classifier
Input: 
client relation graph G c r ; resolution relation graph G a r ; cname relation graph G n r ;
       Parameters:
embedding size k; depth of layer L;
Output: 
feature vectors V of nodes (domains)
  1:
Initialize H ( 0 )
  2:
L c r N o r m a l i z e d A d j a c e n t M a t r i x ( G c r )
  3:
L r r N o r m a l i z e d A d j a c e n t M a t r i x ( G r r )
  4:
L n r N o r m a l i z e d A d j a c e n t M a t r i x ( G n r )
  5:
for l in range ( 0 , L 1 ) do
  6:
     H c r l + 1 = L c r H c r l W c r
  7:
     H r r l + 1 = L r r H r r l W r r
  8:
     H n r l + 1 = L n r H n r l W n r
  9:
end for
10:
H L = i ( c r , r r , n r ) s o f t m a x ( α i ) H i L
11:
V = F C ( H L )
12:
return V;
Loss Function For each domain, we obtained a k l e n g t h vector, which represented its deep features after the above processes. Then, distinguishing malicious domains could be regarded as a binary classification problem. Thus, a fully connected layer was set to judge whether a domain was malicious or not. Naturally, the training objective was to minimize the gap between our predicted results and labels of known nodes. Therefore, the loss could be measured by cross-entropy:
l o s s = d D L Q ( y d , y d ) = d D L ( y d log y d + ( 1 y d ) log ( 1 y d ) )
where D L is the set of all labeled domains, Q is the cross-entropy function, y d is the predicted result of domain d, and y d is its real label.
l o s s y d = ( y d y d 1 y d 1 y d )
y d = s o f t m a x H L · W f + b f W b + b b = s o f t m a x V · W b + b b
p a r a m s = p a r a m s η l o s s p a r a m s
where W b and W f are the trainable weight matrices for the output layer and the fully connected layer respectively, p a r a m s is the set of trainable parameters, η is the learning rate.

5. Evaluation

In this section, we implement a prototype of our system and evaluate it with multiple experiments from various perspectives.

5.1. Setup

To evaluate our system, we captured one week of DNS logs from a local education network in 2018. In our experiment, we only concentrated on successful A and cname records to obtain information. As described in Section 4, we collected four fields (domain, resolve IP, timestamp, and client) for A records to depict client relation and resolution relation, and two fields (domain and cname result) for cname records to obtain cname relation.
For labeling DNS data, we took ground truth from various sources: (1) Private blacklists/whitelists. The first and foremost ground truth we obtained was the blacklist and whitelist from a large Internet company. (2) Public blacklists/whitelists. We collected Alexa Top 10 K sites [30] for two months in 2018. The domain whose sTLD appeared every day in Alexa was marked as benign. Also, we collected public domain blacklists like 360 DGAs [31], malwaredomainslist.com [32], and Malc0de.com [33]. (3) Security Engine. We further leveraged VirusTotal [34], a popular security engine that has been used in many previous works as ground truth, to check our labels. Finally, we obtained over 10,000 labeled domains, including 2.6 K malicious domains and 8 K benign domains.
We implemented AGCN-Domain in Python 3.6 with PyTorch [35] to build the classifier and Networkx [36] to process the graph. In the following experiments, if there was no special explanation, we leveraged the 5-fold cross-validation technique to obtain final results, and the time window to capture client relation was set to one hour. Considering that it is impossible to infer the state of one domain with no relation to labeling nodes, we labeled at least one node for a component. In addition, we present the metrics used in the following experiment in Table 1.

5.2. Features

In the following experiments, we evaluated the effectiveness of our features by comparing fused domain relation features with individual relation features.
The results are shown in Table 2, where C-R represents features extracted from only client relation graph, R-R is for resolution relation, and N-R is for cname relation. In these experiments, parameters embedding size k was set to 10, layer L = 1 , and the initial labels fraction was 70%. The coverage ratio (CovRatio) in the table represents the domain ratio that the feature can cover. In other words, there were some domains that could not be predicted for one feature since they are isolated in the related graph. Such isolated domains were not calculated in the results for the individual feature. It can be seen that cname relation achieves the best result, but they can only cover a few domains, while our fused features can obtain good results with coverage of far more domains than the cname relation.

5.3. Initial Label Fraction

To further estimate the effectiveness of our system, we changed the initial label fraction of domains from 10% to 90% and observed the results, which are shown in Table 3. It is obvious that as the number of initially labeled domains increases, our system obtains better performance: the F1 score goes up from 0.87 to 0.94. It should be noted our work can detect malicious domains effectively even with limited labels: it obtains 94% accuracy and 98% precision even with only 10% initial labeled domains.

5.4. Sensitive Parameters

In this set of experiments, we analyzed the influence of parameters: embedding size k and hidden layer L.
We first set the initial label fraction to 10%, L = 2 , and experimented with k from 10 to 200. The results are shown in Figure 5. It can be seen from the figure that the embedding size k has little effect on the result: the accuracy and F1 score are almost unchanged as k increases from 10 to 200.
Then we set k = 10 , initial label fraction to 10%, and experimented with L from 1 to 7. The results are shown in Figure 6. Different from embedding size k, it can be seen that L has more impact on the result: with the increase of L, the performance slightly increases at first, but as it continually increases, the performance decreases rapidly.
The reason behind such a phenomenon is that in the beginning, nodes can obtain information from more neighbors, so they can contain deeper information on features. Once L exceeds a value, nodes obtain too much fusing information from far nodes and obtain invalid information on features.

5.4.1. Comparison with Other Models

We first compare our work with the other two models that are usually used for node classification tasks in the graph, DeepWalk and Node2Vec.
Then, we compare the AGCN-Domain model with the Basic GCN model to highlight the effectiveness of incorporating attention mechanisms in combining three relationship patterns for the detection of malicious domains.
  • DeepWalk [37]. DeepWalk learns representations of graph nodes from truncated random walks. For this experiment, we took the DeepWalk model to each relation graph and added them to obtain final embedding vectors for each node. Then, we leveraged a fully connected layer to distinguish malicious domains.
  • Node2Vec [38]. Node2Vec aims to learn the scalable features of nodes in the graph. Similar to DeepWalk, we took Node2Vec to each relation graph and obtained representations for each node, then added them to obtain final node features. Then, a fully connected layer was applied to predict malicious domains.
  • Basic GCN [29]. GCN is a famous model and has shown great performance in many areas. In this experiment, we took a basic GCN model to each graph and combined different relation features without an attention mechanism.
Table 4 shows the result. It can be seen that our work AGCN-Domain can obtain better performance than other models. The attention mechanism can effectively and intelligently combine different relation features with the consideration of their various significance.

5.4.2. Comparison with Other Malicious Domain Detection Systems

To further illustrate the performance of our system, we compared AGCN-Domain with three similar works [14,15,17] that detect malicious domains based on relations.
  • Manadhata, et al. [14] constructed a client-querying-domain bipartite graph to depict who is querying what. Then, the researchers labeled domains with ground truth and leveraged belief propagation algorithm to predict unknown domains’ states.
  • Khalil, et al. [15] generated a domain resolution relation graph to represent whether two domains are sharing common resolutions. Then malicious scores for nodes can be calculated based on their distance from all known malicious domains.
  • Lei, et al. [17] modeled domain behavior with three domain similarity graphs. The researchers derived domain features with graph embedding techniques from three graphs and detected malicious domains by concatenating these features.
To make the comparison, we implemented the above research locally in accordance with our understanding of their papers and experimented using the same local DNS data and labels. Table 5 shows the result. It can be seen that AGCN-Domain even covers more domains than other works with better performance (their limited relations make some domains untapped). It is probable that our system leverages more relations, and our model can obtain deeper and more essential features.

6. Conclusions

In this paper, we proposed a novel system named AGCN-Domain, which can distinguish malicious domains intelligently by fusing multiple domain relations considering their influences. Specifically, AGCN-Domain first analyzes domain behaviors with client-query-domain and domain-resolve-IP graphs. It describes three relations, client relation, resolution relation, and cname relation, with a relation graph of domains. Finally, a model composed by GCN and attention mechanism is applied to obtain and combine deep features of domains which are extracted from various relation graphs. We set multiple experiments from different angles to evaluate AGCN-Domain using 7 days of real-world data. We compared our AGCN-Domain model with classical graph representation algorithms, including Deepwalk, Node2vec, and BasicGCN. These algorithms simply sum up the node vector representations learned from the three relationship subgraphs without specifically focusing on the feature representation of any particular relationship subgraph or considering the combined representation of node vectors from the three relationship subgraphs. Consequently, their performance in experiments is inferior to our proposed AGCN-Domain model, demonstrating the effectiveness of our proposed attention mechanism. In comparison to the three malicious domain detection models proposed by Manadhata, Khalil, and Lei, which only consider one type of relationship between domains, the AGCN-Domain model comprehensively integrates the representations of three types of relationships through attention mechanisms, thereby highlighting the feature representations of malicious domains and improving the accuracy and F1 score of malicious domain detection. In the case of a 10% label initialization rate, the AGCN-Domain model exhibits excellent performance compared to other malicious domain detection systems, with an accuracy of 94.27% and an F1 score of 87.93%. Table 4 and Table 5 show that although the AGCN-Domain model performs well in terms of accuracy and precision, its recall rate is slightly lower than that of other malicious domain detection systems. We speculate that this is because some malicious domain relationship patterns are similar to benign domains, requiring further extraction of deeper relationship features. In summary, our proposed AGCN-Domain model intelligently integrates domain representation features from three relationship patterns through attention mechanisms. Its comprehensive performance surpasses that of the compared malicious domain detection systems, balancing precision and recall, reducing the number of false malicious domain alarms, and decreasing the workload of security personnel. This demonstrates the superiority of our proposed AGCN-Domain model. In future work, we plan to propose a method that can process massive data to improve the efficiency of the detection system and explore a real-time method of detecting malicious domains.

Author Contributions

Conceptualization, supervision, X.L.; methodology, Y.L. and H.C.; formal analysis, X.L. and Y.L.; investigation, original draft preparation Y.L. and H.C.; review and editing, X.L. and H.C.; project administration, X.L. and L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is in part supported by the National Key R&D Program of China (No. 2022YFB3104100), National Science Foundation of China (No. 62102109), and the Major Key Project of PCL (No. PCL2021A09, PCL2021A02, PCL2022A03).

Data Availability Statement

The data supporting all the findings in the article are accessible, and no additional source data are required.

Conflicts of Interest

Yixin Li was employed by Big Data Center of State Grid Corporation of China. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Big Data Center of State Grid Corporation of China had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AGCNGraph convolutional network with attention mechanism
GCNGraph convolutional network
CRClient relation
RRResolution relation
NRCname relation

References

  1. Antonakakis, M.; Perdisci, R.; Dagon, D.; Lee, W.; Feamster, N. Building a dynamic reputation system for dns. In Proceedings of the 19th USENIX Security Symposium (USENIX Security 10), Washington, DC, USA, 11–13 August 2010; USENIX Association: Berkeley, CA, USA, 2010; p. 18. [Google Scholar]
  2. Bilge, L.; Sen, S.; Balzarotti, D.; Kirda, E.; Kruegel, C. Exposure: A passive dns analysis service to detect and report malicious domains. ACM Trans. Inf. Syst. Secur. (TISSEC) 2014, 16, 1–28. [Google Scholar] [CrossRef]
  3. Bilge, L.; Kirda, E.; Kruegel, C.; Balduzzi, M. Exposure: Finding malicious domains using passive dns analysis. In Proceedings of the 18th Annual Network and Distributed System Security Symposium (NDSS2011), San Diego, CA, USA, 6–9 February 2011; pp. 1–17. [Google Scholar]
  4. Antonakakis, M.; Perdisci, R.; Lee, W.; Vasiloglou, N.; Dagon, D. Detecting malware domains at the upper dns hierarchy. In Proceedings of the 20th USENIX Conference on Security (USENIX Security 11), San Francisco, CA, USA, 8–12 August 2011; USENIX Association: Berkeley, CA, USA, 2011; p. 27. [Google Scholar]
  5. Chiba, D.; Yagi, T.; Akiyama, M.; Shibahara, T.; Yada, T.; Mori, T.; Goto, S. Domainprofiler: Discovering domain names abused in future. In Proceedings of the 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Toulouse, France, 28 June–1 July 2016; pp. 491–502. [Google Scholar]
  6. Hao, S.; Kantchelian, A.; Miller, B.; Paxson, V.; Feamster, N. Predator: Proactive recognition and elimination of domain abuse at time-of-registration. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1568–1579. [Google Scholar]
  7. Schüppen, S.; Teubert, D.; Herrmann, P.; Meyer, U. {FANCI}: Feature-based automated nxdomain classification and intelligence. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, USA, 15–17 August 2018; pp. 1165–1181. [Google Scholar]
  8. Yadav, S.; Reddy, A.K.K.; Reddy, A.; Ranjan, S. Detecting algorithmically generated malicious domain names. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, Melbourne, Australia, 1–3 November 2010; pp. 48–61. [Google Scholar]
  9. Schiavoni, S.; Maggi, F.; Cavallaro, L.; Zanero, S. Phoenix: Dga-based botnet tracking and intelligence. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment; Springer: Berlin/Heidelberg, Germany, 2014; pp. 192–211. [Google Scholar]
  10. Woodbridge, J.; Anderson, H.S.; Ahuja, A.; Grant, D. Predicting domain generation algorithms with long short-term memory networks. arXiv 2016, arXiv:1611.00791. [Google Scholar]
  11. Tran, D.; Mac, H.; Tong, V.; Tran, H.A.; Nguyen, L.G. A lstm based framework for handling multiclass imbalance in dga botnet detection. Neurocomputing 2018, 275, 2401–2413. [Google Scholar] [CrossRef]
  12. Xu, C.; Shen, J.; Du, X. Detection method of domain names generated by dgas based on semantic representation and deep neural network. Comput. Secur. 2019, 85, 77–88. [Google Scholar] [CrossRef]
  13. Rahbarinia, B.; Perdisci, R.; Antonakakis, M. Segugio: Efficient behavior-based tracking of malware-control domains in large isp networks. In Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Rio de Janeiro, Brazil, 22–25 June 2015; pp. 403–414. [Google Scholar]
  14. Manadhata, P.K.; Yadav, S.; Rao, P.; Horne, W. Detecting malicious domains via graph inference. In European Symposium on Research in Computer Security; Springer International Publishing: Cham, Switzerland, 2014; pp. 1–18. [Google Scholar]
  15. Khalil, I.; Yu, T.; Guan, B. Discovering malicious domains through passive dns data graph analysis. In Proceedings of the ASIA CCS ’16: 11th ACM on Asia Conference on Computer and Communications Security, Xi’an, China, 30 May–3 June 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 663–674. [Google Scholar]
  16. Sun, X.; Wang, Z.; Yang, J.; Liu, X. Deepdom: Malicious domain detection with scalable and heterogeneous graph convolutional networks. Comput. Secur. 2020, 99, 102057. [Google Scholar] [CrossRef]
  17. Lei, K.; Fu, Q.; Ni, J.; Wang, F.; Yang, M.; Xu, K. Detecting malicious domains with behavioral modeling and graph embedding. In Proceedings of the 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–10 July 2019; pp. 601–611. [Google Scholar]
  18. Sun, X.; Tong, M.; Yang, J.; Xinran, L.; Heng, L. Hindom: A robust malicious domain detection system based on heterogeneous information network with transductive classification. In Proceedings of the 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), Beijing, China, 23–25 September 2019; USENIX Association: Berkeley, CA, USA, 2019; pp. 399–412. [Google Scholar]
  19. Zou, F.; Zhang, S.; Rao, W.; Yi, P. Detecting malware based on dns graph mining. Int. J. Distrib. Sens. Netw. 2015, 11, 102687. [Google Scholar] [CrossRef]
  20. Jia, Y.; Gu, Z.; Jiang, Z.; Gao, C.; Yang, J. Persistent graph stream summarization for real-time graph analytics. World Wide Web 2023, 26, 2647–2667. [Google Scholar] [CrossRef]
  21. Jia, Y.; Gu, Z.; Du, L.; Long, Y.; Wang, Y.; Li, J.; Zhang, Y. Artificial intelligence enabled cyber security defense for smart cities: A novel attack detection framework based on the MDATA model. Knowl.-Based Syst. 2023, 276, 110781. [Google Scholar] [CrossRef]
  22. Jia, Y.; Gu, Z.; Li, A. MDATA: A New Knowledge Representation Model: Theory, Methods and Applications; Springer Nature: New York, NY, USA, 2021; Volume 12647, pp. 1–255. [Google Scholar]
  23. Lee, J.; Lee, H. Gmad: Graph-based malware activity detection by dns traffic analysis. Comput. Commun. 2014, 49, 33–47. [Google Scholar] [CrossRef]
  24. Peng, C.; Yun, X.; Zhang, Y.; Li, S.; Xiao, J. Discovering malicious domains through alias-canonical graph. In Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, Sydney, Australia, 1–4 August 2017; pp. 225–232. [Google Scholar]
  25. Najafi, P.; Mühle, A.; Pünter, W.; Cheng, F.; Meinel, C. Malrank: A measure of maliciousness in siem-based knowledge graphs. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, PR, USA, 9–13 December 2019; pp. 417–429. [Google Scholar]
  26. Anderson, H.S.; Woodbridge, J.; Filar, B. Deepdga: Adversarially-tuned domain generation and detection. In Proceedings of the AISec ’16: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, Vienna, Austria, 28 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 13–21. [Google Scholar]
  27. Fu, Y.; Yu, L.; Hambolu, O.; Ozcelik, I.; Husain, B.; Sun, J.; Sapra, K.; Du, D.; Beasley, C.T.; Brooks, R.R. Stealthy domain generation algorithms. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1430–1443. [Google Scholar] [CrossRef]
  28. Yun, X.; Huang, J.; Wang, Y.; Zang, T.; Zhou, Y.; Zhang, Y. Khaos: An adversarial neural network dga with high anti-detection ability. IEEE Trans. Inf. Forensics Secur. 2019, 15, 2225–2240. [Google Scholar] [CrossRef]
  29. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. Conference Track Proceedings OpenReview.net 2017. [Google Scholar]
  30. Alexa. Available online: https://www.alexa.com (accessed on 15 September 2022).
  31. 360DGAs. Available online: https://data.netlab.360.com/dga/ (accessed on 15 September 2022).
  32. MalwareDomainList. Available online: https://www.malwaredomainlist.com (accessed on 15 September 2022).
  33. Malc0de.com. Available online: https://malc0de.com/bl/ZONES (accessed on 15 September 2022).
  34. VirusTotal. Available online: https://www.virustotal.com (accessed on 15 September 2022).
  35. Pytorch. Available online: https://pytorch.org (accessed on 10 September 2022).
  36. Networkx. Available online: https://networkx.org (accessed on 10 September 2022).
  37. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  38. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864. [Google Scholar]
Figure 1. Client Relations.
Figure 1. Client Relations.
Mathematics 12 00640 g001
Figure 2. Resolution Relations.
Figure 2. Resolution Relations.
Mathematics 12 00640 g002
Figure 3. Cname Relations.
Figure 3. Cname Relations.
Mathematics 12 00640 g003
Figure 4. Architecture of AGCN-Domain.
Figure 4. Architecture of AGCN-Domain.
Mathematics 12 00640 g004
Figure 5. Impact of embedding size k.
Figure 5. Impact of embedding size k.
Mathematics 12 00640 g005
Figure 6. Impact of layer L.
Figure 6. Impact of layer L.
Mathematics 12 00640 g006
Table 1. Confusion Matrix Calculation Table.
Table 1. Confusion Matrix Calculation Table.
Metrics NameInstruction
TPThe number of malicious domains predicted as malicious
FPThe number of benign domains predicted as malicious
TNThe number of benign domains predicted as benign
FNThe number of malicious domains predicted as benign
Accuracy(TP + TN)/(TP + FP + TN + FN)
PrecisionTP/(TP + FP)
RecallTP/(TP + FN)
F12 × (Precision × Recall)/(Precision + Recall)
Table 2. Effectiveness evaluation of malicious domain classification under different relationship models and the comprehensive assessment of the three relationship models.
Table 2. Effectiveness evaluation of malicious domain classification under different relationship models and the comprehensive assessment of the three relationship models.
Relation PatternAccuracyPrecisionRecallF1CovRatio
C-R0.94210.98570.79910.882795.04%
R-R0.97440.96500.89320.927766.96%
N-R0.99551.0000.83330.909111.48%
Fused0.96930.98140.89150.9343100.0%
Table 3. Result with different fraction of labels.
Table 3. Result with different fraction of labels.
Initial LabelsAccuracyPrecisionRecallF1
10%0.94270.98110.79670.8793
30%0.95640.97700.84870.9083
50%0.96430.98230.87310.9245
70%0.96930.98140.89150.9343
90%0.97200.98670.90240.9427
Table 4. Comparison with other models with 10% initial labels.
Table 4. Comparison with other models with 10% initial labels.
ModelAccuracyPrecisionRecallF1
DeepWalk0.91710.93320.73650.8233
Node2vec0.92120.93230.75390.8337
BasicGCN0.93410.97860.76400.8581
AGCN-Domain0.94270.98110.79670.8793
Table 5. Comparison with other detection systems with 10% initial labels.
Table 5. Comparison with other detection systems with 10% initial labels.
MethodAccuracyPrecisionRecallF1
Manadhata [14]0.89940.79980.87360.8351
Khail [15]0.94040.88690.81070.8471
Lei [17]0.92170.78100.86760.8221
AGCN-Domain0.94270.98110.79670.8793
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Luo, X.; Li, Y.; Cheng, H.; Yin, L. AGCN-Domain: Detecting Malicious Domains with Graph Convolutional Network and Attention Mechanism. Mathematics 2024, 12, 640. https://doi.org/10.3390/math12050640

AMA Style

Luo X, Li Y, Cheng H, Yin L. AGCN-Domain: Detecting Malicious Domains with Graph Convolutional Network and Attention Mechanism. Mathematics. 2024; 12(5):640. https://doi.org/10.3390/math12050640

Chicago/Turabian Style

Luo, Xi, Yixin Li, Hongyuan Cheng, and Lihua Yin. 2024. "AGCN-Domain: Detecting Malicious Domains with Graph Convolutional Network and Attention Mechanism" Mathematics 12, no. 5: 640. https://doi.org/10.3390/math12050640

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop