In the Evaluation section, we evaluate the proposed model. We introduce the datasets used in our experiments, describe the experimental setup and the chosen parameters, and detail the metrics employed for performance evaluation. We then compare the precision, accuracy, F1 score, recall, and average time consumption of GATv2, GraphSAGE, GCN, and our method. Moreover, we discuss how varying the number of clusters and the number of layers affects the performance of the proposed approach.
5.1. Dataset
In general, we can construct a network topology graph from a set of network flow data. In this graph, we represent the IP address of each network device as a node, and the network communication flows between network devices are constructed as edges in the graph. In formal terms, we can define the network topology graph as $G = (V, E)$, where graph $G$ consists of a node set $V$ and an edge set $E$. We can represent such a graph in the format of an adjacency matrix, where a network topology graph with $n$ nodes and its network flows is represented as $A \in \{0, 1\}^{n \times n}$. If there exists a network flow between network device $i$ and network device $j$, then $A_{ij} = 1$.
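As an illustration of this construction, the following minimal sketch builds such an adjacency matrix from a list of flow records. The (src_ip, dst_ip) tuple layout is an assumption made for illustration and does not reflect the actual CTU-13 schema.

```python
# Minimal sketch: build an undirected network topology graph from flow
# records. Each record is assumed to be a (src_ip, dst_ip) pair; this
# field layout is illustrative, not the CTU-13 schema.
import numpy as np

def build_adjacency(flows):
    # Map each distinct IP address to a node index.
    ips = sorted({ip for src, dst in flows for ip in (src, dst)})
    index = {ip: i for i, ip in enumerate(ips)}

    # A[i, j] = 1 iff at least one flow was observed between nodes i and j.
    n = len(ips)
    A = np.zeros((n, n), dtype=np.int8)
    for src, dst in flows:
        i, j = index[src], index[dst]
        A[i, j] = A[j, i] = 1
    return A, index

flows = [("10.0.0.1", "10.0.0.2"), ("10.0.0.2", "10.0.0.3")]
A, index = build_adjacency(flows)
print(A)
```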
We used three publicly available botnet network graph datasets: C2, P2P, and Chord. These botnet network graph datasets were generated using the original network flow data from CTU-13 [49]. The C2 and P2P botnet network traffic [39] was generated from real malicious software traffic samples, while the Chord botnet network traffic was generated from synthetic malicious software traffic; the dataset is formed by embedding a P2P botnet into real traffic. All the graphs in the datasets mix botnet nodes and botnet network topological patterns with background traffic collected in 2018 from the Center for Applied Internet Data Analysis (CAIDA). In the C2 and P2P datasets, each graph contains approximately 3000 botnet nodes, while in Chord, each graph contains 10,000 botnet nodes. All the graphs are attribute-less, meaning that the node attributes are vectors of all ones. The statistical information for the C2, P2P, and Chord datasets is presented in Table 1, Table 2, and Table 3.
5.2. Experimental Setup
Our experiments were conducted on a Linux server with two 2.1 GHz 16-core Intel(R) Xeon(R) processors, 256 GB of memory, and a Tesla A800 GPU. The proposed model was developed in Python using several deep learning packages, including Scikit-learn, PyTorch Geometric, and PyTorch. For hyperparameter tuning, a grid search was performed to ensure the optimal settings were used. Our grid search values are given in Table 4.
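As a rough illustration of this tuning procedure, the sketch below drives a grid search with Scikit-learn's ParameterGrid. The parameter names, value ranges, and the train_and_validate stand-in are hypothetical placeholders and do not reproduce the actual values in Table 4.

```python
# Illustrative grid search driver. The parameter names and value ranges
# below are placeholders, not the actual contents of Table 4.
import random
from sklearn.model_selection import ParameterGrid

def train_and_validate(learning_rate, hidden_dim, num_layers):
    # Stand-in for one full training run: a real implementation would
    # train the model with these settings and return its validation F1.
    random.seed(hash((learning_rate, hidden_dim, num_layers)) % (2**32))
    return random.random()

param_grid = {
    "learning_rate": [1e-2, 1e-3],
    "hidden_dim": [32, 64],
    "num_layers": [6, 8],
}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    score = train_and_validate(**params)
    if score > best_score:
        best_score, best_params = score, params

print("best configuration:", best_params, "score:", best_score)
```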
5.3. Effectiveness
We evaluated the performance of our proposed method on three datasets: C2, P2P, and Chord. The evaluation metrics are precision, accuracy, F1 score, and recall. We also considered the average runtime as a reference for comparison with GATv2, GraphSAGE, and GCN.
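For reference, these four metrics can be computed with Scikit-learn as in the following toy sketch; the node labels here are made-up values (1 = botnet node, 0 = background), not results from our experiments.

```python
# Computing the four reported metrics with scikit-learn on toy labels.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground-truth node labels
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]  # predicted node labels

print("precision:", precision_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("f1 score: ", f1_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
```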
The performance evaluation results are presented in Table 5. On the C2 dataset, GATv2, GraphSAGE, and our approach all demonstrate high precision, each exceeding 99%. Among them, our approach achieves the highest precision of 99.66%, surpassing GraphSAGE by 0.1%. Moreover, all three methods achieve an accuracy of over 99.95%; relative to precision, the difference in accuracy is approximately 1%. The F1 score follows a trend similar to that of precision. While our approach does not achieve the highest recall, the gap to the best result is only around 0.04%. In terms of average runtime (ave_time), our approach takes approximately 33% longer than GraphSAGE.
On the other two P2P-based datasets, P2P and Chord, precision, accuracy, and F1 score follow the same pattern as above. In terms of recall, GATv2 and our method each achieve the best result on one dataset. However, GraphSAGE, which performed best on the C2 dataset, achieves a recall of only 76.16% on the Chord dataset. Apart from this exception, recall fluctuates only minimally across all four methods.
It is worth noting that our method achieves the lowest ave_time on the Chord dataset, but on the P2P dataset, GraphSAGE still outperforms it. We speculate that this is related to how the graph is partitioned and stored after METIS partitioning. In comparison, GraphSAGE performs fixed-neighbor sampling for each node in the graph, allowing it to efficiently capture local neighborhood information and reduce training time. However, this sampling strategy is not conducive to distributed detection of botnets.
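To make this contrast concrete, the sketch below shows both mini-batching styles in PyTorch Geometric: METIS-based partitioning via ClusterData/ClusterLoader (the Cluster-GCN scheme our approach builds on) versus GraphSAGE-style fixed-neighbor sampling via NeighborLoader. The toy graph and all numeric settings are illustrative only.

```python
# Two mini-batching strategies in PyTorch Geometric on a toy graph.
# ClusterData requires METIS support (e.g., via torch-sparse/pyg-lib).
import torch
from torch_geometric.data import Data
from torch_geometric.loader import ClusterData, ClusterLoader, NeighborLoader

edge_index = torch.randint(0, 100, (2, 400))  # random toy graph
data = Data(x=torch.ones(100, 1), edge_index=edge_index, num_nodes=100)

# Cluster-GCN style: METIS splits the graph into parts, and each batch
# is the subgraph induced by a few parts.
parts = ClusterData(data, num_parts=10)
cluster_batches = ClusterLoader(parts, batch_size=2, shuffle=True)

# GraphSAGE style: sample a fixed number of neighbors per layer around
# a batch of seed nodes.
sage_batches = NeighborLoader(data, num_neighbors=[10, 10], batch_size=32)

for batch in cluster_batches:
    print(batch)  # each batch is a small induced subgraph
    break
```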
Overall, the performance evaluation results indicate that our approach achieves the best precision and F1 score across all datasets, reaching 99.71% and 99.74%, respectively. For the Chord dataset, our approach outperforms other methods in all three aspects, even achieving an average runtime as low as 125.67 s.
In addition, we compared the impact of clustering and non-clustering on the model, as shown in Figure 6. Across different numbers of model layers, the accuracy with clustering is higher than without it. We also assessed the impact of different cluster numbers on our approach through 3-fold cross-validation, as shown in Table 6. During the METIS partitioning process, we set the cluster number to 100 and 1000, with batch sizes of 10, 20, and 25 for the former and 5, 10, 20, and 50 for the latter. We separately discuss the influence of the cluster number on the detection results at batch sizes of 10 and 20, as well as the impact of different batch sizes on the detection results at a fixed cluster number.
In Table 6, when the batch size is 10, the precision with a cluster number of 1000 is 0.086% higher than with a cluster number of 100. However, as the cluster number increases, accuracy, F1 score, and recall all decrease to some degree for batch sizes of both 10 and 20.
When the cluster number is fixed, precision fluctuates only minimally in both test groups. With a cluster number of 100, accuracy, F1 score, and recall increase with batch size, although the improvement is not significant. With a cluster number of 1000, accuracy, F1 score, and recall reach their maximum values at a batch size of 50.
To summarize, as the cluster number increases, precision improves slightly, but the remaining metrics decline to varying degrees. When the cluster number is fixed, accuracy, F1 score, and recall all increase with batch size, showing a positive correlation with it.
In Table 7, we evaluate the impact of the number of layers on our approach. Seven different depths were tested: 6, 7, 8, 9, 10, 12, and 16 layers. Even at 16 layers, our approach was still able to effectively distinguish between malicious and non-malicious nodes.
Similar to Table 6, as the number of layers increases, Table 7 shows that precision increases by between 0.04% and 0.29%, while accuracy, F1 score, and recall decrease by 0.044%, 0.416%, and 1.176%, respectively. At the same time, the average time consumed per layer increases by 60.39%. Compared with the gain in precision as depth grows, the negative impact on accuracy, F1 score, recall, and time consumption is more prominent.
To enhance the propagation of information from neighboring nodes and avoid gradient explosion, we utilize diagonal enhancement to process the embedding representation of each layer. As shown in Figure 7, comparing diagonal enhancement with JumpingKnowledge across different depths shows that at six layers, the former achieves higher accuracy than the latter on all three datasets.
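A minimal sketch of this diagonal-enhancement step, following the propagation rule proposed for Cluster-GCN, is given below. The scaling factor lam and all tensor shapes are illustrative assumptions, not the settings used in our experiments.

```python
# Diagonal-enhancement propagation, as proposed for Cluster-GCN: the
# normalized adjacency matrix is augmented with a scaled copy of its own
# diagonal before each layer's propagation. lam is an illustrative value.
import torch

def diag_enhanced_propagate(A_norm, X, W, lam=1.0):
    # Computes relu((A_norm + lam * diag(A_norm)) @ X @ W).
    A_enh = A_norm + lam * torch.diag(torch.diagonal(A_norm))
    return torch.relu(A_enh @ X @ W)

n, d_in, d_out = 5, 4, 3
A_norm = torch.rand(n, n)   # stand-in for the normalized adjacency matrix
X = torch.rand(n, d_in)     # node embeddings from the previous layer
W = torch.rand(d_in, d_out) # layer weight matrix
H = diag_enhanced_propagate(A_norm, X, W)
print(H.shape)              # torch.Size([5, 3])
```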
Based on the above findings, the proposed model generalizes best when the graph convolutional network has six layers; we therefore take six as the optimal number of layers across the three datasets.
The loss curves of the model on both the training and validation sets are plotted in Figure 8. The losses on both sets show a steady downward trend as the number of training epochs increases. Notably, the difference between the training and validation losses remains very small, indicating that the designed Cluster-GCN model performs well on the dataset without significant signs of overfitting.