Article

Open Set Recognition for Malware Traffic via Predictive Uncertainty

1
State Key Laboratory of Mathematical Engineering and Advanced Computing, PLA Information Engineering University, Zhengzhou 450001, China
2
National Digital Switching System Engineering Technological Research Center, PLA Information Engineering University, Zhengzhou 450001, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(2), 323; https://doi.org/10.3390/electronics12020323
Submission received: 20 December 2022 / Revised: 2 January 2023 / Accepted: 3 January 2023 / Published: 8 January 2023
(This article belongs to the Special Issue Recent Advances in AI-Enabled Internet of Things Security and Privacy)

Abstract

Existing machine learning-based malware traffic recognition techniques can effectively detect abnormal behaviors in the network. However, almost all of them focus on a closed-set scenario in which the data used for training and testing come from the same label space. Since sophisticated malware and advanced persistent threats are evolving, it is impossible to exhaust all attacks to train a complete recognition model under the existing technical conditions. Therefore, recognition in the real network is an open-set problem, i.e., the recognition system should identify unknown and unseen attacks at test time. In this paper, we propose an uncertainty-aware method to identify known malicious traffic accurately and handle unknown traffic effectively. This method employs predictive uncertainty in deep learning as an indicator for unknown class detection. The predictive uncertainty represents the confidence in neural network predictions. In particular, the Deep Evidence Malware Traffic Recognition (DEMTR) model is presented to provide the multi-classification probability and predictive uncertainty in open-set scenarios using evidential deep learning. We demonstrate the performance of DEMTR on the MCFP dataset. Experimental results indicate that the proposed model outperforms the baseline methods in accuracy and F1-score.

1. Introduction

With the rapid development and widespread application of big data, the Internet of Things (IoT), and cloud computing, the network has become pervasive in people’s daily lives. Correspondingly, network attacks have become increasingly frequent, and the network faces numerous threats [1]. The malicious traffic generated by network attacks is among the main network security threats and a key target of network security monitoring. Applying Artificial Intelligence (AI) to malicious traffic recognition makes it possible to exploit the large volumes of traffic and logs in cyberspace and to mine the characteristics and associations of massive data. At present, AI has been applied in many industrial and commercial products, such as Cisco Firepower NGIPS and Nsfocus NGIPS.
Malware traffic recognition aims to classify network traffic containing malicious behaviors into some predefined traffic classes (closed sets) [2]. The existing malware traffic recognition methods use supervised or unsupervised learning to identify network attacks. Supervised learning trains data-driven classifiers on traffic samples of known classes. As a result, it achieves satisfactory results on closed sets, but it does not take into account samples outside the training set [3]. Once a strange sample is submitted to the classifier, it may be misclassified as a predefined class, resulting in a high false alarm rate. Unsupervised learning methods, such as clustering, achieve the goal of traffic classification by gathering unlabeled samples from the same class in the feature space [4]. Therefore, unsupervised learning methods can naturally deal with unknown classes. However, their accuracy is not high enough when dealing with high-dimensional traffic data and their use in practical projects is limited. In addition to more flexible attack means, the number of malware categories is growing rapidly as new polymorphic malware and zero-day threats continuously emerge. Consequently, many malicious attacks remain undiscovered, and the unknown traffic generated by them brings potential threats to network management [5], which becomes the main obstacle to improving the performance of malicious traffic identification systems.
In practice, a deployed classifier will inevitably receive data from categories it has never seen before. Malware traffic recognition in the real world is essentially an open-set recognition problem in which the classifier should accurately identify known malicious traffic and distinguish unknown traffic when it appears. The core of open-set recognition is the ability to distinguish open-set data outside of the K closed-set classes, which is more challenging than closed-set identification and more significant for security-related applications [6]. The difficulty lies in modeling unknown classes without any unknown class instances. Existing efforts on open-set malware traffic recognition are quite limited, with few exceptions [7,8,9]. They mainly use a threshold-based unknown class detection scheme, whose effectiveness depends on how the classifier is trained and on the score used for unknown class discovery. Some studies have proposed using the maximum softmax probability value as an indicator for unknown class detection [9]. Here, the threshold is determined as the lower bound of the maximum softmax value of known class samples. However, the softmax outputs are often falsely high because the softmax function normalizes its outputs, which eventually leads to large numbers of unknown class samples being incorrectly classified as known classes. The Open-CNN method [8] uses the distance between test instances and known class instances in the latent feature space and takes the upper bound of the distance as the threshold. However, the distance function learned by the classifier from the training dataset does not necessarily measure the test dataset correctly, so it is of limited use in identifying unknown classes. These defects degrade performance in open-set recognition.
This paper proposes an uncertainty-aware open-set recognition method for malware traffic to accomplish the malicious traffic open-set identification task, which uses the predictive uncertainty in deep learning to discover unknown classes. Furthermore, the Deep Evidence Malware Traffic Recognition (DEMTR) model is presented by combining convolution neural networks with evidence theory. The major drawbacks of existing methods include a falsely high maximum softmax probability value and a weak generalization ability for distance metrics. To deal with these issues so that the model can better discover the unknown class data in the open-set malicious traffic identification task, we transform the original problem into an uncertainty estimation problem. DEMTR simultaneously accomplishes the multi-class classification and uncertainty estimation tasks using a deep neural network (DNN) to predict a Dirichlet distribution of class probabilities. The prediction procedure of DEMTR is described as evidence collection, which affords a foundation for quantifying the predictive uncertainty of diverse malware. In inference, known attacks will incur low uncertainty, while unknown attacks will incur high uncertainty so that the model can identify the unknown class. The main contributions of this paper are as follows:
  • The uncertainty quantification in deep neural networks is applied to the open-set malicious traffic recognition, which is a solution to the traditional closed-set methods’ inability to identify unknown attacks effectively.
  • A new model utilizing evidential deep learning is proposed that can quantify multi-class classification probability and predictive uncertainty. Also, the predictive uncertainty is calibrated with a new loss function. The proposed method can identify unknown classes to a certain extent.
  • Experiments are carried out on real datasets to test and verify the validity of DEMTR’s model. The proposed model significantly improved accuracy, F1-score, and other indicators compared with the existing methods.

2. Related Work

The open set recognition (OSR) problem was first discovered in the field of face recognition and then found ubiquitous in many fields. To reject unknown classes, the “1-vs-set SVM” method [10] was proposed by adding an additional hyperplane for each class to restrict the decision space of the class. On this basis, the Weibull calibrated SVM (W-SVM) [11] and PI-SVM [12] were successively proposed to calibrate the class confidence scores using statistical extreme value theory. To overcome the shortcomings of the softmax function in deep neural networks (DNNs), Bendale et al. [13] proposed OpenMax to constrain the open space risk of DNN models, and it is the first solution to apply deep learning to open-set recognition problems. To further enhance the unknown class detective ability of OpenMax, G-OpenMax [14] introduces a generative method to synthesize unknown class instances from known class data and train DNNs with synthetic data. Similarly, Generative Adversarial Networks (GANs) are used to simulate representative unknown class instances [15]. However, generative methods are subject to the authenticity and credibility of the generated data, and the classification results obtained by training with generated data are poor.
For highly structured traffic samples, detecting unknown classes is more complex, leading to more challenges for open-set recognition of malicious traffic. Zhang et al. [4] integrated supervised and unsupervised machine learning (ML) techniques to obtain additional confidence scores for determining whether test samples belonged to the unknown class. Bekerman et al. [16] proposed an end-to-end monitoring system named RTC to identify new threats by manually extracting features from different protocols and network layer traffic data. In another effort, Cruz et al. applied W-SVM to open-set intrusion identification and Weibull distribution to fit samples of the decision boundary to limit the open space risk [7]. However, ML relies on manually extracted features, which expend a great deal of labor power and material resources. As DL has a strong ability for feature extraction and automatic learning, the research on unknown attack detection is shifting from ML to DL. Javaid et al. [17] used Sparse Autoencoders to implement unsupervised feature learning, which was validated on the NSL-KDD dataset to detect unpredictable attacks. However, Sparse Autoencoders highly rely on training data, resulting in limited generalization ability. Inspired by RTC, the SEEN [18] approach employs siamese networks to obtain high-dimensional embedding representations of the samples. This approach sets a critical value between the known and unknown classes based on the distance between the samples. Employing the extreme value theory, Yong et al. [8] proposed an Open-CNN model to detect unknown network attacks. An Open-CNN calculates distances between activation vectors of each known class sample and average activation vectors of the class, fits larger distances to obtain an extreme value distribution model, and reassigns the activation vector to explain the unknown class. 
Nevertheless, the cross-entropy loss used by the model cannot directly motivate the class instances to be projected near the average activation vector. As a result, it leads to overlapping areas in the decision space and further reduces the accuracy of closed sets significantly. Besides, the distance function derived from training datasets is probably not the right metric for test sets, resulting in little effect on unknown class recognition.
Uncertainty estimation in deep learning aims to estimate the uncertainty of a prediction (the predictive uncertainty), which is important for safe decision-making in high-risk fields [19]. The most common way is based on separately modeling the uncertainty induced by models and the data. Inadequate knowledge leads to model uncertainty, which is an attribute of models and can be mitigated, while data uncertainty is irreducible because it is an inherent property of data distribution. Recently, deep learning uncertainty estimation has been used for out-of-distribution detection. To this end, Bayesian neural networks (BNNs) have been used to quantify predictive uncertainty [20] as a technique to improve the detection rate of out-of-distribution samples. The results show that the uncertainty has the potential to be used for out-of-distribution sample discovery. Similar to out-of-distribution detection, open set recognition also needs to find samples with semantic deviation. Inspired by out-of-distribution detection, this paper attempted to use deep learning uncertainty to identify malicious traffic in the open set setting. However, BNNs are limited by difficult exact posterior inference and complex sampling operations during uncertainty quantification. To solve this dilemma, the present study adopts evidential deep learning [21] instead of BNNs to help build uncertainty-aware deep learning models, and the uncertainty representation is learned directly without sampling.

3. Method

In this section, we design DEMTR, a malware traffic detection model for open environments. DEMTR learns distinguishable features for malicious traffic recognition during training and models the uncertainty of each prediction to reject unknown samples. Traffic with high uncertainty is considered unknown, while traffic with low uncertainty is classified based on the learned classification probability. The overall framework of the proposed method is shown in Figure 1.

3.1. Problem Definition

Given the training dataset $D_{tr} = \{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in Y_{tr} = \{1, 2, \ldots, k\}$, $x_i$ is a traffic session, $y_i$ is the label of $x_i$, and $N$ represents the total number of sessions. The testing dataset $D_{te} = \{(x_i, y_i)\}$, with $y_i \in Y_{te} = \{1, 2, \ldots, k, \ldots, K\}$ and $K > k$, is open, i.e., it includes attack categories that are not present in the training set. Our method aims to obtain a model $M: x \to y$, $x \in D_{te}$, $y \in Y_{os} = \{1, 2, \ldots, k, unknown\}$, where an instance $x$ labeled as $unknown$ is categorized as a new class that did not emerge during the training phase.
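The mapping $M$ above can be illustrated with a minimal sketch of an open-set decision rule. It assumes the model already supplies class probabilities over the $k$ known classes and a scalar uncertainty in $[0, 1]$; the function name `open_set_predict` and the default threshold value are hypothetical and are not taken from the paper.

```python
from typing import List, Union

UNKNOWN = "unknown"  # label for classes never seen during training


def open_set_predict(probs: List[float], uncertainty: float,
                     threshold: float = 0.5) -> Union[int, str]:
    """Return a label in {1, ..., k, "unknown"}.

    `threshold` is an illustrative assumption; in practice it would have
    to be tuned, for example on a validation set.
    """
    if uncertainty > threshold:
        return UNKNOWN
    # Known classes are numbered from 1, following Y_tr = {1, ..., k}.
    best_index = max(range(len(probs)), key=lambda j: probs[j])
    return best_index + 1
```

A sample is thus rejected as unknown purely on the basis of its uncertainty; the class probabilities are consulted only for samples the model is confident about.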

3.2. Data Preprocessing

In this section, the composition of the traffic data used for model training is described in detail. This paper used the raw network traffic packets for network attack detection. Unlike the commonly used manual traffic packet feature extraction methods, this method does not need to design or filter the traffic features to be extracted and can retain all packet information. The Original Flow Data Extraction is shown in Algorithm 1. The data preprocessing includes three key steps, with detailed descriptions presented as follows:
  • Session splitting: The original traffic files are divided into sessions based on five-tuple information.
  • Packet processing: This step removes useless and interfering information from the data packets. It first removes the Ethernet layer. Since the three fields in the Ethernet layer have little effect on the gain of traffic classification [22], the data in the Ethernet layer are not used in this paper. Then IP addresses are anonymized. To avoid the model treating various IP addresses in the network layer as a critical factor for attack identification, both the source and destination IP address in the network layer header should be set to 0.0.0.0 in the feature extraction process. Finally, the UDP packet header is filled. As the packet headers of TCP and UDP have unequal lengths, considering the uniformity of feature structure, the UDP packet header is filled with 0x00 of 12 bytes to make its length 20 bytes.
    Algorithm 1. Original Flow Data Extraction
    Input: $D = \{x_i\}_{i=1}^{n}$: network traffic pcap files; $n_1$: packet_number; $n_2$: byte_length
    Output: Original flow feature set $Y$
    1: $S \leftarrow \emptyset$; $Y \leftarrow \emptyset$  // Initialize traffic session set S and flow feature set Y
    2: for $x_i \in D$ do
    3:     Extract packets $P_i = \{p_j\}_{j=1}^{m}$ and the five-tuple information $A_i = \{q_j\}_{j=1}^{m}$ from traffic packages
    4:     do
    5:         Gather packets with the same five-tuple information in $P_i$ into $s = \{p_k\}_{k=1}^{a}$
    6:         Add $s$ to $S$
    7:     until all packets in $P_i$ are selected
    8: end for
    9: for $s \in S$ do
    10:     $i \leftarrow 0$; $flow\_feature \leftarrow \emptyset$
    11:     for $p_k \in s$ do
    12:         $p_k \leftarrow move\_Internet\_layer(p_k)$  // Discard the Ethernet layer of packets
    13:         $p_k \leftarrow udp\_padding(p_k)$  // Pad UDP
    14:         if $Length(p_k) > n_2$ then
    15:             $pkt\_feature \leftarrow p_k[0 : n_2]$  // Intercept the first $n_2$ bytes of the packets
    16:         else
    17:             $pkt\_feature \leftarrow zero\_padding(p_k, n_2)$  // Fill the packets to $n_2$
    18:         end if
    19:         Add $pkt\_feature$ to $flow\_feature$; $i = i + 1$
    20:         if $i \geq n_1$ then break
    21:     end for
    22:     if $i < n_1$ then
    23:         $flow\_feature \leftarrow zero\_padding(flow\_feature, n_1 * n_2)$  // Fill the flow feature to $n_1 * n_2$
    24:     end if
    25:     Add $flow\_feature$ to $Y$  // Add the flow feature vector to the flow feature set
    26: end for
    27: return $Y$
  • Feature vectorization: Since each session contains a different number of packets and each packet contains a different number of bytes, we extract the first $n_1$ packets of each session and the first $n_2$ bytes of each packet to ensure that the data input to the model has the same dimension. Therefore, the final dimension of the session feature is $n_1 \times n_2$. If a session contains fewer than $n_1$ packets, it is padded with 0; otherwise, only the first $n_1$ packets are retained. A similar operation is applied to the bytes of a packet. To achieve optimal malware detection performance, the hyper-parameters are chosen by comparing specific values of packet number and byte length, as detailed in Section 4.3.1.
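The packet processing and feature vectorization steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes each packet is a `bytes` object that already starts at the IPv4 header (Ethernet layer removed) and that sessions have already been split by five-tuple. All function names are hypothetical.

```python
from typing import List

UDP_PROTOCOL = 17        # IPv4 protocol number for UDP
UDP_HEADER_LEN = 8       # bytes
TCP_MIN_HEADER_LEN = 20  # bytes; UDP headers are padded up to this length


def anonymize_ipv4(pkt: bytes) -> bytes:
    """Zero the source and destination addresses (IPv4 header bytes 12..19)."""
    if len(pkt) < 20:
        return pkt
    return pkt[:12] + bytes(8) + pkt[20:]


def pad_udp_header(pkt: bytes) -> bytes:
    """Insert 12 zero bytes after the 8-byte UDP header; leave other packets as-is."""
    if len(pkt) < 20 or pkt[9] != UDP_PROTOCOL:
        return pkt
    ip_header_len = (pkt[0] & 0x0F) * 4  # IHL field counts 32-bit words
    end = ip_header_len + UDP_HEADER_LEN
    return pkt[:end] + bytes(TCP_MIN_HEADER_LEN - UDP_HEADER_LEN) + pkt[end:]


def session_features(packets: List[bytes], n1: int, n2: int) -> List[int]:
    """Flatten one session into a vector of exactly n1 * n2 byte values."""
    rows = []
    for pkt in packets[:n1]:                          # keep at most n1 packets
        pkt = pad_udp_header(anonymize_ipv4(pkt))
        rows.append(pkt[:n2].ljust(n2, b"\x00"))      # truncate or zero-pad to n2
    rows += [bytes(n2)] * (n1 - len(rows))            # zero-pad missing packets
    return [byte for row in rows for byte in row]
```

Every session, regardless of its original length, is mapped to a vector of the same dimension, which is what allows the sessions to be batched as input to the model.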

3.3. DEMTR Model

The softmax function has long been in common use in deep learning models, and the maximum softmax output usually serves as the credibility of a prediction. However, the softmax output tends to be too “confident”, even for wrong predictions [23]. To overcome the limitations of softmax-based DNNs, this paper uses evidence-based uncertainty estimation techniques to formalize multi-class classification and uncertainty modeling jointly. By placing a Dirichlet distribution on the class probabilities, we treat the predictions of a neural net as subjective opinions and use a deterministic neural net to learn, from data, the function that collects the evidence leading to these opinions. The structure of the DEMTR model is shown in Figure 2. As can be seen, DEMTR is split into two logical parts, namely, evidence generation and result derivation.
Evidence refers to the support collected from data in favor of classifying a sample into a certain class. In this study, a one-dimensional convolutional neural network (1D-CNN) is used to generate an evidence vector because neural networks can capture evidence from input data to induce classification opinions. Besides, a CNN is more suitable for processing data with higher feature dimensions (e.g., images, texts, and encrypted traffic) compared with other deep learning models because it has the characteristics of parameter sharing and sparse connection and is good at extracting the data’s local features [24]. Unlike a 2D-CNN, a 1D-CNN does not need to convert inputs into a two-dimensional form and can retain the maximum information of the original data, which is conducive to the classification of encrypted traffic. As shown in Figure 2, the evidence generation part includes two convolutional layers, a pooling layer, and two fully connected layers. The convolutional layers extract distinguishable characteristics from the input and divide the global feature information into multiple local feature matrices, while the pooling layer performs dimension reduction and feature compression. In this study, we adopted maximum pooling (i.e., the maximum value within a local region is selected as the representative of that region). Then, two fully connected layers map the latent space calculated by the previous layers to the label space and alleviate the impact of feature location on the classification results by integrating multidimensional feature vectors into several values. Vectorized session data is passed through the convolutional, pooling, and fully connected layers in sequence and then transformed into evidence. In particular, given a sample $x^{(i)}$ for K-class classification, the corresponding evidence $e^{(i)}$ is denoted as:
$$e^{(i)} = g(f(x^{(i)}; \theta)) \tag{1}$$
where $f(\cdot)$ with parameters $\theta$ is learned by the neural network and $g(\cdot)$ is an evidence function that keeps the evidence $e^{(i)}$ non-negative. The evidence function can be implemented by an activation function (e.g., ReLU or Sigmoid) so that the network outputs a non-negative evidence vector $e^{(i)}$.
In the result derivation part, evidential deep learning is applied to quantify the classification uncertainty. This type of learning can model both the classification probability and the overall uncertainty. Subjective logic [25] treats the multi-classification problem as a belief mass assignment problem, assuming that the overall belief mass is constant. For a K-class classification problem, the belief mass is divided into $K + 1$ shares, which represent the belief mass of each of the $K$ classes and the overall uncertainty of the current prediction, respectively. Each share is non-negative, and the sum of these $K + 1$ values is 1:
$$u + \sum_{k=1}^{K} b_k = 1 \tag{2}$$
where $u \geq 0$ denotes the overall uncertainty and $b_k \geq 0$ is the belief mass of class $k$.
Subjective logic converts the learned evidence $e^{(i)}$ into the concentration parameters of a Dirichlet distribution through $\alpha^{(i)} = e^{(i)} + 1$. The Dirichlet distribution is the conjugate prior of the categorical distribution, so the DNN can present uncertainty while outputting the prediction results. The resultant predictor for a multi-class classification problem is another Dirichlet distribution whose parameters are set by the continuous output of the DNN. In the result derivation step, the concentration parameters of the Dirichlet distribution need to be determined. These parameters have a direct bearing on the uncertainty of prediction results. For a sample $x^{(i)}$ with an evidence vector $e^{(i)} = [e_1^{(i)}, \ldots, e_K^{(i)}]$, the Dirichlet distribution $Dir(p^{(i)} \mid \alpha^{(i)})$ is derived with parameters $\alpha^{(i)} \in \mathbb{R}^{K}$, $\alpha^{(i)} = [\alpha_1^{(i)}, \ldots, \alpha_K^{(i)}]$. Then, the belief mass $b_k$ and uncertainty $u$ are calculated as follows:
$$b_k^{(i)} = \frac{e_k^{(i)}}{S^{(i)}} = \frac{\alpha_k^{(i)} - 1}{S^{(i)}}, \quad u^{(i)} = \frac{K}{S^{(i)}} \tag{3}$$
where $S^{(i)}$ is the total strength of the Dirichlet distribution, expressed by $S^{(i)} = \sum_{k=1}^{K} \alpha_k^{(i)}$. From Equation (3), it can be inferred that the larger the amount of evidence obtained for a certain class, the higher its belief mass. In contrast, the uncertainty is inversely proportional to the total strength $S^{(i)}$, so the smaller the total amount of observed evidence, the greater the uncertainty.
A standard neural network classifier delivers a definite probability assignment of the possible classes to which a given sample belongs. However, the Dirichlet distribution parameterized on evidence denotes the density of each such probability assignment. Thus, it models second-order probability and uncertainty. The expected probability that $x^{(i)}$ is classified as the $k$th class equals the mean of the corresponding Dirichlet distribution, which is calculated as:
$$p_k = \frac{\alpha_k}{S} \tag{4}$$
For clarity, the above formulas are illustrated with a three-class classification task. Assuming evidence $e = \langle 30, 0, 0 \rangle$, the Dirichlet concentration parameter is $\alpha = \langle 31, 1, 1 \rangle$ with total strength $S = 33$, which gives the class probability $p = \langle 0.94, 0.03, 0.03 \rangle$ and uncertainty $u = 3/33 \approx 0.09$. Sufficient evidence has been observed, so the prediction is confident. In contrast, given $e = \langle 0.01, 0.01, 0.01 \rangle$, the Dirichlet concentration parameter is $\alpha = \langle 1.01, 1.01, 1.01 \rangle$, so the uncertainty $u = 3/3.03$ is about 1. The evidence is highly insufficient, leading to a doubtful classification result. When $e = \langle 1, 1, 1 \rangle$, the uncertainty is $u = 3/6 = 0.5$, which is still high, although it is reduced compared with the second case.
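The three cases above can be reproduced with a short sketch of Equations (2) to (4). The function name `dirichlet_opinion` is hypothetical; the arithmetic follows the definitions $\alpha = e + 1$, $S = \sum_k \alpha_k$, $b_k = e_k / S$, $p_k = \alpha_k / S$, and $u = K / S$.

```python
from typing import List, Tuple


def dirichlet_opinion(evidence: List[float]) -> Tuple[List[float], List[float], float]:
    """Map a non-negative evidence vector to (belief, probability, uncertainty)."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total strength S
    belief = [e / strength for e in evidence]
    prob = [a / strength for a in alpha]     # mean of the Dirichlet distribution
    uncertainty = num_classes / strength
    return belief, prob, uncertainty


for e in ([30.0, 0.0, 0.0], [0.01, 0.01, 0.01], [1.0, 1.0, 1.0]):
    _, p, u = dirichlet_opinion(e)
    print(e, [round(v, 2) for v in p], round(u, 2))
```

Note that the belief masses and the uncertainty always sum to 1, since $\sum_k e_k + K = S$.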

3.4. Training and Optimization

This section focuses on how to train a neural network to obtain classification evidence for each sample. The evidence is used to calculate the corresponding classification probability and the overall uncertainty. When a feature of the sample is associated with one of the K classes, the corresponding evidence is added, and the Dirichlet distribution is updated based on this finding. In this respect, specific patterns in network traffic samples may help classify them into a particular class, further illustrated by the example of remote-control malware njRAT. The traffic generated by the remote-control malware njRAT typically has characteristics such as more upstream traffic than downstream traffic and an increased proportion of packets with a PSH flag and SYN flag. If the network traffic has these characteristics, it is necessary to increase the Dirichlet concentration parameter corresponding to the njRAT class.
Traditional neural network classifiers typically use cross-entropy loss to guide the model in the right direction for training. The cross-entropy loss is expressed as:
$$L_{ce}^{(i)}(y^{(i)}, p^{(i)}; \theta) = -\sum_{k=1}^{K} y_k^{(i)} \log(p_k^{(i)}) \tag{5}$$
where $p_k^{(i)}$ is the predictive probability that $x^{(i)}$ belongs to the $k$th class. As for the proposed DEMTR model, for $x^{(i)}$, given the evidence $e^{(i)}$ output by the neural network, the parameter $\alpha^{(i)}$ of the Dirichlet distribution $Dir(p^{(i)} \mid \alpha^{(i)})$ can be obtained. Taking the expectation of the cross-entropy loss over this Dirichlet distribution, so that the model produces more evidence for the correct class of each sample, the modified loss function reduces to the following form:
$$L_{mce}^{(i)}(y^{(i)}, \alpha^{(i)}; \theta) = \int \left[ \sum_{k=1}^{K} -y_k^{(i)} \log(p_k^{(i)}) \right] \frac{1}{B(\alpha^{(i)})} \prod_{k=1}^{K} \left(p_k^{(i)}\right)^{\alpha_k^{(i)} - 1} dp^{(i)} = \sum_{k=1}^{K} y_k^{(i)} \left( \psi(S^{(i)}) - \psi(\alpha_k^{(i)}) \right) \tag{6}$$
where $\psi(\cdot)$ is the digamma function and $B(\cdot)$ is the multivariate beta function.
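The closed form $\sum_k y_k (\psi(S) - \psi(\alpha_k))$ can be evaluated as in the following sketch. It assumes a one-hot label, so the sum reduces to the term of the true class. The digamma function is approximated here with the standard recurrence $\psi(x) = \psi(x + 1) - 1/x$ and an asymptotic series, only to keep the example free of third-party dependencies; the names `digamma` and `mce_loss` are hypothetical.

```python
import math
from typing import List


def digamma(x: float) -> float:
    """Approximate the digamma function for x > 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:                 # shift x upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def mce_loss(evidence: List[float], true_class: int) -> float:
    """Loss for one sample with a one-hot label at index `true_class` (0-based)."""
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    return digamma(strength) - digamma(alpha[true_class])
```

The loss shrinks as evidence accumulates on the true class: `mce_loss([30, 0, 0], 0)` is much smaller than `mce_loss([1, 0, 0], 0)`.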
The DEMTR model trained with $L_{mce}$ can give the classification probability and predictive uncertainty. However, since its uncertainty has not been calibrated, it may be unreliable when used directly for unknown class recognition. A well-calibrated model should be certain when it predicts accurately and give high uncertainty when it may be inaccurate. Moreover, it has been shown that the miscalibration of neural networks is related to the over-fitting of the negative log-likelihood [26]. Since the DEMTR objective in Equation (6) is equivalent to minimizing the negative log-likelihood, the trained model is likely to be over-fitted with poor generalization for open-set malware traffic recognition tasks. To calibrate the DEMTR model, we follow the principles of [27] and maximize the Accuracy versus Uncertainty (AvU) utility function:
$$\mathrm{AvU} = \frac{n_{AC} + n_{IU}}{n_{AC} + n_{AU} + n_{IC} + n_{IU}} \tag{7}$$
where $n_{AC}$, $n_{AU}$, $n_{IC}$, and $n_{IU}$ denote the number of samples in the following four cases: accurate and certain (AC), accurate and uncertain (AU), inaccurate and certain (IC), and inaccurate and uncertain (IU), respectively. Figure 3 shows a toy example of the four possible model outputs. To calibrate the predictive uncertainty, the model is encouraged to learn a skewed and sharp Dirichlet distribution for accurate predictions (see Figure 3a) and an unbiased and flat Dirichlet distribution on the simplex for incorrect predictions (see Figure 3d). To this end, we propose regularizing the model training process by maximizing the expectations of the AC and IU cases. We establish a logarithm constraint between the maximum class probability and uncertainty to maximize the AvU function, defining $L_{AvU}$ as:
$$L_{AvU}^{(i)} = -\log\left( p_m^{(i)} (1 - u^{(i)}) + (1 - p_m^{(i)}) u^{(i)} \right) \tag{8}$$
where $p_m^{(i)}$ is the maximum class probability of $x^{(i)}$ and $u^{(i)}$ is the corresponding evidential uncertainty. The class probability $p_k^{(i)}$ should converge to 1 when the model predictions are accurate; otherwise, 0. Similarly, when the model predictions are certain, the uncertainty $u^{(i)}$ should converge to 0, but $u^{(i)} \to 1$ when uncertain. $L_{AvU}$ is 0 only if all accurate predictions are certain and all inaccurate predictions are uncertain. The AvU loss function is designed to improve the uncertainty calibration as an additional penalty term in conjunction with existing loss functions. In summary, the objective of the DEMTR model is:
$$L_{DEMTR} = \sum_{i=1}^{N} \left( L_{mce}^{(i)} + L_{AvU}^{(i)} \right) \tag{9}$$
where $N$ denotes the total number of samples in the training set.
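The calibration terms can be sketched as follows. The per-sample loss is taken here as the negative logarithm of $p_m (1 - u) + (1 - p_m) u$, which is an assumption consistent with the statement that the loss is 0 only when accurate predictions are certain and inaccurate ones are uncertain. The uncertainty threshold used to count the AC, AU, IC, and IU cases, the small constant guarding the logarithm, and both function names are likewise hypothetical.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0); an implementation detail assumed here


def avu_loss(p_max: float, u: float) -> float:
    """Per-sample penalty from the maximum class probability and the uncertainty."""
    inside = p_max * (1.0 - u) + (1.0 - p_max) * u
    return -math.log(max(inside, EPS))


def avu_utility(correct: Sequence[bool], uncertainty: Sequence[float],
                u_threshold: float = 0.5) -> float:
    """Fraction of samples that are accurate-and-certain or inaccurate-and-uncertain."""
    n_ac = n_au = n_ic = n_iu = 0
    for is_correct, u in zip(correct, uncertainty):
        certain = u <= u_threshold
        if is_correct and certain:
            n_ac += 1
        elif is_correct:
            n_au += 1
        elif certain:
            n_ic += 1
        else:
            n_iu += 1
    total = n_ac + n_au + n_ic + n_iu
    return (n_ac + n_iu) / total if total else 0.0
```

For example, a confident and certain prediction (`p_max = 1.0`, `u = 0.0`) incurs no penalty, whereas `p_max = 0.5`, `u = 0.5` incurs a penalty of $\log 2$.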

4. Experiments and Analysis

4.1. Data Set and Experimental Environment

This study evaluates the proposed approach using the MCFP dataset [28]. MCFP consists of raw traffic data collected from real network environments and stores the data in the form of PCAP files. Besides, MCFP covers multiple types of malicious software with a great amount of data. In this study, 20 types of attack traffic are randomly selected from MCFP, 10 making a known class dataset and the rest forming an unknown class dataset. Furthermore, to maximize known data utilization while accurately reflecting the effectiveness of the model, the known class dataset is separated into the training set, validation set, and known class test set in the proportion 8:1:1, as suggested by [29]. The unknown class dataset is devoted entirely to tests. The details of known and unknown attacks used in experiments are described in Table 1.

4.2. Evaluation Metrics

Since the proposed DEMTR is essentially a $(K + 1)$-class classification model, the accuracy and F1-score are used as quantitative assessment indicators to reflect the model’s performance changes in the open-set setting. These indicators are defined below.
Accuracy: The proportion of test samples that the model predicts correctly, i.e., the rate of correct sample classification.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$
F1-Score: The harmonic mean of precision and recall, which gives both the same weight. The F1-score strikes a balance between precision and recall and therefore reflects the comprehensive performance of the model.
$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{13}$$
Assuming that $C$ represents a class in the test set, $TP$ is the number of samples correctly classified as $C$, $TN$ is the number of samples correctly classified as non-$C$, $FP$ is the number of samples incorrectly classified as $C$, and $FN$ is the number of samples incorrectly classified as non-$C$.
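The per-class definitions above translate directly into code. The sketch below computes the four counts for a class $C$ and the resulting metrics; averaging the per-class F1-scores with equal weight (macro averaging) is an assumption made for illustration, since the averaging scheme is not specified here. The function names are hypothetical, and the label "unknown" is treated as one more class.

```python
from typing import Dict, Hashable, Sequence


def class_metrics(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                  cls: Hashable) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score with respect to one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    tn = sum(1 for t, p in pairs if t != cls and p != cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1-scores (an assumed averaging scheme)."""
    classes = set(y_true) | set(y_pred)
    if not classes:
        return 0.0
    return sum(class_metrics(y_true, y_pred, c)["f1"] for c in classes) / len(classes)
```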

4.3. Experimental Results and Analysis

The experiments in this study consist of three stages: The initial stage aims to determine the specific values of the hyper-parameters involved in data preprocessing in Section 3.2. In the second and third stages, based on the work in stage 1, the MCFP dataset is used to compare the proposed DEMTR with works in uncertainty estimation and open-set malware traffic recognition, respectively.

4.3.1. Comparison of Hyper-Parameters in Feature Extraction

The data preprocessing step in this paper involves a pair of hyper-parameters $(n_1, n_2)$ whose product gives the length of the spatial feature vector extracted from each session. To meet the input requirements of a 2D-CNN, some researchers fix this length to a perfect square, such as 784 (28 × 28) or 1521 (39 × 39), so that the features can be arranged as a square two-dimensional matrix. Since this research processes spatial features with a 1D-CNN, that constraint does not apply.
The hyper-parameter $n_1$ is the number of packets selected from each session. Statistical analysis of the traffic sessions in the dataset shows that more than 95% of the sessions have more than 5 packets, and more than 95% have fewer than 30 packets. Therefore, $n_1$ ranges over [5, 30] with an interval of 5, giving six options: {5, 10, 15, 20, 25, 30}. The hyper-parameter $n_2$ is the number of bytes kept from each packet. The maximum transmission unit specified by the Ethernet protocol limits packet length to 1500 bytes. The network layer and transport layer protocol headers of encrypted traffic packets are retained because they are useful for attack identification. To cover as much as possible of the protocol header data and of the transport layer payload of specific data transmission steps, $n_2$ ranges from 100 to 1500 with an interval of 200 in this experiment, giving eight options: {100, 300, 500, 700, 900, 1100, 1300, 1500}. There are therefore 48 combinations of $n_1$ and $n_2$, and traversing all of them captures the dependency between $n_1$ and $n_2$ and helps find a setting with better recognition performance.
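The search grid and the way a session is reduced to $n_1 \times n_2$ bytes can be sketched as follows. This is an illustrative sketch only: the helper `session_to_vector` is hypothetical, and zero-padding short sessions and short packets is an assumption, since the preprocessing details of Section 3.2 are not repeated here.

```python
from itertools import product

N1_OPTIONS = range(5, 31, 5)        # packets per session: 5, 10, ..., 30
N2_OPTIONS = range(100, 1501, 200)  # bytes per packet: 100, 300, ..., 1500

# Every (n1, n2) pair evaluated in the hyper-parameter experiment.
grid = list(product(N1_OPTIONS, N2_OPTIONS))
print(len(grid))


def session_to_vector(packets, n1, n2):
    """Keep the first n1 packets and the first n2 bytes of each packet.

    Missing packets and short packets are zero-padded (an assumption), so the
    result always has length n1 * n2 and can be fed to a 1D-CNN.
    """
    vector = []
    for i in range(n1):
        raw = packets[i][:n2] if i < len(packets) else b""
        vector.extend(raw)                    # iterating bytes yields ints 0..255
        vector.extend([0] * (n2 - len(raw)))  # pad up to n2 bytes
    return vector


features = session_to_vector([b"\x01\x02\x03", b"\xff"], n1=20, n2=500)
print(len(features))
```

With the setting selected in this paper, (20, 500), each session becomes a vector of 10,000 byte values.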
For the malware traffic dataset, Table 2 presents the F1-score of the DEMTR model for different $(n_1, n_2)$. As an overall trend, the F1-score rises with the number of packets selected from a traffic session and with the number of bytes kept from each packet. In general, feeding more bytes and packets into the model gives better results at the cost of longer training time. However, the experimental results show that the F1-score is not simply a positive linear function of $(n_1, n_2)$. Blindly increasing the dimension of the extracted features may cause the effective features to interfere with one another, weakening the model's ability to identify malicious traffic. Among all the cases listed in Table 2, the F1-score peaks at $n_1 = 30$ and $n_2 = 1500$, and the second-best F1-score is achieved at $n_1 = 20$ and $n_2 = 500$. Considering the effort of feature extraction and the time and resources required for training, $(n_1, n_2)$ is set to (20, 500).
In the hyper-parameter comparison experiments, only 800 samples were randomly selected for each class to shorten the running time and save computing resources. Although this number of training samples was insufficient for the model to fit the data distribution completely, the experiments still captured the impact of the hyper-parameters on identification performance and guided the choice of hyper-parameters.

4.3.2. Comparison in Uncertainty Estimation

To show that the predictive uncertainty derived from the DEMTR model can distinguish between known classes and unknown classes, this paper compares it with two representative uncertainty estimation methods, i.e., BNN SVI [30] and MC Dropout [31]. BNN SVI provides predictive uncertainty by approximating the posterior distribution of the neural network parameters. MC Dropout, on the other hand, uses Dropout as a regularization term and repeats the inference several times to calculate the predictive uncertainty. The unknown detection performance can be evaluated from the histograms in Figure 4. According to the figure, the uncertainty intervals of known class samples and unknown class samples overlap heavily for BNN SVI and MC Dropout. In comparison, the proposed method assigns smaller uncertainty to known class instances and larger uncertainty to unknown class instances, so its uncertainty separates the known classes from the unknown classes more clearly.
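For readers unfamiliar with evidential deep learning, the sketch below shows how class probabilities and a single uncertainty value are obtained from non-negative evidence in the standard formulation of [21]: $\alpha_k = e_k + 1$, $S = \sum_k \alpha_k$, $p_k = \alpha_k / S$, and $u = K / S$. It assumes DEMTR follows this formulation; the function name and the example evidence vectors are hypothetical.

```python
def dirichlet_summary(evidence):
    """Map per-class evidence to expected probabilities and uncertainty.

    Standard evidential deep learning quantities (an assumption about DEMTR):
    alpha_k = e_k + 1, S = sum(alpha), p_k = alpha_k / S, u = K / S.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength
    return probs, uncertainty


# Strong evidence for class 0: low uncertainty, typical of a known class sample.
print(dirichlet_summary([40.0, 1.0, 1.0]))
# Almost no evidence for any class: uncertainty close to 1, typical of an unknown sample.
print(dirichlet_summary([0.2, 0.1, 0.3]))
```

The uncertainty lies in (0, 1] and equals 1 exactly when the network produces no evidence at all, which is the behavior one would want for traffic from an unseen attack.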

4.3.3. Comparison in Malware Traffic Open Set Recognition

We compare the proposed DEMTR with state-of-the-art malware traffic recognition models to verify its effectiveness. The baseline models are: (i) CNN, a one-dimensional convolutional neural network with a softmax output, which uses the softmax probability directly as a confidence score and filters out low-confidence samples to adapt to open-set scenarios; (ii) CNN_LSTM [32], a state-of-the-art intrusion detection model based on deep learning; and (iii) Open-CNN [8], a model that applies statistical extreme value theory and a convolutional neural network to unknown attack detection.
The proposed model is trained with the loss $L_{DEMTR}$ instead of the traditional cross-entropy loss for 20 iterations, with a batch size of 256. It uses the Adam optimizer with an initial learning rate of 0.0001, and the learning rate decays every 7 iterations. The rejection thresholds of CNN and DEMTR are set so that 95% of the training data is recognized as known. The Open-CNN and CNN_LSTM implementations follow the corresponding literature. All models were trained on the training set and tested on single-type unknown attacks and on multiple types of unknown attacks.
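One way to realize the 95% rule for an uncertainty-based model is to take the 95th percentile of the uncertainties observed on the training data as the threshold, and to label a test sample as unknown when its uncertainty exceeds that value. The sketch below is an assumption about the procedure, not the authors' code; `known_threshold`, `predict_open_set`, and the label `-1` for the unknown class are hypothetical.

```python
def known_threshold(train_uncertainties, keep_percent=95):
    """Smallest threshold t such that at least keep_percent% of the
    training samples satisfy uncertainty <= t (and so count as known)."""
    ordered = sorted(train_uncertainties)
    # Ceiling division in integers avoids floating-point rounding issues.
    rank = (keep_percent * len(ordered) + 99) // 100
    return ordered[max(rank - 1, 0)]


def predict_open_set(class_probs, uncertainty, threshold, unknown_label=-1):
    """Reject as unknown when uncertainty is above the threshold,
    otherwise return the index of the most probable known class."""
    if uncertainty > threshold:
        return unknown_label
    return max(range(len(class_probs)), key=class_probs.__getitem__)


train_u = [i / 100 for i in range(100)]  # toy training uncertainties 0.00 .. 0.99
threshold = known_threshold(train_u)
print(threshold)
print(predict_open_set([0.1, 0.8, 0.1], uncertainty=0.50, threshold=threshold))
print(predict_open_set([0.1, 0.8, 0.1], uncertainty=0.97, threshold=threshold))
```

For the softmax CNN baseline the same idea applies with the direction reversed: the confidence is the maximum softmax probability, and a sample is rejected when that confidence falls below the threshold.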
  • Tests on Single-Type Unknown Attacks
To evaluate the performance of the proposed DEMTR model in detecting a single type of unknown attack, a type of unknown attack is added to the known class test set in each round of experiments. Moreover, the accuracy and F1-score are calculated to reflect the unknown class recognition performance. Table 3 displays the experimental results of the proposed DEMTR model and baseline models.
As can be seen from Table 3, the proposed DEMTR model achieves the best accuracy and F1-score for every class of unknown attack. Compared with the state-of-the-art malware traffic identification method CNN_LSTM, DEMTR greatly improves unknown class detection, with a relative increase in accuracy of up to about 70% (CoinMiner: from 49.18% to 83.73%). Accuracy, however, is easily affected by class imbalance, so the F1-score better reflects the overall recognition performance. The F1-score of DEMTR exceeds that of CNN_LSTM by up to about 21% in relative terms (CoinMiner: from 71.02% to 85.65%). This result supports its effectiveness in the unknown attack recognition task. It is also worth noting that CNN_LSTM is a closed-set classification model, so it assigns every unknown class instance that appears in the testing phase to one of the known classes. Its performance is therefore far worse than that of the proposed DEMTR model.
Comparing three open-set models (CNN model, Open-CNN model, and DEMTR model), the DEMTR model has the best recognition performance, the Open-CNN model has the second best, and the CNN model has the worst. The CNN model’s lowest accuracy and F1-score indicate that it is unreasonable to directly use the prediction probability as the condition to judge the unknown class. The explanation is that the neural network based on the softmax function also gives a high confidence value for misclassification and is overconfident in predictions. Open-CNN can identify the unknown class to some extent because it uses the OpenMax layer and outputs the predicted probability of an unknown class.
  • Tests on Multiple Types of Unknown Attacks
To investigate the impact of unknown attacks of different classes on the proposed algorithm, we drew the curve of the F1-score and openness to show the change in the F1-score with the increase in openness. Openness is an important concept in open-set recognition problems that indicates how “open” the problem is. In this experiment, if N and K are used to represent the number of known and unknown classes, the openness can be more accurately expressed as:
$$\mathrm{openness} = 1 - \sqrt{\frac{2N}{2N + K}}$$
In the experiment, the recognition classifier is trained on the known class training set, and unknown classes are then added gradually to the known class test set for testing. As there are 10 unknown classes, K ranges from 0 to 10, with a larger value implying a larger openness. For each openness value, K new classes are randomly selected from the unknown class test set, and the final F1-score is calculated by averaging over ten random selections.
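To give a sense of the scale of the openness axis, the sketch below evaluates the openness for N = 10 known classes and K = 0, ..., 10 unknown classes. It assumes the formula reads $1 - \sqrt{2N / (2N + K)}$, which is consistent with openness being 0 when no unknown class is added; the function name is hypothetical.

```python
import math


def openness(n_known, k_unknown):
    """Openness for n_known training classes and k_unknown extra test classes,
    assuming the form 1 - sqrt(2N / (2N + K))."""
    return 1.0 - math.sqrt(2.0 * n_known / (2.0 * n_known + k_unknown))


for k in range(0, 11):
    print(k, round(openness(10, k), 4))
```

Under this assumption, the openness grows monotonically with K and stays below 0.2 even when all ten unknown classes are added.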
Figure 5 shows that the proposed DEMTR model achieves the best performance. According to Figure 5, when the openness is 0 (i.e., no unknown classes have been added yet), the F1-scores of the CNN model and the CNN_LSTM model exceed 95%. This indicates that traditional deep learning-based malware traffic recognition models can perform well when the test data does not contain any unknown class. However, once unknown classes are added to the test set, the recognition performance of CNN and CNN_LSTM deteriorates rapidly. As the proportion of unknown classes in the test set grows with increasing openness, the F1-score curves of all four models decline. The proposed method shows the smallest decrease, and its gap over the other methods keeps widening, which indicates its robustness in detecting unknown classes. It is worth noting that the closed-set accuracy of Open-CNN is considerably lower than that of the other methods. The reason is that Open-CNN directly modifies the activation layer vector and adds a new class named unknown at inference, which may negatively affect the prediction of known class instances.

5. Conclusions

In this study, an uncertainty-aware open-set recognition method for malware traffic is proposed to address the problem that traditional closed-set traffic classification models misclassify unknown classes with high confidence. First, using the original traffic data as features for malicious traffic identification retains all the feature information of the original packets and avoids the information loss caused by manual feature extraction. Next, the Deep Evidence Malware Traffic Recognition model (DEMTR) is built to quantify both the multi-classification probability and the predictive uncertainty. The predictive uncertainty is used to distinguish unknown samples from known samples. The effectiveness of the proposed method is verified experimentally on real traffic datasets. The experimental results show that the predictive uncertainty generated by DEMTR better reflects the credibility of predictions, which helps detect samples with semantic shifts. Overall, the proposed method adapts well to open-set scenarios while maintaining high performance in traditional closed-set recognition settings.

Author Contributions

Conceptualization, X.L.; methodology, X.L. and J.F.; validation, X.L. and J.X.; data curation, X.L. and R.W.; writing—original draft preparation, X.L.; writing—review and editing, X.L., D.L., and Z.Q.; visualization, X.L. and H.J.; funding acquisition, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Project of China, grant number 2019QY1302.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to express their gratitude to EditSprings (https://www.editsprings.com/ (accessed on 16 December 2022)) for the expert linguistic services provided.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, X.; Hu, Q.; Jie, Z. Cyber Intrusion Detection Based on a Mutative Scale Chaotic Bat Algorithm with Backpropagation Neural Network. Secur. Commun. Netw. 2022.
  2. Rezaei, S.; Liu, X. Deep learning for encrypted traffic classification: An overview. IEEE Commun. Mag. 2019, 57, 76–81.
  3. Este, A.; Gringoli, F.; Salgarelli, L. Support vector machines for TCP traffic classification. Comput. Netw. 2009, 53, 2476–2490.
  4. Zhang, J.; Chen, X.; Xiang, Y.; Zhou, W.; Wu, J. Robust network traffic classification. IEEE/ACM Trans. Netw. 2014, 23, 1257–1270.
  5. Feng, Y.Y. Research of Network Intrusion Detection Methods. Ph.D. Thesis, North University, Taiyuan, China, 2021.
  6. Rudd, E.M.; Rozsa, A.; Gunther, M.; Boult, T.E. A survey of stealth malware attacks, mitigation measures, and steps toward autonomous open world solutions. IEEE Commun. Surv. Tutor. 2016, 19, 1145–1172.
  7. Steve, C.; Coleman, C.; Rudd, E.M.; Boult, T.E. Open set intrusion recognition for fine-grained attack categorization. In Proceedings of the 2017 IEEE International Symposium on Technologies for Homeland Security (HST), Waltham, MA, USA, 25–26 April 2017; IEEE: Piscataway, NJ, USA, 2017.
  8. Zhang, Y.; Niu, J.; Guo, D.; Teng, Y.; Bao, X. Unknown network attack detection based on open set recognition. Procedia Comput. Sci. 2020, 174, 387–392.
  9. Yin, C.; Zhu, Y.; Fei, J.; He, X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 2017, 5, 21954–21961.
  10. Scheirer, W.J.; Rocha, A.; Sapkota, A.; Boult, T.E. Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 1757–1772.
  11. Scheirer, W.J.; Jain, L.P.; Boult, T.E. Probability models for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2317–2324.
  12. Jain, L.P.; Scheirer, W.J.; Boult, T.E. Multi-class open set recognition using probability of inclusion. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014.
  13. Bendale, A.; Boult, T.E. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  14. Ge, Z.; Demyanov, S.; Garnavi, R. Generative openmax for multi-class open set classification. arXiv 2017, arXiv:1707.07418.
  15. Ditria, L.; Meyer, B.J.; Drummond, T. OpenGAN: Open set generative adversarial networks. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
  16. Dmitri, B.; Shapira, B.; Rokach, L.; Bar, A. Unknown malware detection using network traffic classification. In Proceedings of the 2015 IEEE Conference on Communications and Network Security (CNS), Florence, Italy, 28–30 September 2015; IEEE: Piscataway, NJ, USA, 2015.
  17. Niyaz, Q.; Sun, W.; Javaid, A.; Alam, M. A deep learning approach for network intrusion detection system. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), New York, NY, USA, 24 May 2016.
  18. Chen, Y.; Li, Z.; Shi, J.; Gou, G.; Liu, C.; Xiong, G. Not afraid of the unseen: A siamese network based scheme for unknown traffic discovery. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 7–10 July 2020; IEEE: Piscataway, NJ, USA, 2020.
  19. Gawlikowski, J.; Tassi, C.R.N.; Ali, M.; Lee, J.; Humt, M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Zhu, X.X. A survey of uncertainty in deep neural networks. arXiv 2021, arXiv:2107.03342.
  20. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 15 March 2017.
  21. Sensoy, M.; Kaplan, L.; Kandemir, M. Evidential deep learning to quantify classification uncertainty. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3 December 2017.
  22. Anderson, J.P. Technical Report: Computer Security Threat Monitoring and Surveillance; James P. Anderson Company: Washington, DC, USA, 1980.
  23. Moon, J.; Kim, J.; Shin, Y.; Hwang, S. Confidence-aware learning for deep neural networks. In Proceedings of the International Conference on Machine Learning, ICML, Vienna, Austria, 12–18 July 2020.
  24. Chen, M.H.; Zhu, Y.F.; Lu, B.; Zhai, Y.; Li, D. Classification of application type of encrypted traffic based on Attention-CNN. Comput. Sci. 2021, 48, 325–332.
  25. Audun, J. Subjective Logic: A Formalism for Reasoning under Uncertainty; Springer: Berlin/Heidelberg, Germany, 2018.
  26. Mukhoti, J.; Kulharia, V.; Sanyal, A.; Golodetz, S.; Torr, P.; Dokania, P. Calibrating deep neural networks using focal loss. In Proceedings of the Advances in Neural Information Processing Systems 33, Seattle, WA, USA, 6–12 December 2020; pp. 15288–15299.
  27. Krishnan, R.; Tickoo, O. Improving model calibration with accuracy versus uncertainty optimization. In Proceedings of the Advances in Neural Information Processing Systems, Seattle, WA, USA, 6–12 December 2020; pp. 18237–18248.
  28. Malware Capture Facility Project. Available online: https://www.stratosphereips.org/datasets-malware (accessed on 16 December 2022).
  29. Buczak, A.L.; Guven, E. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Commun. Surv. Tutor. 2015, 2022, 1153–1176.
  30. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015.
  31. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
  32. Zhang, Y.; Chen, X.; Jin, L.; Wang, X.; Guo, D. Network intrusion detection: Based on deep hierarchical network and original flow data. IEEE Access 2019, 7, 37004–37016.
Figure 1. The framework of the proposed method.
Figure 2. The structure of the DEMTR model.
Figure 3. Examples of the Dirichlet distribution (triple classification is taken as an example, where the sample label is the first category). The Dirichlet distribution is different when the predictions of model are (a) accurate and certain (AC), (b) accurate and uncertain (AU), (c) inaccurate and certain (IC), (d) inaccurate and uncertain (IU).
Figure 4. Histogram of uncertainty distribution for (a) BNN SVI, (b) MC Dropout and (c) the proposed DEMTR.
Figure 5. The relationship between F1-score and openness.
Table 1. The details of known and unknown attacks used in experiments.
| Dataset | Known Attacks | Number | Unknown Attacks | Number |
|---|---|---|---|---|
| MCFP | Dridex | 54331 | njRAT | 51324 |
| | Sathurbot | 55402 | HTBot | 52069 |
| | TrickBot | 51489 | Hancitor | 5421 |
| | Emotet | 43913 | CoinMiner | 53461 |
| | Trojan_Downloader | 52645 | WebCompanion | 23451 |
| | Locky | 49708 | WannaCry | 44290 |
| | Trojan_Dynamer | 65029 | Simda | 34162 |
| | Mirai | 54994 | Miuref | 17177 |
| | Sality | 51034 | neeris | 35790 |
| | Tinba | 58038 | Vawtrak | 46977 |
| | total | 536583 | total | 364122 |
| | $D_{tr} : D_{va} : D_{te} = 8 : 1 : 1$ | | $D_{te}$ | |
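Table 2. The results of hyper-parameter experiments (F1-score, %). Rows give $n_1$; columns give $n_2$.
| $n_1$ \ $n_2$ | 100 | 300 | 500 | 700 | 900 | 1100 | 1300 | 1500 |
|---|---|---|---|---|---|---|---|---|
| 5 | 46.50 | 51.19 | 46.85 | 46.59 | 48.82 | 53.35 | 47.19 | 54.04 |
| 10 | 57.78 | 57.77 | 61.39 | 55.41 | 55.85 | 59.45 | 59.63 | 51.08 |
| 15 | 49.97 | 60.57 | 64.85 | 60.73 | 64.99 | 57.61 | 57.40 | 59.13 |
| 20 | 59.81 | 58.80 | 66.54 | 66.24 | 55.14 | 65.16 | 65.67 | 57.04 |
| 25 | 48.41 | 62.88 | 64.76 | 65.21 | 52.49 | 60.81 | 55.55 | 59.22 |
| 30 | 56.17 | 60.20 | 64.31 | 65.35 | 61.71 | 64.61 | 64.05 | 72.43 |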
Table 3. Performance comparison between the proposed DEMTR model and the baselines in detecting single-type unknown attacks.
| Unknown Attack | Accuracy (%), CNN | Accuracy (%), CNN_LSTM | Accuracy (%), Open-CNN | Accuracy (%), DEMTR | F1-Score (%), CNN | F1-Score (%), CNN_LSTM | F1-Score (%), Open-CNN | F1-Score (%), DEMTR |
|---|---|---|---|---|---|---|---|---|
| njRAT | 50.00 | 50.19 | 52.94 | 72.50 | 68.28 | 70.44 | 71.16 | 81.44 |
| HTBot | 49.65 | 49.83 | 51.71 | 71.14 | 66.78 | 67.86 | 65.99 | 78.28 |
| Hancitor | 88.85 | 89.18 | 88.85 | 89.85 | 85.50 | 85.68 | 85.91 | 90.18 |
| CoinMiner | 49.01 | 49.18 | 57.38 | 83.73 | 69.74 | 71.02 | 73.79 | 85.65 |
| WebCompanion | 68.08 | 68.33 | 71.21 | 84.39 | 77.93 | 78.43 | 79.05 | 87.97 |
| WannaCry | 53.59 | 53.79 | 53.57 | 90.44 | 77.12 | 79.18 | 80.85 | 90.57 |
| Simda | 59.78 | 59.99 | 72.98 | 73.46 | 74.33 | 75.17 | 84.18 | 85.25 |
| Miuref | 74.11 | 74.38 | 78.62 | 79.17 | 78.82 | 80.43 | 83.68 | 84.69 |
| neeris | 58.69 | 58.90 | 63.34 | 64.86 | 76.04 | 75.56 | 79.77 | 80.16 |
| Vawtrak | 52.16 | 52.35 | 52.46 | 69.33 | 72.07 | 72.01 | 72.57 | 78.28 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, X.; Fei, J.; Xie, J.; Li, D.; Jiang, H.; Wang, R.; Qi, Z. Open Set Recognition for Malware Traffic via Predictive Uncertainty. Electronics 2023, 12, 323. https://doi.org/10.3390/electronics12020323
