Next Article in Journal
Bit-Weight Adjustment for Bridging Uniform and Non-Uniform Quantization to Build Efficient Image Classifiers
Previous Article in Journal
Electrochemical Impedance Spectrum Equivalent Circuit Parameter Identification Using a Deep Learning Technique
Previous Article in Special Issue
A Novel Cloud Enabled Access Control Model for Preserving the Security and Privacy of Medical Big Data
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Introducing the UWF-ZeekDataFall22 Dataset to Classify Attack Tactics from Zeek Conn Logs Using Spark’s Machine Learning in a Big Data Framework

1
Department of Computer Science, University of West Florida, Pensacola, FL 32514, USA
2
Department of Mathematics and Statistics, University of West Florida, Pensacola, FL 32514, USA
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(24), 5039; https://doi.org/10.3390/electronics12245039
Submission received: 1 November 2023 / Revised: 14 December 2023 / Accepted: 14 December 2023 / Published: 18 December 2023
(This article belongs to the Special Issue Security and Privacy Issues and Challenges in Big Data Era)

Abstract

:
This study introduces UWF-ZeekDataFall22, a newly created dataset labeled using the MITRE ATT&CK framework. Although the focus of this research is on classifying the never-before classified resource development tactic, the reconnaissance and discovery tactics were also classified. The results were also compared to a similarly created dataset, UWF-ZeekData22, created in 2022. Both of these datasets, UWF-ZeekDataFall22 and UWF-ZeekData22, created using Zeek Conn logs, were stored in a Big Data Framework, Hadoop. For machine learning classification, Apache Spark was used in the Big Data Framework. To summarize, the uniqueness of this work is its focus on classifying attack tactics. For UWF-ZeekdataFall22, the binary as well as the multinomial classifier results were compared, and overall, the results of the binary classifier were better than the multinomial classifier. In the binary classification, the tree-based classifiers performed better than the other classifiers, although the decision tree and random forest algorithms performed almost equally well in the multinomial classification too. Taking training time into consideration, decision trees can be considered the most efficient classifier.

1. Introduction

Due to the remarkable advancements in technology and increase in the number of internet users in recent times, there has been an unprecedented surge in the amount of data generated and exchanged over the internet. Millions of financial transactions are being conducted on the internet and cloud-based systems store a plethora of sensitive data, including personal, health, and other sensitive data. Given the ever-increasing magnitude of data collected and utilized, ensuring robust security measures and safeguarding against unauthorized network intrusions have become equally imperative. Also, due to the rise in network attacks, it has become very important to detect intrusions (attacks) before they happen rather than after they happen. Therefore, in order to develop stronger network intrusion systems that detect attacks before they happen, it is important to understand what an adversary is planning and how the adversary is planning to attack. The MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) framework defines tactics as being the motives of the adversary. The MITRE ATT&CK framework, which is a foundation of threat models and methodologies, is based on 14 tactics and several techniques belonging to more than one tactic [1]. Hence, data labeled as per the MITRE ATT&CK framework help us study the various tactics of the adversary. This research classifies the data that lead up to three adversary tactics, defined and labeled as per the MITRE ATT&CK framework: reconnaissance (TA0043) [2], discovery (TA0007) [3], and resource development (TA0042) [4]. A reconnaissance tactic is a phase in which an adversary attempts to gather information about a target in order to attack it [2]. A discovery tactic is a phase in which an adversary attempts to discover the target’s system resources and vulnerabilities [3]. Resource development tactics are when an adversary uses techniques to compromise resources like infrastructure, accounts, or other capabilities that can be used to support targeting [4]. Although there has been some classification work on the reconnaissance and discovery tactics [1], there has been no classification work as of yet on the resource development tactic.
This research classifies these tactics using two different datasets, UWF-ZeekData22 [5], created in 2022, and UWF-ZeekDataFall22 [5], created in 2023, both created using Zeek Conn logs. Due to the amounts of data involved, both datasets are stored in a Big Data framework, Hadoop, which is a distributed file system (Hadoop Distributed File System, HDFS). The HDFS is a scalable fault-tolerant system that allows the storage and processing of Big Data across a cluster of computers [6]. For machine learning (ML) classification, Apache Spark, an open-source computing framework that sits on top of Hadoop, is used. Spark is an in-memory cluster computing framework for processing and analyzing large amounts of data. Key features of Spark are that it is fast, scalable, and fault tolerant [7]. For the classification and comparative analysis of the performance of these two datasets, several machine learning (ML) libraries of Apache Spark are used in this work.
To summarize, the uniqueness of this work is its focus on classifying tactics rather than attacks. This work focuses on classifying the resource development tactic of the MITRE ATT&CK framework; the resource development tactic has not been previously analyzed or classified. Second, this classification is performed on a newly created dataset, UWF-ZeekDataFall22, uniquely created using Zeek and an open-source traffic analyzer [8] and labeled using the MITRE ATT&CK framework. To optimize the workings of Spark, this study also looks at the effect of the executor count, executor core count, total executor cores, executor memory, and total executor memory allocations of Spark. Furthermore, in addition to classifying the tactics of UWF-ZeekDataFall22, this study also compares the results with a similarly created dataset, UWF-ZeekData22.
This paper is organized as follows: Section 2 covers related work; Section 3 explains the data used in this work; Section 4 describes the experimentation, that is, everything from the binning to calculating the information gain; Section 5 presents the machine learning algorithms used in this work; and Section 6 presents the results. Section 7 presents a summary of the key finding and Section 8 presents the conclusions.

2. Related Work

Many experiments have been carried out in the cybersecurity field to determine the best classifier for detecting network attacks. Mebawondu et al. (2022) [9] compared the effectiveness of naïve Bayes with the C4.5 decision tree algorithm using the UNSW-NB15 dataset. In this work, the data pre-processing stage employed min–max normalization to ensure that feature values fell inside a predetermined range. They also used information gain and gain ratio algorithms to extract relevant features. In this work, the decision tree algorithm was found to perform better than naïve Bayes.
Panda and Patra (2007) [10] used the KDD’99 dataset, a subset of the DARPA dataset. Only 10% of the dataset was used in their experiment, and naïve Bayes was used for classification. The final results were compared to backpropagation neural network (BPN) results for the same dataset. Naïve Bayes was, surprisingly, found to have a higher detection rate and took less time than the BPN, but it produced more false positives.
Tufail et al. (2022) [11] compared the performance of shallow neural networks (SNN) and logistic regression in predicting DDoS assaults. They used “Divide and Randomized Conquer” to create datasets of 100 k and 1000 k packets from their initial raw dataset, which had approximately 12 million packets and 84 characteristics. It was found that the accuracy of logistic regression was slightly lower than the accuracy of SNN; however, the training time for SNN was longer than that for logistic regression.
In another study by Kejriwal et al. (2022) [12], the intrusion detection accuracy of different algorithms like logistic regression, random forest, KNN, XGBoost, Gaussian naïve Bayes, and a multi-layer perceptron classifier (MLP) were compared. The dataset used in their study was CIC-IDS2017. Of all the ML algorithms, the random forest algorithm was found to have the best results. Logistic regression was found to perform better than naïve Bayes, which had the lowest performance of all.
In yet another comparative study, Disha and Waheed (2021) [13] used the GBT classifier for binary classification to determine network intrusions using the UNSW-NB15 dataset. A Chi-squared test was used to remove irrelevant features, and GBT was discovered to have the highest accuracy after decision trees.
In Swamy et al. (2012) [14], the decision tree (DT) classifier and min–max normalization algorithms were used as part of the intrusion-detection process using the KDDP99 dataset. The DT classifier had good accuracy, with lower false positive and true negative rates.
In another study, Mulay et al. (2010) [15] compared the performance of the decision tree classifier with support vector machines (SVMs). The accuracy of decision trees was found to be high, and their training and testing times were low compared to SVMs.
Jha and Ragha (2013) [16] used the SVM classifier on the NSL-KDD dataset, which consists of selected records from the KDD99 dataset. The information gain algorithm was used in their study to extract relevant features. It was found that the reduced dataset increased the detection rate of the SVM and also reduced the training and testing times.
Belouch et al. (2018) [17] compared four well-known classification algorithms, SVM, naïve Bayes, decision tree, and random forest, using Apache Spark with the UNSW-NB15 dataset. They found that random forest gave the best performance followed by decision tree and naïve Bayes.
All the above works performed classification on network attacks. There are no works on the classification of tactics, which is the novelty of our work. Moreover, none of the above works performed classification on tactics using the two new uniquely created datasets, UWF-ZeekData22 and UWF-ZeekDataFall22.

3. The Datasets

The Zeek Conn log MITRE ATT&CK framework labeled datasets, UWF-ZeekData22 and UWF-ZeekDataFall22, available in [5], generated using the Cyber Range at the University of West Florida (UWF), were used for this analysis. UWF-ZeekData22, created in 2022, had 9,280,869 attack records and 9,281,599 benign records [18], and a breakdown of this dataset’s tactics is also presented in Bagui et al. (2023) [18]. The new dataset, UWF-ZeekDataFall22, created in 2023, has 350,001 attack records and 350,339 benign records [5]. The breakdown of the attack tactics in the UWF-ZeekDataFall22 dataset is presented in Table 1.

3.1. Tactics in the UWF-ZeekDataFall22 Dataset

This dataset had the highest number of resource development tactics followed by reconnaissance and discovery; hence, in this work, we focus on these three tactics.
The resource development tactic encompasses multiple techniques, with a focus on obtaining resources to help support targeting. These techniques range from creating, purchasing, or illicitly obtaining resources to including, but not limited to, infrastructure, accounts, or capabilities [4].
The reconnaissance tactic involves various techniques that are conducted to facilitate both active and passive information gathering for target development. Techniques of reconnaissance could involve trying to identify potential attack surfaces and entry points by scanning ports, mapping the topology of the network, and researching the target’s online digital footprint [2].
The discovery tactic involves various directed techniques meant to determine network infrastructure details like identifying the exact services and versions running on remote hosts or collecting technical details about the local network architecture. Going beyond the reconnaissance tactic, the discovery tactic is set on gathering precise information about the target that can be used to plan and execute specific attacks. It allows attackers or security professionals to identify vulnerabilities, misconfigurations, and targets that can be exploited [2].

3.2. The Zeek Conn Log Files of the UWF-ZeekDataFall22 Dataset

The Zeek Conn log files of the UWF-ZeekDataFall22 dataset were used in this work. The Zeek Conn log files, explained in [5,19], track the protocols and associated information such as IP addresses, duration, two-way bytes, states, packets, and tunnel information. In short, the Conn log files provide all the data regarding the connection between two points. The full list of the attributes of the Conn log files is available in [5,19]. The “mitre_attack” attribute was added to label the data.

4. Experimentation

The datasets, stored in parquet format on HDFS, were read into Spark’s data frame for processing. Spark’s data frame, organized in the form of rows and columns, allows for the easier processing of large amounts of data.

4.1. Overview of Processing UWF-ZeekDataFall22

First, the input dataset, UWF-ZeekDataFall22, was binned. Since this dataset has several continuous attributes, these attributes were binned to smoothen the data and remove noise. Also, binning was used to prepare the data for algorithms that require discrete values, like decision trees and random forest algorithms as well as gradient-boosting trees. Binning also helps to address any over-fitting issue [20].
Figure 1 presents the overview of the experimentation. As presented in Figure 1, the binned data were then used to calculate information gain. Information gain was used to determine the relevance of the features or attributes. Machine learning (ML) using Spark was performed on this preprocessed dataset, and Spark’s optimum parameters were determined before ML was used. For the ML algorithms, the input dataset was divided into training and testing sets with a ratio of 80:20. Apache Spark’s machine learning libraries for naïve Bayes, logistic regression, random forest, gradient-boosting trees, support vector machines, and decision trees were used to train and detect the network tactics. To assess the effectiveness of the machine learning classifiers, various evaluation metrics, such as accuracy, precision, and recall, were utilized for comparison.

4.2. Preprocessing Using Binning

From the Zeek Conn log files of the UWF-ZeekDataFall22 dataset, the following features were binned:
  • dest_ip and src_ip for IP addresses;
  • dest_port and src_port for port numbers;
  • local_orig and local_resp, which are Boolean data types;
  • protocol, conn_state, history, and service, all of which are nominal attributes;
  • duration, orig_bytes, orig_pkts, orig_ip_bytes, resp_bytes, resp_pkts, resp_ip_bytes, and missed_bytes, which are all continuous valued attributes.

4.2.1. Binning IP Addresses

For binning the IP addresses, the commonly recognized network classifications [21] of A, B, C, D, and E were used, each of which pertains to specific ranges of the first octet in the IP address. Null and non-applicable values are assigned a value of 0, 0–127 octet values are assigned a value of 1, and so on. Table 2 presents the classification after binning the IP addresses.

4.2.2. Binning Port Numbers

Port numbers were binned following the Internet Assigned Numbers Authority (IANA) [22], administering ports 0 through 65,535. Binning was performed as per well-known ports, registered ports, and dynamic/private ports, as shown in Table 3. Table 3 presents the binning of port numbers in UWF-ZeekDataFall22.

4.2.3. Binning Booleans and Nominal Attributes

Booleans and nominal values are the non-numeric data in the dataset. The StringIndexer method from MLib [23], which is Apache Spark’s scalable machine learning library, was used to convert the non-numeric values into numbers. This method also converts the invalid or null values to integer values for binning. The columns binned using this algorithm were local_orig, local_resp, protocol, conn_state, history, and service, as presented in Table 4. Table 4 shows the number of bins generated for each specific attribute.

4.2.4. Binning Continuous Valued Attributes

Continuous values were binned as per [19]. Null values were dropped from the columns and then a 10% trim was performed from both ends. The trimmed version was used to calculate the mean and standard deviation, and the edges of the bins were generated as per the Algorithm 1, adopted from [19]:
Algorithm 1 Binning Continuous Valued Attributes
     if min_val >= 0:
          while mean_val − 2 * stddev_val < 0:
                mean_val += stddev_val
     edge0 = float(‘-inf’)
     edge1 = mean_val − stddev_val * 2
     edge2 = mean_val − stddev_val
     edge3 = mean_val
     edge4 = mean_val + stddev_val
     edge5 = mean_val + stddev_val * 2
     edge6 = float(‘inf’)
     edges = [edge0, edge1, edge2, edge3, edge4, edge5, edge6]
     edges_distinct = []
The Bucketizer function in Pyspark was used to generate bins using the edges. To maintain the desired number of bins and to avoid redundant bin ranges, the moving-mean logic, presented in Algorithm 2, was inserted during the establishment of the edges, adopted as per [19]:
Algorithm 2 Moving-mean Logic
     if min_val >= 0:
          while mean_val – 2 * stddev_val < 0:
                mean_val += stddev_val
The numbers of bins generated for all continuous valued attributes are presented in Table 5.

4.3. Information Gain

After the binning process was completed, information gain was used to assess the relevance of each of the 18 features from the Zeek Conn Log files of the UWF-ZeekDataFall22 dataset using the binned data. The information gain algorithm was used to extract the relevant attributes.
Information gain is the difference between a class’s entropy and the entropy of the class and a selected feature split, with entropy measuring the extent of randomness in the dataset [24]. It is an assessment of the usefulness of a feature in the classification.
The following calculations [24] were performed on each attribute to produce information gain values for ranking purposes:
G a i n ( A ) = I n f o D I n f o A ( D )
where
I n f o D = i = 1 m p i l o g 2 ( p i )
I n f o A D = j = 1 V D j D × I n f o D j
where
  • Info(D) is the average amount of information needed to identify the class level of a tuple in D;
  • InfoA(D) is the expected information required to classify a tuple from D based on the partitioning from A;
  • pi is the non-zero probability that an arbitrary tuple belongs to a class;
  • |Dj|/|D| is the weight of the partition.
Information gain was calculated for the full dataset and the three tactics, resource development, discovery, and reconnaissance, individually. The information gain values for the attributes in the Zeek Conn logs using the full dataset are presented in Table 6, and for resource development, discovery, and reconnaissance in Table 7, Table 8, and Table 9, respectively. An analysis of the results in the different information gain tables shows that the patterns of the attributes are very similar, although a couple of attributes might have been flipped. The last three attributes (attributes with an information gain value of zero) were the same in all cases.

5. Machine Learning Algorithms

Based on a review of the literature on the most commonly used machine learning algorithms for classification analysis and comparison, the following supervised machine learning algorithms were used in this work: decision tree, support vector machine, random forest, naïve Bayes, logistic regression, and gradient-boosting tree.

5.1. Decision Trees

A decision tree (DT) algorithm follows a tree-like model. At the root of the tree is the attribute or feature with the highest information gain. The next level of the tree is determined by the attribute with the next-highest information gain, and so on. Thus, the algorithm works by recursively splitting the data into subsets based on the most significant feature at each node of the tree.

5.2. Support Vector Machines

Support Vector Machines (SVM) work by mapping data to a high-dimensional feature space so that data points can be categorized even when the data are not otherwise linearly separable. A separator between the categories is found. Then, the data are transformed in such a way that the separator can be drawn as a hyperplane which separates the data into classes.

5.3. Random Forest

Random forest (RF) algorithms grow multiple decision trees, which are merged together for a more accurate prediction. The logic behind the random forest model is that multiple uncorrelated models (the individual decision trees) perform better as a group. When using the random forest for classification, each tree gives a classification or “vote”; the forest chooses the classification with the majority “votes”.

5.4. Naïve Bayes

Naïve Bayes (NB), which is probabilistic in nature and based on the Bayes Theorem, is commonly used for classification tasks. It is based on two key assumptions: that the features are independent of one another, and that each feature contributes equally to the outcome.

5.5. Logistic Regression

Logistic regression is a statistical model used for binary classification, which predicts the probability of an object belonging to one of two groups based on input variables. It assumes a linear relationship between features and can handle both continuous and discrete variables.

5.6. Gradient-Boosting Trees

Gradient-boosting trees are a popular form of classification algorithms due in part to their ease of interpretation. Gradient-boosting trees utilize a large number of trees with individual nodes coming off each for various dataset segmentations, often starting with the most informative attribute. The gradient aspect stems from putting a higher emphasis on the more accurate trees and associated nodes and bases. When coupled together, this provides an extensive review of large quantities of data with a clear delineation of where the most accurate pathway is through the data [25].

6. Results

The first step was to determine the best set of configuration parameters for Spark in relation to the UWF-ZeekDataFall22 dataset. Since DT is a robust classifier [26], the Spark configuration parameters were tested using the DT classifier. Finally, all six classifiers were run based on the best set of configuration parameters for Spark.

6.1. Determining Spark’s Best Configuration Parameters

To determine Spark’s optimum parameters, using all 18 attributes from the Conn log files of the UWF-ZeekDataFall22 dataset, the DT classifier was used with the following Spark configuration parameters: executor count, executor core count, total executor cores, executor memory, and total executor memory.
Executor count defines the number of executors available for each node [23].
Executor core is the computational power of the CPU. This parameter defines the number of cores available for each executor and the number of concurrent tasks that can be run on each executor [23].
Spark’s executor memory is the amount of memory provided for each executor to complete the task. Defining the executor memory will control the executor heap size and reduce the garbage collection delay [23].
Driver core is the main logic that triggers a Spark job. It is responsible for submitting the job and coordinating the execution of Spark application across the cluster nodes. By default, the number of cores available for a driver program is 1. Using the –driver-cores parameter, we can set up the number of Spark driver cores required for the Spark application [23].
Driver memory is the amount of memory allocated to the driver and basically depends on the number of times we retrieve the data and ranges from 2 GB to 4 GB [23].
Shuffle partition configures the number of partitions to be allocated for shuffling the data while performing RDD operations, such as joins and aggregations. This parameter has no effect when the Spark application has only DataFrame operations [23].
Table 10 shows the impact of varying these parameters on the binning time, training time, and testing time (in seconds).
As can be seen from Table 10, Test 19’s Spark parameters performed the best in terms of binning time, training time, and testing time. Table 11 shows the effect of varying the driver cores, driver memory, and shuffle partitions on the total time taken by the decision tree classifier. The total time is the sum of binning, training, and testing time. Here, too, Test 19 performed the best.
Figure 2 and Figure 3 show the effects of total executor memory (in GBs) and total number of executor cores on the total processing time (in seconds), respectively. According to Figure 2, there does not appear to be a strong correlation between total executor memory and total processing time. In most runs, increasing the executor memory from 20 to 120 did not reduce the processing time significantly. However, as per Figure 3, the total processing time decreased when the total number of executor cores increased from 10 to 96. The total number of executor cores can be obtained by multiplying the total number of executors with the number of cores per executor.
Figure 4 shows how the total number of executor cores and the number of executors affect the processing time. The size of a bubble indicates the processing time; the smaller the bubble size, the shorter the total processing time. As the total number of executor cores is increased, the bubbles become smaller. The grey bubbles, for example, are large and closer to the x-axis when the number of executor cores is low, indicating higher processing times. The processing time was the shortest for the red bubble, which contained 96 total executor cores and 12 executors.
According to Figure 5, the processing time increases as the shuffle partitions increase. This could be due to the overhead associated with distributing and collecting data over a network. From Figure 4, it can be seen that the processing time was lowest for the red bubble, with 24 shuffle partitions and 12 executors. Hence, Test 18’s Spark configuration parameters from Table 1 were used for testing the classifiers, since this allows for the use of fewer partitions.

6.2. Analyzing Training Time

The DT classifier was run for 6, 9, 12, and 18 attributes, and the training times were recorded as shown in Figure 6. The six attribute runs were run using the top six attributes from the respective information gain tables, where nine attributes would be the top nine attributes from the respective information gain tables, and so on. There was no significant difference between the number of attributes used with respect to the training time. The training time for resource development was high because of the larger number of records (82,543). The training times for reconnaissance and discovery, however, were low, since there were fewer records.

6.3. Performance of the Machine Learning Classifiers Using UWF-ZeekDataFall22

Six different machine learning classifiers, decision tree, random forest, naïve Bayes, logistic regression, support vector machines, and gradient-boosting trees, were used for classifying the data. Testing was performed on two datasets, UWF-ZeekDataFall22 and UWF-ZeekData22, both available in [12]. The results of the binary classification are presented in Table 12, Table 13 and Table 14. The results of the multinomial classification are presented in Table 15. The multinomial classification was only performed for the UWF-ZeekDataFall22 dataset.

6.3.1. Evaluation Metrics

Classifier performance was measured in terms of accuracy; precision; recall; false positive rate; f-measure; preprocessing time, in terms of binning time; training time; and testing time. Below are some of the commonly used terminologies used in the evaluation metrics:
True Positive (TP): Number of correct positive predictions.
True Negative (TN): Number of correct negative predictions.
False Positive (FP): Number of negative predictions which were incorrectly classified as positive.
False Negative (FN): Number of positive predictions which were incorrectly classified as negative.
Accuracy: Accuracy is the number of correct predictions divided by the total number of predictions:
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision: Precision is the number of correct positive predictions divided by the total number of positive predictions:
Precision = TP/(TP + FP)
Recall: Recall can be also defined as the true positive rate. It can be calculated as the number of true positives divided by all real positives:
Recall = TP/(TP + FN)
False Positive Rate (FPR): The FPR is the total number of incorrect positive predictions divided by all real negatives:
FPR = FP/(TN + FP)
F-Measure: The F-Measure is the harmonic mean of precision and recall:
F-Measure= 2 * (Precision * Recall)/(Precision + Recall)
Training Time: The time taken by a classifier to train, in seconds.
Testing Time: The time taken by the classifier to perform the predictions, in seconds.

6.3.2. Machine Learning Classifier Results for Binary Classification

Table 12, Table 13 and Table 14 present the binary classification results for all six classifiers using the UWF-ZeekDataFall22 dataset for the reconnaissance, discovery, and resource development tactics, respectively. Binary classification implies using a combination of the respective tactic data with the benign data.

The Reconnaissance Tactic

The results of the binary classification for the reconnaissance tactic using the six different classifiers, LR, NB, RF, GBT, DT, and SVM, are presented in Table 12.
From Table 12, it can be noted that RF, GBT, and DT performed the best, for all combinations of attributes, for accuracy, precision, recall, and the F-measure (all at 100%), although the results of accuracy, precision, and recall were also very close for NB and LR. Except for 18 attributes, SVM did not perform as well, especially for 12 attributes. SVM performed quite poorly in terms of precision, recall, and the F-measure too.
The false positive rates were also relatively high for SVM, especially for 12 attributes, as shown in Figure 7. Again, RF, GBT, and DT have a 0.00% FPR, and LR and NB have a FPR slightly higher than 0.00% (as shown in Table 12 and Figure 7).
In terms of training time, as shown in Figure 8 and Table 12, SVM had a higher overall training time; the second highest was GBT followed by RF and then DT. NB had the lowest training time.
Figure 9 shows the averages for all algorithms by the number of features for the reconnaissance tactic. From Figure 9, the number of features does not seem to have an effect on accuracy, precision, recall, or the F-measure.

The Discovery Tactic

Results of the binary classification for the discovery tactic using the six different classifiers, LR, NB, RF, GBT, DT, and SVM, are presented in Table 13.
From Table 13, it can be noted that GBT and DT performed the best, for all combinations of attributes, for accuracy, precision, recall, and the F-measure, although the results of accuracy, precision, and recall were also very close for RF and LR followed by NB. SVM performed somewhat randomly. SVM performed almost as well for 18 attributes, moderately well for nine attributes, and the worst for 12 attributes.
In terms of recall, all algorithms performed well for all number of attributes except for SVM. In fact, SVM with 12 attributes performed very poorly.
In terms of FPRs, GBT and DT had a rate of 0.00% for all attribute combinations, as shown in Figure 10 as well as in Table 13. Other FPRs were also relatively low except for SVM with six attributes, which had a FPR of 11%.
In terms of training time, SVM had the highest overall training time followed by GBT and then RF, as shown in Figure 11 as well as in Table 13. NB had the lowest training time.
Figure 12 shows the averages for all algorithms according to the number of features for the discovery tactic. From Figure 12, it can be noted that six attributes were relatively low in terms of average precision, and 12 and 18 attributes were relatively low in terms of average recall and the F-measure, respectively.

The Resource Development Tactic

The results of the binary classification for the resource development tactic using the six different classifiers, LR, NB, RF, GBT, DT, and SVM, are presented in Table 14.
Both GBT and DT had 100% accuracy, precision, and recall when tested with the UWF-ZeekDataFall22 dataset using the resource development tactic. In terms of accuracy, the results of the LR, NB, and RF were closely behind GBT and DT, but SVM with nine and 12 attributes performed very poorly compared to the rest. The same pattern can be seen for recall, and the FPRs are shown in Figure 13.
In terms of training time, as can be seen in Figure 14, SVM followed, by GBT, showed the poorest level of performance. Here, NB had the overall lowest training time.
Figure 15 shows the average accuracy, precision, recall, and F-measure of all the algorithms with 6, 9, 12, and 18 attributes. It appears that the averages were relatively higher for six attributes, and the other attributes performed almost the same.

6.3.3. Machine Learning Classifier Results for Multinomial Classification Using the UWF-ZeekDataFall22 Dataset

For multi-classification, the data from the three tactics—reconnaissance, discovery, and resource development—and benign data were used. For multi-classification, weighted accuracy, precision, recall, and the F-measure were used due to the imbalanced nature of the data.
Table 15 presents the multi-classification results for the four classifiers, LR, NB, RF, and DT, using the UWF-ZeekDataFall22 dataset with the four different attribute combinations. Currently, the pyspark MLIb does not support multi-class classification for SVM and GBT and, therefore, these algorithms were not included in the testing.
DT and RF had the highest performance in terms of the weighted precision, weighted recall, weighted F-measure, and weighted accuracy as well as FPR. However, the training time was the highest for the RF classifier. NB had the poorest performance in terms of weighted precision, weighted recall, weighted F-measure, and weighted accuracy.
Binary classification outperformed the multi-classification for all classifiers. Naïve Bayes exhibited very poor performance in the multi-class scenario but achieved an average accuracy of 99.3% with binary classification. The RF and DT classifiers demonstrated good performance in both binary and multi-class classification, making them suitable for both types of classification.

6.3.4. Comparing the UWF-ZeekDataFall22 and UWF-ZeekData22 Datasets

This section compares the results of two datasets, UWF-ZeekDataFall22 and UWF-ZeekData22. The results of the UWF-ZeekData22 dataset can be found in a previously published study [19]. Figure 16 and Figure 17 compare the average binary classification accuracy and precision, respectively, for the reconnaissance tactic using the two different datasets. Figure 18 and Figure 19 compare the average binary classification accuracy and precision, respectively, for the discovery tactic using the two different datasets. These are the averages for all four attribute combinations (6, 9, 12, and 18) for all the classifiers.
Although the results were very close for both reconnaissance and discovery, the average accuracy and precision of all the classifiers except SVM were slightly better using the new UWF-ZeekDataFall22 dataset. RF and DT performed the best for both datasets.

6.4. Limitations of This Study

One major limitation of this study is that, since other tactic datasets are not available, only two datasets could be compared.

7. Summary of Key Findings

Several conclusions can be drawn from this paper. First, in terms of testing Spark’s configuration parameters, there does not appear to be a strong correlation between total executor memory and total process time. Also, the processing time increases as the shuffle partitions increase. This would be due to the overhead associated with distributing and collecting data over a network. From Figure 4, it can be seen that the processing time was lowest with 24 shuffle partitions and 12 executors.
In terms of classification results, it can be noted that, overall, the binary classifiers performed better than the multinomial classifiers, though DT and RF performed almost equally well in the binary as well as the multi-class classifications. For both reconnaissance and discovery, the average accuracy and precision of all the classifiers except SVM was better with the new UWF-ZeekDataFall22 dataset. On average, the multinomial classifiers took longer than the binary classifiers. Finally, overall, the tree-based classifiers performed best in both the binary as well as the multi-class classifications.
Of the three tree-based classifiers, random forest, decision tree, and gradient-boosting tree, the decision tree had the shortest training time for binary classification as well as for multinomial classification for all tactics, and this was the case for all four combinations of attributes. Hence, taking training time into consideration, decision trees can be considered to be the most efficient classifier using Apache Spark, for the UWF-ZeekDataFall22 dataset, labeled as per the MITRE ATT&CK framework.
Elaborating on the high results obtained by the tree-based classifiers, which were at 100% or close to 100%, these results could also be attributed to the successful preprocessing, which reduced the noise in the data. However, it can be noted that SVM performed randomly compared to the other classifiers. Using the UWF-ZeekDataFall22 dataset, for all tactics, SVM performed almost as well for 18 attributes as the other classifiers, and moderately well for 6 or 9 attributes, but performed very poorly for 12 attributes. This needs to be examined further in a future study.
In terms of the number of attributes that performed the best, the top six attributes (as per the information gain measure) performed the best using the resource development tactic. The results using the discovery tactic were not conclusive since both 9 and 18 attributes seemed to perform equally well, in which case 9 attributes would be selected for classification. For the reconnaissance tactic, 6 as well as 18 attributes seemed to perform almost equally well, in which case 6 attributes would be selected for the classification.

8. Conclusions

Data labeled as per the MITRE ATT&CK framework, specifically reconnaissance, discovery, and resource development, are classifiable from Zeek Conn logs using machine learners, specifically decision tree, random forest, gradient-boosting tree, logistic regression, SVM, and naïve Bayes algorithms, although some classifiers perform better than others. They are classifiable using both binary as well a multi-classification framework. Since we obtained the best results with tree-based classifiers (DT, RF, and GBT), our results are also congruent with previous work using network attack data, which showns that RF [12,17] or DT performed best [9,13]. If we take training time into consideration, decision trees can be considered the most efficient classifier for the UWF-ZeekDataFall22 dataset, labeled using the MITRE ATT&CK framework.

Author Contributions

Conceptualized, S.S.B., D.M. and S.C.B.; methodology, S.S.B., D.M., S.C.B., P.M. and N.U.; software, P.M. and N.U.; validation, S.S.B., D.M., S.C.B. and M.E.; formal analysis, S.S.B., D.M., S.C.B., P.M. and N.U.; investigation, S.S.B., D.M., S.C.B., P.M. and N.U.; data curation, R.P. and T.M.; writing—original draft preparation, S.S.B., D.M., S.C.B., P.M., N.U., M.E. and S.P.; writing—review and editing, S.S.B., D.M., S.C.B., P.M., N.U., M.E. and S.P.; visualization, P.M. and N.U.; supervision, S.S.B., D.M. and S.C.B.; project administration, S.S.B., D.M. and S.C.B.; funding acquisition, S.S.B., D.M. and S.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Centers of Academic Excellence in Cybersecurity, 2021 NCAE-C-002: Cyber Research Innovation Grant Program, Grant Number H98230-21-1-0170.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. MITRE ATT&CK|MITRE ATT&CK®. Available online: https://attack.mitre.org/# (accessed on 8 August 2023).
  2. MITRE ATT&CK. Reconnaissance, Tactic TA0043—Enterprise|MITRE ATT&CK®. 2022. Available online: https://attack.mitre.org/tactics/TA0043/ (accessed on 5 September 2023).
  3. MITRE ATT&CK. Discovery, Tactic TA0007—Enterprise|MITRE ATT&CK®. 2022. Available online: https://attack.mitre.org/tactics/TA0007/ (accessed on 5 September 2023).
  4. MITRE ATT&CK. Resource Development, Tactic TA0042—Enterprise|MITRE ATT&CK®. 2022. Available online: https://attack.mitre.org/tactics/TA0042/ (accessed on 8 August 2023).
  5. University of West Florida. 2022. Available online: https://datasets.uwf.edu/ (accessed on 2 September 2023).
  6. Karun, A.K.; Chitharanjan, K. A review on Hadoop—HDFS infrastructure extensions. In Proceedings of the 2013 IEEE Conference on Information and Communication Technologies, Thuckalay, India, 11–12 April 2013. [Google Scholar] [CrossRef]
  7. Guller, M. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis; Apress: New York, NY, USA, 2015. [Google Scholar]
  8. About Zeek. Available online: https://docs.zeek.org/en/master/about.html (accessed on 2 August 2023).
  9. Mebawondu, O.J.; Popoọla, O.S.; Ayogu, I.I.; Ugwu, C.C.; Adetunmbi, A.O. Network Intrusion Detection Models based on Naives Bayes and C4.5 Algorithms. In Proceedings of the 2022 IEEE Nigeria 4th International Conference on Disruptive Technologies for Sustainable Development (NIGERCON), Lagos, Nigeria, 5–7 April 2022; pp. 1–5. [Google Scholar] [CrossRef]
  10. Panda, M.; Patra, M.R. Network Intrusion Detection Using Naïve Bayes. IJCSNS Int. J. Comput. Sci. Netw. Secur. 2007, 7, 258–263. [Google Scholar]
  11. Tufail, S.; Batool, S.; Sarwat, A.I. A Comparative Study of Binary Class Logistic Regression and Shallow Neural Network for DDoS Attack Prediction. In Proceedings of the SoutheastCon 2022, Mobile, AL, USA, 26 March–3 April 2022; pp. 310–315. [Google Scholar] [CrossRef]
  12. Kejriwal, S.; Patadia, D.; Dagli, S.; Tawde, P. Machine Learning Based Intrusion Detection. In Proceedings of the 2022 IEEE Fourth International Conference on Advances in Electronics, Computers and Communications (ICAECC), Bengaluru, India, 10–11 January 2022; pp. 1–5. [Google Scholar] [CrossRef]
  13. Disha, R.A.; Waheed, S. A Comparative study of machine learning models for Network Intrusion Detection System using UNSW-NB 15 dataset. In Proceedings of the 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), Khulna, Bangladesh, 14–16 September 2021; pp. 1–5. [Google Scholar] [CrossRef]
  14. Swamy, K.V.R.; Lakshmi, K.V. Network Intrusion Detection Using Improved Decision Tree Algorithm. IJCSIT Int. J. Comput. Sci. Inf. Technol. 2012, 3, 4971–4975. [Google Scholar]
  15. Mulay, S.; Devale, P.R.; Garje, G. Intrusion Detection System Using Support Vector Machine and Decision Tree. Int. J. Comput. Appl. 2010, 3, 10–16. [Google Scholar] [CrossRef]
  16. Jha, J.; Ragha, L. Intrusion Detection System using Support Vector Machine. In Proceedings of the IJAIS Proceedings on International Conference and Workshop on Advanced Computing (ICWAC), Mumbai, India, 22–23 February 2013; pp. 25–30. [Google Scholar]
  17. Belouch, M.; El Hadaj, S.; Idhammad, M. Performance evaluation of intrusion detection based on machine learning using Apache Spark. Procedia Comput. Sci. 2018, 127, 1–6. [Google Scholar] [CrossRef]
  18. Bagui, S.S.; Mink, D.; Bagui, S.C.; Ghosh, T.; Plenkers, R.; McElroy, T.; Dulaney, S.; Shabanali, S. Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework. Data 2023, 8, 18. [Google Scholar] [CrossRef]
  19. Bagui, S.S.; Mink, D.; Bagui, S.C.; Ghosh, T.; McElroy, T.; Paredes, E.; Khasnavis, N.; Plenkers, R. Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework. Sensors 2022, 22, 7999. [Google Scholar] [CrossRef] [PubMed]
  20. Deckert, A.C.; Kummerfeld, E. Investigating the Effect of Binning on Casual Discovery. 2022. Available online: https://arxiv.org/pdf/2202.11789.pdf (accessed on 10 September 2023).
  21. Microsoft. TCP/IP Addressing and Subnetting—Windows Client|Microsoft Docs. 2022. Available online: https://docs.microsoft.com/en-us/troubleshoot/windows-client/networking/tcpip-addressing-and-subnetting (accessed on 5 April 2023).
  22. IANA. Service Name and Transport Protocol Port Number Registry. 2022. Available online: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml (accessed on 5 April 2023).
  23. Apache Spark Configuration—Spark 3.3.0 Documentation. 2022. Available online: https://spark.apache.org/docs/latest/configuration.html (accessed on 5 September 2023).
  24. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2022. [Google Scholar]
  25. Piryonesi, S.; El-Diraby, T. Data Analytics in Asset Management: Cost-Effective Prediction of the Pavement Condition. J. Infrastruct. Systems. 2020, 26, 04019036. [Google Scholar] [CrossRef]
  26. Hastie, T.J.; Tibshirani, R.J.; Friedman, J.H. The Elements of Statistical Learning: Data Mining Inference and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009; ISBN 978-0-387-84857-0. [Google Scholar]
Figure 1. Overview of the experimentation on the UWF-ZeekDataFall22 dataset.
Figure 1. Overview of the experimentation on the UWF-ZeekDataFall22 dataset.
Electronics 12 05039 g001
Figure 2. UWF-ZeekDataFall22: Total executor memory vs. total processing time.
Figure 2. UWF-ZeekDataFall22: Total executor memory vs. total processing time.
Electronics 12 05039 g002
Figure 3. UWF-ZeekDataFall22: Total executor cores vs. total processing time.
Figure 3. UWF-ZeekDataFall22: Total executor cores vs. total processing time.
Electronics 12 05039 g003
Figure 4. UWF-ZeekDataFall22: Processing times varying the numbers of executors and executor cores.
Figure 4. UWF-ZeekDataFall22: Processing times varying the numbers of executors and executor cores.
Electronics 12 05039 g004
Figure 5. UWF-ZeekDataFall22: Processing time varying the number of executors and shuffle partitions.
Figure 5. UWF-ZeekDataFall22: Processing time varying the number of executors and shuffle partitions.
Electronics 12 05039 g005
Figure 6. UWF-ZeekDataFall22: Number of attributes vs. training time.
Figure 6. UWF-ZeekDataFall22: Number of attributes vs. training time.
Electronics 12 05039 g006
Figure 7. UWF-ZeekDataFall22: Reconnaissance: FPR of algorithms according to the number of features used.
Figure 7. UWF-ZeekDataFall22: Reconnaissance: FPR of algorithms according to the number of features used.
Electronics 12 05039 g007
Figure 8. UWF-ZeekDataFall22: Reconnaissance: Training time of algorithms according to the number of features used.
Figure 8. UWF-ZeekDataFall22: Reconnaissance: Training time of algorithms according to the number of features used.
Electronics 12 05039 g008
Figure 9. UWF-ZeekDataFall22: Reconnaissance: Averages for all algorithms according to the number of features.
Figure 9. UWF-ZeekDataFall22: Reconnaissance: Averages for all algorithms according to the number of features.
Electronics 12 05039 g009
Figure 10. UWF-ZeekDataFall22: Discovery: FPR of algorithms according to the number of features used.
Figure 10. UWF-ZeekDataFall22: Discovery: FPR of algorithms according to the number of features used.
Electronics 12 05039 g010
Figure 11. UWF-ZeekDataFall22: Discovery: Training time of algorithms according to the number of features used.
Figure 11. UWF-ZeekDataFall22: Discovery: Training time of algorithms according to the number of features used.
Electronics 12 05039 g011
Figure 12. UWF-ZeekDataFall22: Discovery: Averages for all algorithms according to the number of features.
Figure 12. UWF-ZeekDataFall22: Discovery: Averages for all algorithms according to the number of features.
Electronics 12 05039 g012
Figure 13. UWF-ZeekDataFall22: Resource development: FPRs of algorithms according to the number of features used.
Figure 13. UWF-ZeekDataFall22: Resource development: FPRs of algorithms according to the number of features used.
Electronics 12 05039 g013
Figure 14. UWF-ZeekDataFall22: Resource development: Training time of algorithms according to the number of features used.
Figure 14. UWF-ZeekDataFall22: Resource development: Training time of algorithms according to the number of features used.
Electronics 12 05039 g014
Figure 15. UWF-ZeekDataFall22: Resource development: Averages for all algorithms according to the number of features.
Figure 15. UWF-ZeekDataFall22: Resource development: Averages for all algorithms according to the number of features.
Electronics 12 05039 g015
Figure 16. Reconnaissance: Average accuracy of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Figure 16. Reconnaissance: Average accuracy of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Electronics 12 05039 g016
Figure 17. Reconnaissance: Average precision of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Figure 17. Reconnaissance: Average precision of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Electronics 12 05039 g017
Figure 18. Discovery: Average accuracy of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Figure 18. Discovery: Average accuracy of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Electronics 12 05039 g018
Figure 19. Discovery: Average precision of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Figure 19. Discovery: Average precision of binary classification: UWF-ZeekDataFall22 vs. UWF-ZeekData22.
Electronics 12 05039 g019
Table 1. UWF-ZeekDataFall22 MITRE ATT&CK tactic counts.
Table 1. UWF-ZeekDataFall22 MITRE ATT&CK tactic counts.
TacticCount
None350,339
Resource Development275,471
Reconnaissance51,492
Discovery16,819
Privilege Escalation3066
Defense Evasion3064
Execution30
Initial Access19
Command and Control17
Lateral Movement11
Persistence10
Collection1
Credential Access1
Table 2. Binning IP addresses in the UWF-ZeekDataFall22 dataset.
Table 2. Binning IP addresses in the UWF-ZeekDataFall22 dataset.
ClassificationOctet ValuesBin Numberdest_ip Countsrc_ip Count
Non-classNull or NA000
Class A0–12713,072,9666
Class B<=1912608,3483,719,960
Class C<=223333,4820
Class D<=239451680
Class E<=254500
OtherIf none of the above620
Table 3. Binning port numbers in the UWF-ZeekDataFall22 dataset.
Table 3. Binning port numbers in the UWF-ZeekDataFall22 dataset.
ClassificationPort NumbersBin Numberdest_port Countsrc_port Count
Well-known ports0–102313,505,38573,900
Registered ports<=49,1512205,6152,095,923
Dynamic/private ports<=65,535365,8121,606,989
OtherIf none of the above400
Table 4. Bins generated for Boolean and nominal attributes for the UWF-ZeekDataFall22 dataset.
Table 4. Bins generated for Boolean and nominal attributes for the UWF-ZeekDataFall22 dataset.
AttributesCount of Integer Bins
local_orig1
local_resp1
Protocol2
conn_state9
history32
service11
Table 5. Number of bins for continuous valued attributes in the UWF-ZeekDataFall22 dataset.
Table 5. Number of bins for continuous valued attributes in the UWF-ZeekDataFall22 dataset.
AttributesCount of Integer Bins
duration6
orig_bytes6
orig_pkts2
orig_ip_bytes6
resp_bytes2
resp_pkts2
resp_ip_bytes2
missed_bytes2
Table 6. Information gain values for the full dataset, UWF-ZeekDataFall22.
Table 6. Information gain values for the full dataset, UWF-ZeekDataFall22.
Full Dataset
AttributeInfo_Gain
history0.3769642
protocol0.36886144
dest_port0.34085092
service0.3125681
orig_bytes0.29707623
duration0.29682565
resp_bytes0.29655516
orig_ip_bytes0.29552263
orig_pkts0.29124352
dest_ip0.20953701
local_resp0.19356702
conn_state0.022014592
src_port0.003268247
src_ip0.001607323
local_orig5.02 × 10−4
missed_bytes0
resp_ip_bytes0
resp_pkts0
Table 7. UWF-ZeekDataFall22: Information gain values for resource development.
Table 7. UWF-ZeekDataFall22: Information gain values for resource development.
Resource Development
AttributeInfo_Gain
history0.8810788
protocol0.8810783
dest_port0.8239399
orig_ip_bytes0.823516
service0.81280994
orig_bytes0.7881991
duration0.7871984
resp_bytes0.7861375
orig_pkts0.778875
dest_ip0.6267121
local_resp0.59013766
conn_state0.032609735
src_port0.010610466
src_ip0.005892113
local_orig0.001804289
missed_bytes0
resp_ip_bytes0
resp_pkts0
Table 8. UWF-ZeekDataFall22: Information gain values for discovery.
Table 8. UWF-ZeekDataFall22: Information gain values for discovery.
Discovery
AttributeInfo_Gain
history0.88164663
protocol0.8805963
conn_state0.8794843
orig_ip_bytes0.8471305
service0.7876045
orig_bytes0.7738234
duration0.773805
resp_bytes0.7718542
orig_pkts0.76303893
dest_ip0.623519
dest_port0.60750306
local_resp0.57914233
src_port0.19649306
src_ip0.00906063
local_orig0.0017338
missed_bytes0
resp_ip_bytes0
resp_pkts0
Table 9. UWF-ZeekDataFall22: Information gain values for reconnaissance.
Table 9. UWF-ZeekDataFall22: Information gain values for reconnaissance.
Reconnaissance
AttributeInfo_Gain
protocol0.8821778
history0.8709951
conn_state0.8681412
orig_ip_bytes0.82774657
service0.8121441
orig_bytes0.7853897
duration0.78459907
resp_bytes0.78186846
orig_pkts0.7734535
dest_ip0.6330686
dest_port0.61232007
local_resp0.59225047
src_port0.011235193
src_ip0.006443019
local_orig0.001454109
resp_ip_bytes0
resp_pkts0
missed_bytes0
Table 10. UWF-ZeekDataFall22: Binning, training, and testing times for various Spark parameters.
Table 10. UWF-ZeekDataFall22: Binning, training, and testing times for various Spark parameters.
Test IDExecutor CountExecutor Core CountTotal Executor CoresExecutor MemoryTotal Executor Memory (GB)Binning Time (s)Training Time (s)Testing Time (s)
1521052569.516.40.091
25210105069.115.60.088
352102010071.716.20.107
4542052559.1115.130.151
55420105058.3315.160.113
654202010059.914.910.126
71022055063.9615.770.109
8102201010063.1815.940.105
91044055054.1513.150.121
10104401010052.5612.7880.125
1120240510057.313.840.134
1220480510051.914.70.123
13108801010049.6812.090.129
14128961012050.9612.540.122
15128961012049.4712.540.144
16128961012047.812.60.15
1724496512050.714.50.134
18616962012046.4514.870.133
19128961012028.527.940.083
2011616202044.7718.250.131
2121632204046.9816.570.172
2231648206038.4816.70.134
Table 11. UWF-ZeekDataFall22: Total time for additional Spark configuration parameters.
Table 11. UWF-ZeekDataFall22: Total time for additional Spark configuration parameters.
Test IDExecutor CountCores Per ExecutorMemory per ExecutorTotal Exec. CoresTotal Executor Memory (GB)Total Time (s)Shuffle PartitionsDriver CoresDriver Memory (GB)
1525102585.93200210
25210105084.8200210
352201010088.09200210
4545202574.41200210
55410205073.61200210
654202010074.98200210
71025205079.84200210
8102102010079.24200210
91045405067.43200210
10104104010065.44200210
1120254010071.29200210
1220458010066.7200210
13108108010061.91200210
14128109612063.62200210
15128109612062.1672210
16128109612060.612210
1724459612065.4724210
18616209612061.466210
19128109612039.3324210
2011620162053.11210
2121620324065.172210
2231620486054.793210
23108108010040.37200210
Table 12. UWF-ZeekDataFall22: Binary classification results for the reconnaissance tactic.
Table 12. UWF-ZeekDataFall22: Binary classification results for the reconnaissance tactic.
ML Algo# of Attrs.Accuracy
(%)
Precision
(%)
Recall
(%)
F-Measure (%)FPR
(%)
Training Time (s)Testing Time(s)
LR699.8999.7499.9199.820.113.710.1
LR999.8999.7499.9199.820.113.710.1
LR1299.8999.7499.9199.820.113.240.055
LR1899.7799.7399.4799.600.1140.098
NB699.3396.9399.6598.270.741.420.044
NB999.3396.9399.6598.270.741.450.09
NB1299.3396.9399.6598.270.741.340.086
NB1899.3396.9399.6598.270.741.60.089
RF6100.00100.00100.00100.000.009.8240.103
RF9100.00100.00100.00100.000.009.770.087
RF12100.00100.00100.00100.000.009.7590.099
RF18100.00100.00100.00100.000.002.0390.08651
GBT6100.00100.00100.00100.000.0017.380.1101
GBT9100.00100.00100.00100.000.0017.590.113
GBT12100.00100.00100.00100.000.0017.390.108
GBT18100.00100.00100.00100.000.0011.820.112
DT6100.00100.00100.00100.000.005.180.13
DT9100.00100.00100.00100.000.005.50.09
DT12100.00100.00100.00100.000.005.310.14
DT18100.00100.00100.00100.000.005.750.14
SVM672.9682.9834.9649.209.8713.370.059
SVM980.880.000.000.000.0033.440.059
SVM129.970.000.000.0087.6614.640.069
SVM1899.99100.0099.9199.950.0041.870.111
Table 13. UWF-ZeekDataFall22: Binary classification results for the discovery tactic.
Table 13. UWF-ZeekDataFall22: Binary classification results for the discovery tactic.
ML Algo.# of Attrs.Accuracy
(%)
Precision
(%)
Recall
(%)
F-Measure
(%)
FPR
(%)
Training Time (s)Testing Time (s)
LR699.9099.64100.0099.820.143.350.093
LR999.9099.64100.0099.820.143.540.092
LR1299.9099.64100.0099.820.143.230.048
LR1899.9099.64100.0099.820.143.420.093
NB699.6696100.0099.120.411.380.061
NB999.6698.26100.0099.120.411.380.089
NB1298.9094.99100.0097.421.241.410.093
NB1898.9994.96100.0097.421.241.420.0875
RF699.9099.80100.0099.900.319.3210.099
RF999.9099.80100.0099.900.319.4910.1
RF1299.9099.80100.0099.900.319.45520.102
RF1899.9099.80100.0099.900.311.8690.0985
GBT6100.00100.00100.00100.000.0014.030.1066
GBT9100.00100.00100.00100.000.0014.4590.1016
GBT12100.00100.00100.00100.000.0014.4070.103
GBT18100.00100.00100.00100.000.008.5990.107
DT6100.00100.00100.00100.000.005.140.12
DT9100.00100.00100.00100.000.005.460.13
DT12100.00100.00100.00100.000.005.250.09
DT18100.00100.00100.00100.000.005.120.07
SVM690.8767.54100.0080.6211.0027.890.06
SVM996.6499.1583.0390.380.1614.20.06
SVM1282.09100.005.6510.700.0015.490.067
SVM18100.00100.00100.00100.000.0039.420.08
Table 14. UWF-ZeekDataFall22: Binary classification results for the resource development tactic.
Table 14. UWF-ZeekDataFall22: Binary classification results for the resource development tactic.
ML Algo.# of Attrs.Accuracy
(%)
Precision
(%)
Recall
(%)
F-Measure
(%)
FPR
(%)
Training TimeTesting Time
LR699.8799.6199.9699.780.178.410.118
LR999.8799.6199.9699.780.178.90.118
LR1299.8799.6199.9699.780.178.770.068
LR1899.8599.6199.9099.750.178.930.136
NB699.3396.7899.9798.350.835.170.053
NB999.3396.7899.9698.350.835.240.096
NB1299.3499.9199.9099.380.802.350.089
NB1898.5593.3699.9096.521.785.360.104
RF699.9999.99100.0099.990.0012.690.17
RF999.9999.99100.0099.990.0013.160.12
RF1299.9999.99100.0099.990.0016.10.0921
RF1899.9999.99100.0099.990.009.950.108
GBT6100.00100.00100.00100.000.0024.340.114
GBT9100.00100.00100.00100.000.0023.680.113
GBT12100.00100.00100.00100.000.0027.040.125
GBT18100.00100.00100.00100.000.0023.330.13
DT6100.00100.00100.00100.000.008.390.12
DT9100.00100.00100.00100.000.008.660.12
DT12100.00100.00100.00100.000.008.730.078
DT18100.00100.00100.00100.000.008.810.08
SVM699.8799.3699.9999.6816.0029.230.06
SVM936.363.407.004.0054.5223.340.035
SVM1236.161.703.792.3054.7827.20.035
SVM1892.1271.7999.9983.579.8627.870.034
Table 15. UWF-ZeekDataFall22: Multinomial classification results.
Table 15. UWF-ZeekDataFall22: Multinomial classification results.
ML Algo.No. of Attrs.Weighted Precision
(%)
Weighted Recall
(%)
Weighted F-Measure
(%)
Weighted Accuracy
(%)
FPR
(%)
Training
Time
Testing
Time
LR686.3692.9389.5292.9592.9315.80.11
LR986.3692.9389.5292.9592.9315.950.1
LR1286.3692.9389.5292.9592.9316.30.11
LR1886.3692.9389.5292.9592.9316.550.1
NB67.620.62351.1528.8292.189.020.12
NB97.580.6221.158.8292.1811.110.144
NB127.720.6721.29.891.111.270.13
NB187.720.6721.29.70691.111.160.136
RF699.9499.9799.9599.990.001235.90.099
RF999.9499.9799.9599.990.001236.60.101
RF1299.9499.9799.9599.990.0012370.1012
RF1899.9499.9799.9599.990.001222.9370.1
DT699.9599.9799.9699.990.080413.450.104
DT999.9599.9799.9699.990.067213.910.12
DT1299.9599.9799.9699.990.073712.290.143
DT1899.9599.9799.9699.990.073813.610.128
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bagui, S.S.; Mink, D.; Bagui, S.C.; Madhyala, P.; Uppal, N.; McElroy, T.; Plenkers, R.; Elam, M.; Prayaga, S. Introducing the UWF-ZeekDataFall22 Dataset to Classify Attack Tactics from Zeek Conn Logs Using Spark’s Machine Learning in a Big Data Framework. Electronics 2023, 12, 5039. https://doi.org/10.3390/electronics12245039

AMA Style

Bagui SS, Mink D, Bagui SC, Madhyala P, Uppal N, McElroy T, Plenkers R, Elam M, Prayaga S. Introducing the UWF-ZeekDataFall22 Dataset to Classify Attack Tactics from Zeek Conn Logs Using Spark’s Machine Learning in a Big Data Framework. Electronics. 2023; 12(24):5039. https://doi.org/10.3390/electronics12245039

Chicago/Turabian Style

Bagui, Sikha S., Dustin Mink, Subhash C. Bagui, Pooja Madhyala, Neha Uppal, Tom McElroy, Russell Plenkers, Marshall Elam, and Swathi Prayaga. 2023. "Introducing the UWF-ZeekDataFall22 Dataset to Classify Attack Tactics from Zeek Conn Logs Using Spark’s Machine Learning in a Big Data Framework" Electronics 12, no. 24: 5039. https://doi.org/10.3390/electronics12245039

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop