Next Article in Journal
Foreign Object Detection for Electric Vehicle Wireless Charging
Next Article in Special Issue
Measuring the Impact of Accurate Feature Selection on the Performance of RBM in Comparison to State of the Art Machine Learning Algorithms
Previous Article in Journal
Flow-Shop Predictive Modeling for Multi-Automated Guided Vehicles Scheduling in Smart Spinning Cyber–Physical Production Systems
Previous Article in Special Issue
A Review of Performance, Energy and Privacy of Intrusion Detection Systems for IoT
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

LITNET-2020: An Annotated Real-World Network Flow Dataset for Network Intrusion Detection

by
Robertas Damasevicius
*,
Algimantas Venckauskas
,
Sarunas Grigaliunas
,
Jevgenijus Toldinas
,
Nerijus Morkevicius
,
Tautvydas Aleliunas
and
Paulius Smuikys
Faculty of Informatics, Kaunas University of Technology, 51386 Kaunas, Lithuania
*
Author to whom correspondence should be addressed.
Electronics 2020, 9(5), 800; https://doi.org/10.3390/electronics9050800
Submission received: 14 April 2020 / Revised: 9 May 2020 / Accepted: 11 May 2020 / Published: 13 May 2020

Abstract

:
Network intrusion detection is one of the main problems in ensuring the security of modern computer networks, Wireless Sensor Networks (WSN), and the Internet-of-Things (IoT). In order to develop efficient network-intrusion-detection methods, realistic and up-to-date network flow datasets are required. Despite several recent efforts, there is still a lack of real-world network-based datasets which can capture modern network traffic cases and provide examples of many different types of network attacks and intrusions. To alleviate this need, we present LITNET-2020, a new annotated network benchmark dataset obtained from the real-world academic network. The dataset presents real-world examples of normal and under-attack network traffic. We describe and analyze 85 network flow features of the dataset and 12 attack types. We present the analysis of the dataset features by using statistical analysis and clustering methods. Our results show that the proposed feature set can be effectively used to identify different attack classes in the dataset. The presented network dataset is made freely available for research purposes.

1. Introduction

Network attacks are a set of network traffic events which are aimed at undermining the availability, authority, confidentiality, integrity, and other critical properties of networked computer systems [1]. Various types of cyber-attacks, such as IP spoofing [2,3] and Distributed Denial-of-Service (DDoS) flooding attacks [4], have been recognized as a serious security problem. With the increased scope, type and complexity of computer systems and communication networks, as well as with the emergence of new types of distributed computing technologies (such as the Internet-of-Things (IoT), Edge Computing, Fog Computing, etc.), new types of threats continue to arise against usual user requirements for privacy, security, and trust [5,6,7,8]. Despite numerous research studies in intrusion detection [9,10], there is still a large number of successful cyber-attacks registered each year, which affects the daily operation of businesses and governments, but also can cripple critical infrastructures [11], cloud-based IoT environments [12], cloud storage services [13], wireless sensor networks (WSN) [14], wireless body are networks (WBAN) in telemedicine systems [15], wireless ad-hoc networks [16], in-vehicle networks [17], software defined networks (SDN) [18,19], industrial IoT networks [20], cyber-physical systems [21], Internet of Drones (IoD) [22], Industry 4.0 smart factories [23], Internet of Medical Things (IoMT) such as implantable medical devices [24], IoT edge devices [25], vehicular ad hoc networks (VANET) [26], fifth-generation (5G) mobile networks [27], fog computing services [28], and user smartphones [29].
The increasing trend of cyber-attacks demonstrates that current intrusion detection methods still require improvement, and the development of new techniques is necessary for maintaining defense against cyber-attacks. Traditional security technologies and techniques often do note cope with emerging cybersecurity challenges in distributed, dynamic, heterogeneous, and wireless computing environments. As a result, there is a need to create new network security methods and systems to address the cybersecurity threats, as well as to collect, organize, and make openly available real-world heterogenous benchmark datasets [30] which would allow the comparison and validations of the capabilities of such systems in addressing the cybersecurity threats [31].
Network Intrusion Detection Systems (NIDS) are specifically designed to observe network traffic flows in order to identify any attacks and provide protection for the sensitive network infrastructure [32,33]. Each NIDS analyses network flows data, to check if there are any attacks. If there are any suspicious activities, the NIDS trigger an alert. However, usually a huge number of alerts is generated and deluges network security personnel [34]. The produced alerts often include irrelevant and redundant features, which result in higher resource consumption and lower attack-recognition accuracy. Moreover, uninformative features result in the reduction of attack-identification accuracy. Therefore, the dataset should contain only relevant features, allowing for efficient network-attack detection. The example of NIDS is a system for alarm correlation in Early Warning Systems (EWSs) [35] that uses port numbers, source and destination IP addresses, intrusion type, attack severity, and time stamp for alert generation.
The effectiveness of NIDS is evaluated based on their performance to recognize attacks, which requires a network dataset that provides examples of both normal and abnormal network traffic [36]. Old benchmark datasets such as KDDCup’99 [37] and NSL-KDD [38] have been widely used for evaluating the accuracy of network-attack recognition [39,40,41,42]. However, these datasets have become obsolete due to the rapid development of network technologies and the emergence of new cybersecurity threats and network attack types [33,34,35,36,37,38,39,40,41,42,43,44,45]. Many of the network flow benchmark datasets were created artificially, or the data were collected from a highly controlled environment, making them not representative of real-life network flows, while the methods trained on these datasets cannot cope with real-world network attacks [36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]. Moreover, as the real-life network traffic grows enormously, manual data labeling of ever-increasing datasets becomes an infeasible task.
In supervised datasets, each observation in the dataset must have a label assigned to classify known attacks [51]. Unsupervised network-intrusion detection aims at detecting abnormal network node behavior, which can be attributed to a network intrusion or cyber-attack. For example, Casas et al. [52] employed a change detection method to identify the malicious network behavior. The flows are clustered, and the clusters are ranked by their abnormality; and the clusters exceeding the detection threshold are labeled as malicious. Another approach [53] uses entropy and Principal Component Analysis (PCA) to detect anomalies. Despite its successes, the unsupervised approach still requires large and up-to-date network traffic datasets reflecting real-life behavior of network nodes [54]. Umer et al. [55] employed a two-stage model for intrusion detection, in which a one-class Support Vector Machine (SVM) detects malicious flows, while a Self-Organizing Map (SOM) clusters of malicious flows into specific attack clusters. Unsupervised methods can potentially recognize unknown attacks with no prior knowledge, on which the supervised methods (which require datasets with data labeled by attack type) fail miserably [11]. To tackle these problems, Fadllulah et al. [56] extracted features for detecting attacks against encrypted protocols and generated normal-usage behavior profiles. The deviations from these normal profiles, which were identified by using the nonparametric cusum algorithm, were considered to be attacks. Zhang et al. [57] suggested using flow label propagation based on a Nearest Cluster based Classifier (NCC) to label network flows from an unlabeled dataset to convert the problem into supervised learning.
In order to develop new efficient network-intrusion-detection methods, the realistic and up-to-date network flow benchmark datasets are required [58]. Here, we present a new benchmark dataset collected from a real-world network and aimed at assessing the performance of NIDS in network-attack detection. Following the observation of Ring et al. [58], we aim to propose a dataset that is up-to-date, correctly annotated, publicly available, has real-world network traffic with multiple types of network attacks and examples of typical network behavior, and covers a considerable period of time. The dataset was created at KTU LITNET network from 06/03/2019 to 31/01/2020. The nfcap tool was used to capture network traffic. In order to create the traffic features, Nfsen, MeSequel, and Python script tools were used for feature generation. The dataset was annotated, to provide the examples of network-attack types. The key characteristics of the proposed dataset are a representation of real-world network traffic, which contains both normal network traffic and many examples of network attacks on real-world network infrastructure.
The remaining parts of the paper are structured as follows. Section 2 analyses the characteristics of known benchmark datasets and discusses their advantages and disadvantages. The environment of network data collection is described in Section 3. Section 4 presents a comprehensive comparison of the proposed dataset with other recently published network-intrusion datasets. Section 5 provides a detailed description of the files of the LITNET-2020 dataset. Finally, Section 6 presents conclusions and discusses further research plans.

2. Overview of Similar Datasets

The usability of any NIDS dataset reflects its power to provide information necessary to training the NIDS efficiently so that a high level of accuracy and reliability is achieved in detecting a diverse (as much as possible) set of network attacks. In this section, we discuss the main features of known network intrusion datasets (DDoS 2016 [43], UNSW-NB15 [59], CICIDS 2017 [60], UGR’16 [61], NSL-KDD [38], and CSE-CIC-IDS2018 [62]). Recently, several new network datasets have been proposed [63,64,65,66]. However, these have not yet been adopted by the research community as benchmark datasets. Therefore, hereinafter, for analysis, we use only datasets that have been widely adopted and used by the research community [58,67,68,69].

2.1. DDoS 2016

The dataset presents data collected in a controlled environment (using Network Simulator NS2), which has four malicious kinds of network attack: HTTP Flood, UDP flood, DDOS Using SQL injection (SIDDOS), and Smurf. The dataset has 27 features, 5 classes (4 attack classes and one normal traffic class) and 734,627 records.

2.2. UNSW-NB15

The UNSW-NB 15 dataset was generated by the IXIA PerfectStorm tool in a small network environment (only 45 unique IP addresses) over a short (31 h) period of time, and it includes a mix of real typical activities and artificial attack behaviors of the network traffic, resulting in 175,341 records for training and 82,332 records for testing. The IXIA tool simulated nine types of attacks. The dataset provides 49 features for analysis, which include basic features, content features (based on the content of packets), time features (based on time characteristics of packet flow), and additional generated features based on the statistical characteristics of connections.

2.3. CICIDS 2017

The dataset was made public by the Canadian Institute for Cybersecurity. The creation methodology used two types of usage profiles and multistage attacks, such as Heartbleed, and a variety of DoS and DDoS attacks. It has 80 network traffic features that are extracted by using the CICFlowMeter tool. User profiles were based on the abstract human behavior of 25 users working with the HTTP, HTTPS, FTP, SSH, and email protocols, aiming to generate the background traffic. The traffic was generated for a short (5 days) span of time.

2.4. UGR’16

This dataset was originated by the University of Granada (Spain) and is aimed for the assessment of cyclostationary NIDSs. The dataset was acquired from a tier-3 Internet Service Provider (ISP) over four months and has 16,900 million single directional flows. The real network traffic was mixed with synthetically generated malicious attack flows captured in a controlled network environment that somewhat decreases the quality of the dataset. It has 13 types of malware, including annotated botnet, SSH scan, and SPAM attacks, as well as background and normal network traffic, where background assumes that it is not known whether it contains a malicious traffic. The dataset was labeled by using the logs from the honeypot system.

2.5. NSL-KDD

NSL-KDD is an enhancement of the KDD dataset. In the KDD dataset, classification was biased toward more recurring records. However, in the NSL-KDD dataset, redundant items were removed, preventing the classifiers from achieving unreasonably high detection rates due to reoccurring records. The dataset includes 4 classes of attack: Denial of Service (DoS), Probe, User to Root (U2R), and Remote to Local (R2L). The training set has 4,898,431 records, and the testing set has 311,027 records.

2.6. CSE-CIC-IDS2018

The dataset covers six types of network attacks: Botnet, brute-force, Denial of Service (DoS), Distributed DoS (DDoS), infiltration, and web attacks. The dataset was generated based on the synthetic user profiles, which capture abstract representations of network events and behaviors. Fifty network nodes were used to organize an attack on the victim infrastructure with 420 computers and 30 servers. The dataset includes 84 network traffic features extracted from the network traffic, using the CICFlowMeter-V3 tool.

2.7. Summary

The comparison of the analyzed datasets by examples of attack types represented are summarized in Table 1. Here, Fuzzer aims to cause a network node suspended by transmitting to it the random data. Virus is a self-replicating malicious program that intrudes on the computer system without the knowledge of the user. Worm spreads through the network without the user’s permission, while consuming network bandwidth resources. Trojan is a malicious program that causes the security problems in the network, while masquerading as a useful program. DoS aims to reject access to network nodes or resources for other users. Network Attack is an attempt to endanger network security from the data link layer to the application layer. Physical Attack attempts to cripple the physical units of computers or networks. Password Attack aims to obtain a password by login, and can be discovered by several login failures. Information Gathering Attack searches for known security holes by scanning or probing network nodes. User to Root (U2R) attack aims to take advantage of vulnerabilities of a network system in order to gain privileges as the super-user of the system. Remote to Local (R2L) attack dispatches packets to a remote computer system, without having a valid account on that system, aiming to obtain access either as a user or as a root. Probe attack scans the networks aiming to find valid IP addresses and to collect private data about the host, in order to start an attack on a selected set of systems and services.
The comparison of old reference (DARPA’98 [70] and KDDCup’99 [71]) datasets and more recent network intrusion datasets is presented in Table 2.

2.8. Conclusion of Dataset Analysis

New cyberthreats and types of attack continue to emerge. As a result, new realistic network datasets are needed to keep the development and benchmarking of network intrusion methods up-to-date. This has motivated us to collect network flow data from real-world network and to present it as an open benchmark dataset to be used freely for the research community in the cyber security domain.

3. Proposed Dataset

In this section, we describe the network environment used for the collection of network traffic data, provide the description of network attacks that are present in the dataset, provide the descriptive characteristics of the dataset, and discuss dataset preprocessing and availability.

3.1. Network Environment

Infrastructure consists of connecting nodes with operating equipment and communication lines connecting those exporters (CAPACITY, CYTI1, KTU UNIVERSITY 1, KTU UNIVERSITY 2, and FIREWALL). The LITNET NetFlow topology consists of two main parts (senders and collector). The NetFlow sender are Cisco (Cisco Systems Inc., San Jose, CA, USA) routers. Fortige (FG-1500D) high-performance-next generation firewalls analyze the data that has passed through it, process the data, and send it to one or more NetFlow server collectors. The NetFlow server (collector) is a server with appropriate software (nfcapd, nfdump, nfexpire, nfprofile, nfreplay, and nftrack. Version: 1.6.15), which is responsible for receiving, storing, and filtering data. We are used 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 x86_64 GNU/Linux operating system; 4 core Intel Xeon Processor (Skylake, IBRS) CPU processor; 10GB for system and 30T hard disks for collecting data; and 8 GB of RAM.
Each of the NetFlow Exporters (CAPACITY, CYTI1, KTU UNIVERSITY 1, KTU UNIVERSITY 2, and FIREWALL) continuously monitors the flows passing through it (this is a sequence of previous data packets in one direction from a specific sender to a specific recipient) and caches them when it receives new traffic. The main ring connects the five largest Lithuanian cities (see Figure 1), which are as follows: (CITY1) Kaunas–Vytautas Magnus University and Kaunas Technological University, which is administrator of Lithuanian Research and Education Network (LITNET) and maintenance and development connecting nodes and Vilnius University (CAPACITY); Vilnius Gediminas Technical University (CAPACITY); Klaipeda University (CITY2); Siauliai University (CITY3); and KTU Panevezys Faculty of Technologies and Business (CITY4). For efficiency reasons, the NetFlow sender only scans the first packet of the new stream, which saves the corresponding values, and then subsequent packets of the same stream are processed according to the same policy, thus reducing the load on the network device. Kaunas University of Technology (FIREWALL) has a high availability infrastructure and a perimeter Fortigate 1500D with FortiOS operating system, 80 Gbps network, IPS (intrusion prevention system) 13 Gbps, NGFW (next-generation firewall) 7 Gbps, and Threat Protection 5 Gbps bandwidth, firewall. CITY1 has an exit to broadband networks NORDUNET, and GEANT. CITY1 and CITY2 have a peering connection, and this is a process by which two Internet networks connect and exchange traffic. CITY1 and CITY2 can transfer data traffic directly between each other’s LITNET users. Every other city (CITY2, CITY3, and CITY4) has end users. These are schools, municipalities, other organizations.
We used a 5 min time interval to make a nfcap (KTU UNIVERSITY1, KTU UNIVERSITY2, CITY1, and CAPACITY) file (nfcap is an application programming interface (API) for capturing network traffic with the format nfcapd.YYYYMMDDHHMM) and sent it to the NetFlow server to process information. In this time interval, we counted the number of packets, which satisfied the attack rules to distinguish the type of attack.

3.2. Description of Network Attacks

We describe the attack types in the proposed dataset as follows.
Smurf attack keeps sending the Internet Control Message Protocol (ICMP) broadcast requests to the network on behalf of the target node, aiming to flood the node with network traffic in order to slow down the targeted node.
An Internet Control Message Protocol (ICMP) flood attack is a DoS attack which aims to overwhelm a targeted network node with ICMP echo-requests (pings).
UDP-flood attack is a DoS attack using the User Datagram Protocol (UDP). A DNS Flood Attack (DNS Flooding) is an application-specific variant of a UDP flood, which is characterized by network packets sent to any IP address, using UDP protocol and port 53 as the target.
TCP SYN-flood attack is a Distributed DoS (DDoS) attack that misuses a part of the ordinary Transmission Control Protocol (TCP) three-way handshake to drain resources on the victim node and make it unresponsive. The attack packets packages have S flags but do not have the AFRPU flags.
HTTP-flood attack is a DDoS attack, which exploits seemingly legitimate Hyper Text Transfer Protocol (HTTP) GET or POST requests to assail a web server or application. In a complex Layer 7 attack, HTTP floods do not use ill-formed packets, spoofing, or reflection and need less bandwidth than other types of attacks, in order to disable the victim server or site. The attack packets are directed only to 80 port.
LAND attack is a Layer 4 DoS attack in which the malicious node sets the same source and destination data of a TCP segment. The attack packets have S flags and use the TCP protocol. An attacked node will hang due to the same packet being repeatedly processed by the TCP stack.
W32.Blaster Worm attack spreads by utilizing the Buffer Overrun Vulnerability of Microsoft Windows DCOM RPC Interface. The attacks are directed only to 135, 69 (TFTP), and 4444 (Kerberos) ports.
Code Red Worm attack aims to cause a buffer overflow problem on a target node, so that it begins to overwrite the adjacent memory. The packets are directed to source IP and only to 80 (no Secure Sockets Layer (SSL)) ports; this is how the HTTP GET method is applied.
Spam bot’s attack dispatches spam messages or posts spam in social media platforms or forums. The packets are directed only to 25 (no SSL) port. The attack is characterized by the presence of an excessively large number of SMTP connections from one address.
Reaper Worm attack begins its last phase of scanning once the IP is passed to the exploit process. Reaper attack is directed at TCP ports 81, 82, 83, 84, 88, 1080, 3000, 3749, 8001, 8060, 8080, 8081, 8090, 8443, 8880, and 10,000. An attack is only recorded when the package contains the TCP stream and have not UDP or ICMP or ICMP6 protocols.
Port Scanning/Spread attack dispatches client requests to some server port addresses, aiming to discover an active port and taking of advantage of a known security hole. An abnormal number of connections from one host to one or more other hosts is as follows: several ports, one address; single port, multiple addresses.
Packet fragmentation attack is a kind of DoS attack, in which the attacker overloads a network by taking advantage of the datagram fragmentation.

3.3. Descriptive Characteristics

The traffic analysis is described for the cumulative flows while generating the dataset. The descriptive characteristics of the dataset are given in Table 3.
Table 4 represents the distribution of all data instances of the proposed dataset. All instances are categorized into ordinary data and attack data. The attack instances are further categorized into nine classes, according to the type of the network attack.
A DoS attack with SYN packet can be explained in a simple way as the flow of illegal traffic to network resources from an IP address or the flow of IP addresses that results in a lack of network resources. The attackers disrupt the three-way click sequence by not responding to the SYN-ACK from the server, or they will constantly send a SYN packet from a non-existent IP, the server actually supports the queue set to which the SYN-ACK is sent because there will be no response from the clients, the queue will overflow, and the server will no longer be available. This is called a SYN Attack or Flood. The example of network traffic flows is shown in Figure 2.
Here, we can see that there is an obvious anomaly in sending SYN packets on the Kaunas (CITY1) channel. As you can see in the graph, this attack lasts two days (started from 2019-10-05 and ended on 2019-10-07). We also see that this attack occupies 14 Mb/s data traffic on the Kaunas (CITY1) channel. According to the presented real case, we can see that, on 2019-10-07 at 09:30, there was an attack peak of TCP SYN packets. This is evidenced by the huge number of packets in the data stream layout and the exceptional increase in data traffic on the graph.
For visualization of NetFlow, based on our proposed attack detection, the rules (see Figure 2; Profile: SYN) were developed for automatic notification of a possible cyber incident. For analysis of data stored in the collector (NetFlow server), we used flow-tools, nfstat, flowd, nfsen 1.3.6p1, php version-7.4.1, and Apache-2.4.25. MySQL database: version-10.1.41 (with 45 492 310 table records).

3.4. Dataset Availability and Preprocessing

For the collection of data, we use a methodology suggested in [61]. The network traffic data are captured in the nfcapd binary format files. The nfcapd files are collected in a single file per week for two capture periods. The mean size of files is about 1.35 GB (compressed). The nfcapd files have all NetFlow features, extended with 19 custom attack detection features which starts from 2019-03-06 first flow and 2020-01-31 last flow. The IP addresses of network nodes have been anonymized. Information from senders (NetFlow raw data files) to the collector is received in NetFlow v9 format (rfc3954). All values are transferred to the MySQL database (NetFlow database SQL server).
All dataset files can be freely downloaded from our website: https://dataset.litnet.lt.
The dataset formation is summarized in Figure 3. The data preprocessor selects 49 attributes that are specific to the NetFlow v9 (RFC 3954) protocol to form a dataset. The Data extender expands the generated dataset with additional fields of time, tcp flags, which are later used to identify attacks. The Extended dataset is supplemented by a set of 15 attributes. The generator creates additional 19 attributes for attack type recognition (see Table 5). The combinations of these and NetFlow attributes are used to detect attacks. We also added two additional fields to separate in the dataset, where the record is assigned to the attack and what specific type of attack, and where the normal network traffic is. Therefore, we have a total of 85 attributes.

3.5. Summary

The proposed dataset was collected in the real-world network, over an extended period of time (10 months), and contains real network attacks over the country-wide network infrastructure with servers in four geographically distributed locations (cities). As such, the proposed network flow dataset is more advantageous than some of its counterparts (such as UNSW-NB 15 dataset [59]), which generated the attacks artificially and thus do not contain realistic data.

4. Description and Statistical Analysis of Dataset Features

This section presents the description and analysis of dataset features. First, we formulate the requirements for dataset features. Then, we present the description of different classes of features. Next, we analyze the statistical distribution of the feature values and illustrate the results by figures. Finally, we summarize the results.

4.1. Requirements for A Dataset and Its Features

We formed the dataset by using the requirements for NIDS evaluation datasets outlined in [61] as follows. The dataset features should include network flow characteristics, such as IP addresses and port numbers, number of packets and bytes, flow duration, and flags. The dataset records should be correctly labeled as malicious or not, and in case of attack records, they should also include the type of attack. The dataset should cover several different periods of network activity, such as daytime/nighttime and weekdays/weekends.

4.2. Description of Features

In describing the features, we follow the description scheme suggested in [72] that considers the flow, basic, content, general purpose time slice, and connection features. These are summarized in Table 5, Table 6 and Table 7. The source_IP, target_IP, and time are noted as key for intrusion detection [73]. Source and destination ports, source and destination IP addresses, and time were mentioned as the most informative features for intrusion alerting [19]. Similar time-slice features based on the calculation of unique IP addresses within a time window were successfully used for network-attack detection before [74]. Additionally, we provide two network-attack attributes labeled by the network security experts, in Table 8.

4.3. Analysis of Features

Following [75], we calculated the mean and standard deviation of all features. We studied the feature variance, using the cumulative distribution function (CDF), as was suggested in [76]. The results are shown in Figure 4.
These feature distributions are heavy tailed, and the smaller values make most cases. Moreover, 95% of source network nodes connect to fewer than 29 unique destination IPs, only 1% of source nodes may connect to more than 185 unique IPs, and only 0.1% connect to more than 2900 unique IPs (SYN attack subset of the dataset). Such outliers help in identifying the malicious behavior in the network. Feature value distributions also reveal possible correlations between features. For example, the Pearson correlation between input packets and input bytes is 0.92, when an attacker performs the SYN attack. Such high levels of correlation allow us to identify features containing duplicate information.
To analyze the dynamical changes in the network flows, we applied window slices and observed the change of unique source and destination IPs over time. Here, we used a window of 10,000 NetFlows moved with a step of 5000 NetFlows. The results are presented in Figure 5 for the IP addresses and in Figure 6 for the port connections. One can see sharp changes in the behavior of network nodes, which may be indicative of the network attack.
The distribution of data according to the protocol types is presented in Figure 7. The most common protocols in the dataset are TCP and UDP.
The temporal frequency of the source and destination ports in the NetFlow connections are presented in Figure 8. As a baseline, a reference frequency is given, if the distribution of connections to ports would be uniform. Note that connections from/to some ports are much more frequent, e.g., the most frequent source ports are 54,438 and 444, while the most frequent destination ports are 444 and 54.
The statistical distribution of the connection flags in the TCP SYN-flood subset of the dataset is presented in Figure 9. It shows that most of the network connections had the S (TCP SYN) flag.
In case of the scan spread attack, the attacker scans for new IPs in the network. As a result, the number of unique destination IPs in a NetFlow window slice grows steadily during the attack (see Figure 10, left). Moreover, the attacker performs the port scan on the attack nodes, which can be seen from the distribution of port numbers in time (see Figure 10, right).
The values of the time-slice-based connection features are presented in Figure 11. Note the sharp changes and peaks in the values, which may be indicative of network attacks. The statistical distribution of feature values is highly skewed and shows a considerable difference between the values. We used the violin plot, which was already used to visualize the distribution of network-traffic features before, in [58]. See the violin plot of feature-value distribution presented in Figure 12.
For the analysis and unsupervised clustering, we first performed the normalization of the dataset features. For the analysis of unsupervised datasets, the dimensionality reduction methods, such as Principal Component Analysis (PCA), are often used [77]. Here, we applied the t-stochastic Neighbor Embedding (t-SNE) method [78] to reduce the dimensionality to two dimensions. The resulting low-dimensional embedding has clusters assigned by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [79]. As a result, we obtained 12 clusters corresponding to different network-attack types, which are shown in Figure 13.
To analyze the differences between clusters, we adopted a pairwise two-sample multivariate Kolmogorov–Smirnov (KS) test, which is used to determine whether two sets of data arise from the same or different distributions. The null hypothesis was that the data in both pairs of compared clusters are drawn from the same continuous distribution. We used a two-dimensional version of the KS test because the low-dimensional embedding was two-dimensional. The hypothesis was rejected (p < 0.001) for all pairs of clusters.
The distribution of feature values according to the attack-behavior clusters in time can be seen in Figure 14. To allow for better comparison, all feature values were normalized to (0,1).
To evaluate the significance of each feature, we used two tests. First, we applied the t-test-based feature, ranking using each cluster vs. all other clusters as a dependent variable. In another test, we used the split-half approach by splitting the set of clusters in half randomly and performing feature ranking, while the procedure was repeated N (N = 1000) times. The results of feature ranking were analyzed by using the non-parametric ranking-based Friedman test, and the result was statistically significant (p < 0.05). Finally, the post hoc Nemenyi test was applied, and its results were presented by using the significance diagram [80] (see Figure 15). Note that, according to the one-versus-all splitting, there is no statistically significant difference between the feature ranks (Figure 15a); therefore, all features are significant and contribute to the constructions of clusters. We also performed the split-half testing. The feature-ranking results (Figure 15b) show that features F6, F9, F1, and F10 have the highest rank among all features, which is statistically significant (p < 0.001).

4.4. Conclusions of Dataset Analysis

The statistical analysis of the features included in the proposed dataset shows that the features can be to for detecting various types of network attacks. In particular, the time slice connection features can be embedded to lower dimensional space, where clustering methods can be applied to map the low-dimensional representation of features to network attack classes.

5. Comparison with Other Datasets

Table 9 shows a comparative analysis with the analyzed datasets, according to the number of networks, number of unique IP address, period of the data collection, attack vectors, and the number of features for each dataset.
The proposed network dataset was collected for a longer period of time (10 months) than other analyzed datasets; it covers more network-attack classes (12) and contains more features (85). Therefore, the proposed dataset could present a valuable contribution to the research community and enrich the available set of datasets for the development and improvement of new network-attack recognition methods.

6. Conclusions

Known network-intrusion benchmark datasets usually do not provide the realistic case of the modern network-traffic and network-attack scenarios. In contrast, the proposed dataset contains real-world network traffic data and annotated attack examples, rather than artificially simulated attacks executed in the sandbox network environment.
To facilitate the improvement of existing network-intrusion-detection methods, and the development of new, we have suggested a new network flow dataset. The dataset has 85 features that can be used to recognize 12 different types of network attacks. We provided an analysis and comparison of the proposed dataset with two classical and four other modern datasets by key features and described its advantages and limitations. Our dataset contains real network traffic captured over 10 months. This provides an advantage over synthetically generated datasets, because an artificial synthesis of network traffic might lead to incorrect network-attack models and behaviors.
In the future, we expect that the proposed dataset can be helpful to researchers working the cybersecurity domain and can be used as a modern benchmark network-intrusion dataset.

Author Contributions

Conceptualization, R.D., A.V., and S.G.; methodology, R.D., S.G., and J.T.; software, S.G., T.A., and P.S.; validation, J.T. and N.M.; formal analysis, R.D. and J.T.; investigation, J.T. and N.M.; resources, A.V. and S.G.; data curation, S.G., T.A., and P.S.; writing—original draft preparation, R.D.; writing—review and editing, A.V., S.G., and J.T.; visualization, R.D., S.G., and J.T.; supervision, A.V.; funding acquisition, A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported, in part, by the European Union’s Horizon 2020 research and innovation program, under Grant Agreement No. 830892, project “Strategic programs for advanced research and technology in Europe” (SPARTA).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Laštovička, M.; Čeleda, P. Situational Awareness: Detecting Critical Dependencies and Devices in a Network. In Security of Networks and Services in an All-Connected World; AIMS 2017, Zurich, Switzerland, 10–13 July 2017, Lecture Notes in Computer Science; Tuncer, D., Koch, R., Badonnel, R., Stiller, B., Eds.; Springer: Cham, Switzerland, 2017; Volume 10356, pp. 173–178. [Google Scholar]
  2. Liu, B.; Bi, J.; Vasilakos, A.V. Toward Incentivizing Anti-Spoofing Deployment. IEEE Trans. Inf. Forensics Secur. 2014, 9, 436–450. [Google Scholar] [CrossRef]
  3. Yao, G.; Bi, J.; Vasilakos, A.V. Passive IP Traceback: Disclosing the Locations of IP Spoofers from Path Backscatter. IEEE Trans. Inf. Forensics Secur. 2015, 10, 471–484. [Google Scholar] [CrossRef]
  4. Luo, H.; Chen, Z.; Li, J.; Vasilakos, A.V. Preventing Distributed Denial-of-Service Flooding Attacks With Dynamic Path Identifiers. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1801–1815. [Google Scholar] [CrossRef]
  5. Jang-Jaccard, J.; Nepal, S. A survey of emerging threats in cybersecurity. J. Comput. Syst. Sci. 2014, 80, 973–993. [Google Scholar] [CrossRef]
  6. Venčkauskas, A.; Morkevicius, N.; Jukavičius, V.; Damaševičius, R.; Toldinas, J.; Grigaliūnas, S. An edge-fog secure self-authenticable data transfer protocol. Sensors 2019, 19, 3612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Jing, Q.; Vasilakos, A.V.; Wan, J.; Lu, J.; Qiu, D. Security of the Internet of Things: Perspectives and challenges. Wirel. Netw. 2014, 20, 2481–2501. [Google Scholar] [CrossRef]
  8. Zhou, J.; Cao, Z.; Dong, X.; Vasilakos, A.V. Security and Privacy for Cloud-Based IoT: Challenges. IEEE Commun. Mag. 2017, 55, 26–33. [Google Scholar] [CrossRef]
  9. Liao, H.J.; Lin, C.H.R.; Lin, Y.C.; Tung, K.Y. Intrusion detection system: A comprehensive review. J. Netw. Comput. Appl. 2013, 36, 16–24. [Google Scholar] [CrossRef]
  10. Azeez, N.A.; Ayemobola, T.J.; Misra, S.; Maskeliūnas, R.; Damaševičius, R. Network intrusion detection with a hashing based apriori algorithm using Hadoop MapReduce. Computers 2019, 8, 86. [Google Scholar] [CrossRef] [Green Version]
  11. Nisioti, A.; Mylonas, A.; Yoo, P.D.; Katos, V. From intrusion detection to attacker attribution: A comprehensive survey of unsupervised methods. IEEE Commun. Surv. Tutor. 2018, 20, 3369–3388. [Google Scholar] [CrossRef]
  12. Wazid, M.; Das, A.K.; Bhat, K.V.; Vasilakos, A.V. LAM-CIoT: Lightweight authentication mechanism in cloud-based IoT environment. J. Netw. Comput. Appl. 2020, 150, 102496. [Google Scholar] [CrossRef]
  13. Yu, Y.; Xue, L.; Au, M.H.; Susilo, W.; Ni, J.; Zhang, Y.; Vasilakos, A.V.; Shen, J. Cloud data integrity checking with an identity-based auditing mechanism from RSA. Future Gener. Comput. Syst. 2016, 62, 85–91. [Google Scholar] [CrossRef]
  14. Wei, W.; Woźniak, M.; Damaševičius, R.; Fan, X.; Li, Y. Algorithm Research of Known-plaintext Attack on Double Random Phase Mask Based on WSNs. J. Internet Technol. 2019, 20, 39–48. [Google Scholar] [CrossRef]
  15. Challa, S.; Das, A.K.; Odelu, V.; Kumar, N.; Kumari, S.; Khan, M.K.; Vasilakos, A.V. An efficient ECC-based provably secure three-factor user authentication and key agreement protocol for wireless healthcare sensor networks. Comput. Electr. Eng. 2018, 69, 534–554. [Google Scholar] [CrossRef]
  16. Khan, K.; Mehmood, A.; Khan, S.; Khan, M.A.; Iqbal, Z.; Mashwani, W.K. A survey on intrusion detection and prevention in wireless ad-hoc networks. J. Syst. Archit. 2020, 105, 101701. [Google Scholar] [CrossRef]
  17. Wu, W.; Li, R.; Xie, G.; An, J.; Bai, Y.; Zhou, J.; Li, K. A survey of intrusion detection for in-vehicle networks. IEEE Trans. Intell. Transp. Syst. 2020, 21, 919–933. [Google Scholar] [CrossRef]
  18. Hande, Y.; Muddana, A. A survey on intrusion detection system for software defined networks (SDN). Int. J. Bus. Data Commun. Netw. 2020, 16, 28–47. [Google Scholar] [CrossRef]
  19. Shu, Z.; Wan, J.; Li, D.; Lin, J.; Vasilakos, A.V.; Imran, M. Security in Software-Defined Networking: Threats and Countermeasures. Mob. Netw. Appl. 2016, 21, 764–776. [Google Scholar] [CrossRef]
  20. Li, Y.; Xu, Y.; Liu, Z.; Hou, H.; Zheng, Y.; Xin, Y.; Cui, L. Robust detection for network intrusion of industrial IoT based on multi-CNN fusion. Meas. J. Int. Meas. Confed. 2020, 154. [Google Scholar] [CrossRef]
  21. Farivar, F.; Haghighi, M.S.; Jolfaei, A.; Alazab, M. Artificial Intelligence for Detection, Estimation, and Compensation of Malicious Attacks in Nonlinear Cyber-Physical Systems and Industrial IoT. IEEE Trans. Ind. Inform. 2020, 16, 2716–2725. [Google Scholar] [CrossRef]
  22. Wazid, M.; Das, A.K.; Kumar, N.; Vasilakos, A.V.; Rodrigues, J.J.P.C. Design and Analysis of Secure Lightweight Remote User Authentication and Key Agreement Scheme in Internet of Drones Deployment. IEEE Internet Things J. 2019, 6, 3572–3584. [Google Scholar] [CrossRef]
  23. Lin, C.; He, D.; Huang, X.; Choo, K.-K.R.; Vasilakos, A.V. BSeIn: A blockchain-based secure mutual authentication with fine-grained access control system for industry 4.0. J. Netw. Comput. Appl. 2018, 116, 42–52. [Google Scholar] [CrossRef]
  24. Wazid, M.; Das, A.K.; Kumar, N.; Conti, M.; Vasilakos, A.V. A Novel Authentication and Key Agreement Scheme for Implantable Medical Devices Deployment. IEEE J. Biomed. Health Inform. 2018, 22, 1299–1309. [Google Scholar] [CrossRef] [PubMed]
  25. Shalaginov, A.; Semeniuta, O.; Alazab, M. MEML: Resource-aware MQTT-based Machine Learning for Network Attacks Detection on IoT Edge Devices. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing Companion—UCC ’19 Companion, Auckland, New Zealand, 2–5 December 2019. [Google Scholar] [CrossRef] [Green Version]
  26. Zhou, J.; Dong, X.; Cao, Z.; Vasilakos, A.V. Secure and Privacy Preserving Protocol for Cloud-Based Vehicular DTNs. IEEE Trans. Inf. Forensics Secur. 2015, 10, 1299–1314. [Google Scholar] [CrossRef]
  27. Yan, Z.; Zhang, P.; Vasilakos, A.V. A security and trust framework for virtualized networks and software-defined networking. Secur. Commun. Netw. 2015, 9, 3059–3069. [Google Scholar] [CrossRef]
  28. Wazid, M.; Das, A.K.; Kumar, N.; Vasilakos, A.V. Design of secure key management and user authentication scheme for fog computing services. Future Gener. Comput. Syst. 2019, 91, 475–492. [Google Scholar] [CrossRef]
  29. Odusami, M.; Abayomi-Alli, O.; Misra, S.; Shobayo, O.; Damasevicius, R.; Maskeliunas, R. Android Malware Detection: A Survey. In Applied Informatics—First International Conference, ICAI 2018, Bogotá, Colombia, 1–3 November 2018. Commun. Comput. Inf. Sci. 2018, 942, 255–266. [Google Scholar] [CrossRef]
  30. Rajagopal, S.; Kundapur, P.P.; Hareesha, K.S. A stacking ensemble for network intrusion detection using heterogeneous datasets. Secur. Commun. Netw. 2020. [Google Scholar] [CrossRef] [Green Version]
  31. Odusami, M.; Misra, S.; Adetiba, E.; Abayomi-Alli, O.; Damasevicius, R.; Ahuja, R. An Improved Model for Alleviating Layer Seven Distributed Denial of Service Intrusion on Webserver. J. Phys. Conf. Ser. 2019, 1235, 012020. [Google Scholar] [CrossRef]
  32. Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Network Anomaly Detection: Methods, Systems and Tools. IEEE Commun. Surv. Tutor. 2014, 16, 303–336. [Google Scholar] [CrossRef]
  33. Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J. Survey of intrusion detection systems: Techniques, datasets and challenges. Cybersecurity 2019, 2, 20. [Google Scholar] [CrossRef]
  34. Alhaj, T.A.; Siraj, M.M.; Zainal, A.; Elshoush, H.T.; Elhaj, F. Feature Selection Using Information Gain for Improved Structural-Based Alert Correlation. PLoS ONE 2016, 11, e0166017. [Google Scholar] [CrossRef] [Green Version]
  35. Ramaki, A.A.; Atani, R.E. A survey of IT early warning systems: Architectures, challenges, and solutions. Secur. Commun. Netw. 2016, 9, 4751–4776. [Google Scholar] [CrossRef]
  36. Divekar, A.; Parekh, M.; Savla, V.; Mishra, R.; Shirole, M. Benchmarking datasets for anomaly-based network intrusion detection: KDD CUP 99 alternatives. In Proceedings of the 2018 IEEE 3rd International Conference on Computing, Communication and Security, ICCCS 2018, Katmandu, Nepal, 25–27 October 2018; pp. 1–8. [Google Scholar] [CrossRef] [Green Version]
  37. Siddique, K.; Akhtar, Z.; Aslam Khan, F.; Kim, Y. KDD Cup 99 Data Sets: A Perspective on the Role of Data Sets in Network Intrusion Detection Research. Computer 2019, 52, 41–51. [Google Scholar] [CrossRef]
  38. Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defence Applications, Ottawa, ON, Canada, 8–10 July 2009; pp. 1–6. [Google Scholar] [CrossRef] [Green Version]
  39. Elkhadir, Z.; Mohammed, B. A cyber network attack detection based on GM median nearest neighbors LDA. Comput. Secur. 2019, 86, 63–74. [Google Scholar] [CrossRef]
  40. Gao, L.; Li, Y.; Zhang, L.; Lin, F.; Ma, M. Research on detection and defense mechanisms of DoS attacks based on BP neural network and game theory. IEEE Access 2019, 7, 43018–43030. [Google Scholar] [CrossRef]
  41. Yao, H.; Fu, D.; Zhang, P.; Li, M.; Liu, Y. MSML: A novel multilevel semi-supervised machine learning framework for intrusion detection system. IEEE Internet Things J. 2019, 6, 1949–1959. [Google Scholar] [CrossRef] [Green Version]
  42. Yao, H.; Wang, Q.; Wang, L.; Zhang, P.; Li, M.; Liu, Y. An intrusion detection framework based on hybrid multi-level data mining. Int. J. Parallel Program. 2019, 47, 740–758. [Google Scholar] [CrossRef] [Green Version]
  43. Alkasassbeh, M.; Al-Naymat, G.; Hassanat, A.; Almseidin, M. Detecting Distributed Denial of Service Attacks Using Data Mining Techniques. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2016, 7, 436–445. [Google Scholar] [CrossRef] [Green Version]
  44. Creech, G.; Hu, J. Generation of a new IDS test dataset: Time to retire the KDD collection. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC), Shanghai, China, 7–10 April 2013; pp. 4487–4492. [Google Scholar] [CrossRef]
  45. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the development of realistic botnet dataset in the Internet of Things for network forensic analytics: Bot-IoT dataset. Future Gener. Comput. Syst. 2019, 100, 779–796. [Google Scholar] [CrossRef] [Green Version]
  46. Bhattacharya, S.; Selvakumar, S. SSENet-2014 dataset: A dataset for detection of multiconnection attacks. In Proceedings of the 3rd International Conference on Eco-Friendly Computing and Communication Systems, ICECCS 2014, Mangalore, India, 18–21 December 2014; pp. 121–126. [Google Scholar] [CrossRef]
  47. Cordero, C.G.; Vasilomanolakis, E.; Milanov, N.; Koch, C.; Hausheer, D.; Muhlhauser, M. ID2T: A DIY dataset creation toolkit for intrusion detection systems. In Proceedings of the 2015 IEEE Conference on Communications and Network Security, CNS 2015, Florence, Italy, 28–30 September 2015; pp. 739–740. [Google Scholar] [CrossRef]
  48. Singh, R.; Kumar, H.; Singla, R.K. A reference dataset for network traffic activity based intrusion detection system. Int. J. Comput. Commun. Control 2015, 10, 390–402. [Google Scholar] [CrossRef] [Green Version]
  49. Belenko, V.; Krundyshev, V.; Kalinin, M. Synthetic datasets generation for intrusion detection in VANET. In Proceedings of the 11th International Conference on Security of Information and Networks, Cardiff, UK, 10–12 September 2018; pp. 1–6. [Google Scholar] [CrossRef]
  50. Vasilomanolakis, E.; Cordero, C.G.; Milanov, N.; Mühlhäuser, M. Towards the creation of synthetic, yet realistic, intrusion detection datasets. In Proceedings of the 2016 IEEE/IFIP Network Operations and Management Symposium, NOMS 2016, Istanbul, Turkey, 25–29 April 2016; pp. 1209–1214. [Google Scholar] [CrossRef]
  51. Magán-Carrión, R.; Urda, D.; Díaz-Cano, I.; Dorronsoro, B. Towards a Reliable Comparison and Evaluation of Network Intrusion Detection Systems Based on Machine Learning Approaches. Appl. Sci. 2020, 10, 1775. [Google Scholar] [CrossRef] [Green Version]
  52. Casas, P.; Mazel, J.; Owezarski, P. Unsupervised network intrusion detection systems: Detecting the unknown without knowledge. Comput. Commun. 2012, 35, 772–783. [Google Scholar] [CrossRef] [Green Version]
  53. Kanda, Y.; Fontugne, R.; Fukuda, K.; Sugawara, T. ADMIRE: Anomaly detection method using entropy based PCA with three-step sketches. Comput. Commun. 2013, 36, 575–588. [Google Scholar] [CrossRef]
  54. Meira, J.; Andrade, R.; Praça, I.; Carneiro, J.; Bolón-Canedo, V.; Alonso-Betanzos, A.; Marreiros, G. Performance evaluation of unsupervised techniques in cyber-attack anomaly detection. J. Ambient Intell. Humaniz. Comput. 2019. [Google Scholar] [CrossRef]
  55. Umer, M.F.; Sher, M.; Bi, Y. A two-stage flow-based intrusion detection model for next-generation networks. PLoS ONE 2018, 13. [Google Scholar] [CrossRef] [Green Version]
  56. Fadlullah, Z.M.; Taleb, T.; Vasilakos, A.V.; Guizani, M.; Kato, N. DTRAB: Combating Against Attacks on Encrypted Protocols Through Traffic-Feature Analysis. IEEE/ACM Trans. Netw. 2010, 18, 1234–1247. [Google Scholar] [CrossRef]
  57. Zhang, J.; Chen, C.; Xiang, Y.; Zhou, W.; Vasilakos, A.V. An Effective Network Traffic Classification Method with Unknown Flow Detection. IEEE Trans. Netw. Serv. Manag. 2013, 10, 133–147. [Google Scholar] [CrossRef]
  58. Ring, M.; Wunderlich, S.; Scheuring, D.; Landes, D.; Hotho, A. A survey of network-based intrusion detection data sets. Comput. Secur. 2019, 86, 147–167. [Google Scholar] [CrossRef] [Green Version]
  59. Moustafa, N.; Slay, J. UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems. In Military Communications and Information Systems Conference (MilCIS); IEEE: Canberra, Australia, 2015; pp. 1–6. [Google Scholar]
  60. Sharafaldin, I.; Habibi Lashkari, A.; Ghorbani, A.A. A Detailed Analysis of the CICIDS2017 Data Set. In ICISSP; Revised Selected Papers; Springer: Cham, Switzerland, 2018; pp. 172–188. [Google Scholar]
  61. Maciá-Fernández, G.; Camacho, J.; Magán-Carrión, R.; García-Teodoro, P.; Therón, R. UGR’16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs. Comput. Secur. 2018, 73, 411–424. [Google Scholar] [CrossRef] [Green Version]
  62. UNB CSE-CIC-IDS2018 on AWS. Available online: https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 9 May 2020).
  63. Krundyshev, V.M. Preparing datasets for training in a neural network system of intrusion detection in industrial systems. Autom. Control Comput. Sci. 2019, 53, 1012–1016. [Google Scholar] [CrossRef]
  64. Almomani, I.; Al-Kasasbeh, B.; AL-Akhras, M. WSN-DS: A Dataset for Intrusion Detection Systems in Wireless Sensor Networks. J. Sens. 2016, 2016, 1–16. [Google Scholar] [CrossRef] [Green Version]
  65. Al-Hadhrami, Y.; Hussain, F.K. Real time dataset generation framework for intrusion detection systems in IoT. Future Gener. Comput. Syst. 2020, 108, 414–423. [Google Scholar] [CrossRef]
  66. Zago, M.; Gil Pérez, M.; Martínez Pérez, G. UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection. Data Brief 2020, 30. [Google Scholar] [CrossRef]
  67. Ahmed, M.; Naser Mahmood, A.; Hu, J. A survey of network anomaly detection techniques. J. Netw. Comput. Appl. 2016, 60, 19–31. [Google Scholar] [CrossRef]
  68. Moustafa, N.; Hu, J.; Slay, J. A holistic review of network anomaly detection systems: A comprehensive survey. J. Netw. Comput. Appl. 2019, 128, 33–55. [Google Scholar] [CrossRef]
  69. Salo, F.; Injadat, M.; Nassif, A.B.; Shami, A.; Essex, A. Data mining techniques in intrusion detection systems: A systematic literature review. IEEE Access 2018, 6, 56046–56058. [Google Scholar] [CrossRef]
  70. DARPA Intrusion Detection Evaluation Dataset. MIT Lincoln Lab. Available online: https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset (accessed on 9 May 2020).
  71. KDD Cup 1999. Available online: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 9 May 2020).
  72. Moustafa, N.; Slay, J. The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set. Inf. Secur. J. A Glob. Perspect. 2016, 25, 18–31. [Google Scholar] [CrossRef]
  73. Smith, R.; Japkowicz, N.; Dondo, M.; Mason, P. Using unsupervised learning for network alert correlation Advances in Artificial Intelligence. In Proceedings of the Canadian Conference on AI 2008, Windsor, Canada, 28–30 May 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 308–319. [Google Scholar]
  74. Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Towards generating real-life datasets for network intrusion detection. Int. J. Netw. Secur. 2015, 17, 683–701. [Google Scholar]
  75. Hafeez, I.; Antikainen, M.; Ding, A.Y.; Tarkoma, S. IoT-KEEPER: Detecting Malicious IoT Network Activity using Online Traffic Analysis at the Edge. IEEE Trans. Netw. Serv. Manag. 2020, 17, 45–59. [Google Scholar] [CrossRef] [Green Version]
  76. Wahid, A.; Leckie, C.; Zhou, C. Estimating the number of hosts corresponding to an intrusion alert while preserving privacy. J. Comput. Syst. Sci. 2014, 80, 502–519. [Google Scholar] [CrossRef]
  77. Eid, H.F.; Darwish, A.; Hassanien, A.E.; Abraham, A. Principle Components Analysis and Support Vector Machine based Intrusion Detection System. In Proceedings of the 10th International Conference on Intelligent Systems Design and Applications ISDA, Cairo, Egypt, 29 November–1 December 2010; pp. 363–367. [Google Scholar]
  78. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  79. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; AAAI Press: Menlo Park, CA, 1996; pp. 226–231. [Google Scholar]
  80. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
Figure 1. LITNET (Lithuanian Research and Education Network) network environment used for the collection of the dataset.
Figure 1. LITNET (Lithuanian Research and Education Network) network environment used for the collection of the dataset.
Electronics 09 00800 g001
Figure 2. Example of network traffic flow.
Figure 2. Example of network traffic flow.
Electronics 09 00800 g002
Figure 3. Processing of data and formation of the proposed dataset.
Figure 3. Processing of data and formation of the proposed dataset.
Electronics 09 00800 g003
Figure 4. Feature value distributions observed in network flows as captured in the proposed dataset: (a) unique source IPs, (b) unique destination IPs, (c) unique source ports, and (d) unique destination ports.
Figure 4. Feature value distributions observed in network flows as captured in the proposed dataset: (a) unique source IPs, (b) unique destination IPs, (c) unique source ports, and (d) unique destination ports.
Electronics 09 00800 g004
Figure 5. Maximum flows and number of unique IP addresses in NetFlow window slices.
Figure 5. Maximum flows and number of unique IP addresses in NetFlow window slices.
Electronics 09 00800 g005
Figure 6. Maximum number of ports and the number of unique ports in NetFlow window slices.
Figure 6. Maximum number of ports and the number of unique ports in NetFlow window slices.
Electronics 09 00800 g006
Figure 7. The distribution of data, according to the protocol types.
Figure 7. The distribution of data, according to the protocol types.
Electronics 09 00800 g007
Figure 8. The frequency of the source and destination ports in the NetFlow connections.
Figure 8. The frequency of the source and destination ports in the NetFlow connections.
Electronics 09 00800 g008
Figure 9. The statistical distribution of connection flags (TCP SYN-flood).
Figure 9. The statistical distribution of connection flags (TCP SYN-flood).
Electronics 09 00800 g009
Figure 10. Unique IPs versus flow number and Ports addressed during the Scan spread attack.
Figure 10. Unique IPs versus flow number and Ports addressed during the Scan spread attack.
Electronics 09 00800 g010
Figure 11. Values of time-slice connection features (from Table 7).
Figure 11. Values of time-slice connection features (from Table 7).
Electronics 09 00800 g011
Figure 12. Statistical distribution of time-slice-based connection feature values from Table 7.
Figure 12. Statistical distribution of time-slice-based connection feature values from Table 7.
Electronics 09 00800 g012
Figure 13. The results of density-based clustering of low-dimensional feature embedding obtained by using the t-stochastic Neighbor Embedding (t-SNE) method.
Figure 13. The results of density-based clustering of low-dimensional feature embedding obtained by using the t-stochastic Neighbor Embedding (t-SNE) method.
Electronics 09 00800 g013
Figure 14. Distribution of time-slice feature values in time. The y-axis shows the normalized values of features.
Figure 14. Distribution of time-slice feature values in time. The y-axis shows the normalized values of features.
Electronics 09 00800 g014
Figure 15. Results of the Nemenyi test evaluating 12 time-slice-based connection features: one vs. all (a) and split-half (b).
Figure 15. Results of the Nemenyi test evaluating 12 time-slice-based connection features: one vs. all (a) and split-half (b).
Electronics 09 00800 g015aElectronics 09 00800 g015b
Table 1. Types of attacks supported by the analyzed network-intrusion datasets.
Table 1. Types of attacks supported by the analyzed network-intrusion datasets.
Network AttackDatasets
DDoS 2016 [43]UNSW-NB15 [59]CICIDS 2017 [60]UGR’16
[61]
NSL-KDD [38]CSE-CIC-IDS2018 [62]
FuzzersNoYesNoNoNoNo
GenericNoYesYesNoNoYes
VirusNoNoNoNoNoNo
WormNoYesNoNoNoNo
TrojanNoYesNoNoNoNo
DoSYesYesYesYesYesYes
DDoSYesNoYesNoNoYes
Network AttackNoNoNoNoNoNo
Physical AttackNoNoNoNoNoNo
Information Gathering AttackNoYesNoYesNoNo
User to Root (U2R) NoYesNoNoYesNo
Remote to Local (R2L)NoNoNoNoYesNo
ProbeYesNoNoYesYesNo
Brute-force NoNoYesNoNoYes
WebNoNoYesNoNoYes
InfiltrationNoYesYesNoNoYes
BotnetNoNoYesYesNoYes
Table 2. Comparison of reference and recent network intrusion datasets.
Table 2. Comparison of reference and recent network intrusion datasets.
Data-SetReference DatasetsMore-Recent Datasets
DARPA’98 [70]KDD Cup’99 [71]DDoS 2016 [43]UNSW-NB15 [59] CICIDS 2017 [60] UGR’16
[61]
Year199819992016201820172016
Type of trafficSynthetic
network
traffic
Synthetic
network
traffic
Randomized to obtain realistic resultsSynthetic
network
traffic
B-Profile systemReal network traffic
with realistic attacks
Raw binary
Data
4 GB n/an/a100 GB of the raw trafficn/a14 GB
How it was col-lectedtcpdumpbuilt from the DARPA’98 datasetA network simulator (NS2)tcpdump and IXIA traffic generator PerfectStormuser behavior based on FTP, email, HTTP, HTTPS, and SSH protocolsnetflow traces
Col-lection time7 weeksn/an/a31 h5 daysmore than 4 months
No. of records5 M records4.9 M single connection
vectors
734,627 records2 Millionn/a16,900 M
Label-ed attack typesn/aDoS, User to Root
(U2R), Remote to Local (R2L) and Probing Attack
DDoS attack (HTTP Flood, SIDDOS, UDP Flood, and Smurf)Fuzzers, Backdoors, DoS, Exploits, Generic, ReconnaissAnce, Shellcode, WormsBotnet, Brute Force SSH, DoS, DDoS, FTP,
Infiltration, Heartbleed, Web Attack
DoS, Scan,
Botnet (synthetic),
IP in blacklist,
UDP Scan,
SSH Scan,
SPAM, anomaly
Table 3. Descriptive characteristics of dataset.
Table 3. Descriptive characteristics of dataset.
CharacteristicSmurfICMP FloodUDP FloodSYN floodHTTP FloodLANDW32.BlasterCode RedSPAMReaper WormScanFrag-mentation
No of flows59,47911,62893,5833,725,83822,95952,41724,2911,255,70274711766232477
Source bytes15,471,9662,217,91018,506,9622,195,053,12015,138,94530,257,20011,636,600433,310,2701,247,220276,9601,893,620116,883,130
Source packets96,88438,523191,74654,827,825194,350756,430290,9159,850,28010,555692436,410149,595
Protocol typeTCP0003,725,83822,95952,41724,2911,255,702727117663230
UDP0093,58300000000477
IPv6001000000000
ICMP59,47911,6280000000000
Others000000000000
Unique IPsSource3865187131873,2111115
Destination 14900570876,188138,81524,00112311536229273
Unique portsSource1146,531411717,239411727112862311
Destination 53123016254364,5111121
Table 4. Distribution of attacks in the dataset.
Table 4. Distribution of attacks in the dataset.
Type of AttackNumber of FlowsNumber of Attacks (Flows)
Smurf3,994,42659,479
ICMP-flood3,863,65511,628
UDP-flood606,81493,583
TCP SYN-flood14,608,6783,725,838
HTTP-flood3,963,16822,959
LAND attack3,569,83852,417
Blaster Worm2,858,57324,291
Code Red Worm5,082,9521,255,702
Spam bot’s detection1,153,020747
Reaper Worm4,377,6561176
Scanning/Spread66876232
Packet fragmentation attack1,244,866477
TOTAL45,330,3335,328,934
Table 5. Flow additional attributes for attack identification.
Table 5. Flow additional attributes for attack identification.
No.Name of FeaturesAttributeDescription
1icmp_dst_ip_bicmp_smfFlooding network broadcast with ICMP packets
2icmp_src_ipicmp_fFlooding target with ICMP packets
3udp_dst_pudp_fDdos’ing with UDP traffic
4tcp_f_stcp_syn_fFlooding attack with SYN packets
5tcp_f_n_atcp_syn_fFlooding attack with SYN packets
6tcp_f_n_ftcp_syn_fFlooding attack with SYN packets
7tcp_f_n_rtcp_syn_fFlooding attack with SYN packets
8tcp_f_n_ptcp_syn_fFlooding attack with SYN packets
9tcp_f_n_utcp_syn_fFlooding attack with SYN packets
10tcp_dst_phttp_fDdos’ing with HTTP traffic
11tcp_src_dst_f_stcp_landLanding type of attack to any port with SYN packets
12tcp_src_tftptcp_w32_wFlooding TFTP service
13tcp_src_kerbtcp_w32_wFlooding Kerberos service
14tcp_src_rpctcp_w32_wFlooding RPC service
15tcp_dst_p_srctcp_red_wUses a vulnerability in a HTTP server
16smtp_dstsmtp_bFlooding with SMTP connections from one host
17udp_p_r_rangeudp_reaper_wScans on UDP ports 80, 8080, 81, 88, 8081, 82, 83, 84, 1080, 3000, 3749, 8001, 8060, 8090, 8443, 8880, and 10,000.
18p_range_dsttcp_udp_win_pSeveral ports, one address; single port, multiple addresses, ports NBT, Samba, MS-SQL-S, VNC, RDP, 2222
19udp_src_p_0udp_0Ddos’ing with UDP fragmented traffic
Table 6. Additional generated attributes.
Table 6. Additional generated attributes.
No.Type of AttackActionAttributeDescription
1SmurfFloodicmp_smfBroadcast requests to the network on behalf of the victim computer
2ICMP-floodFloodicmp_fLarge traffic of ICMP packets
3UDP-floodFloodudp_fLarge traffic flow to DNS
4TCP SYN-floodFloodtcp_syn_fLarge traffic of TCP traffic with SYN attack
5HTTP-floodFloodhttp_fHigh traffic with HTTP protocol
6LAND attackAttacktcp_landThe IP address of the target is indicated in the header of such an IP packet as the destination and departure addresses, and any open port on the system under attack is indicated as the destination and departure ports
7W32.Blaster WormWormtcp_w32_wLarge volume of traffic on Remote Procedure Call (RPC) port, for TFTP port, Kerberos authentication port
8Code Red WormWormtcp_red_wUses a vulnerability in a web server
9SAPM botsBotssmtp_bThe presence of an excessively large number of SMTP connections
10Reaper WormWormudp_reaper_wReaper is a botnet that uses the HTTP-based exploits of known vulnerabilities in IoT
11Scanning/SpreadAttacktcp_udp_win_pAn abnormal number of connections from one host to one or more other hosts
12Packet fragmentationAttackudp_0Denial of service attacks are based on the use of many fragmented packets
Table 7. Time-slice-based connection features.
Table 7. Time-slice-based connection features.
No.Description
1No. of unique source IP addresses in previous 10,000 connections
2No. of unique destination IP addresses in previous 10,000 connections
3No. of unique source ports in previous 10,000 connections
4No. of unique destination ports in previous 10,000 connections
5The largest count of connections from the same source IP address in previous 10,000 connections
6The largest count of connections from the same destination IP address port in previous 10,000 connections
7The largest count of connections from the same source port in previous 10,000 connections
8The largest count of connections from the same destination port in previous 10,000 connections
9The average count of connections from the same destination IP address and port in previous 10,000 connections
10The average count of connections from the same source port in previous 10,000 connections
11No. of connections with unique source IP–destination IP address pairs in previous 10,000 connections
12The largest count of connections with the same pair of source IP–destination IP addresses in previous 10,000 connections
Table 8. Labeled attributes.
Table 8. Labeled attributes.
Name of FeaturesDescription
attack_t(attack type) nine types: Smurf, ICMP-flood, UDP-flood, TCP SYN-flood, HTTP-flood, LAND attack, W32.Blaster Worm, Code Red Worm, SAPM bots, Reaper Worm, Scanning/Spread, Packet fragmentation or none
attack_a(attack action) 0 for typical (background) traffic and 1 for network attack
Table 9. Comparison of benchmark network datasets.
Table 9. Comparison of benchmark network datasets.
Para-meterDDoS 2016 [43]UNSW-NB15 [59]CICIDS 2017 [60]UGR’16
[61]
NSL-KDD [38]CSE-CIC-IDS2018 [62]Pro-Posed Dataset
No. of network nodesNetwork simulator (NS2)3210 M different (sub)networksn/a5 subnets1,395,951 IPs
Number of unique IPsn/a4521higher than 600 Mn/a5007,394,481
Period of data collectionn/a15/16 h5 dayscalibration set —100 days,
test set—1 month
7 weeks17 days10 months
Attack classes59774712
No. of features27498012418085

Share and Cite

MDPI and ACS Style

Damasevicius, R.; Venckauskas, A.; Grigaliunas, S.; Toldinas, J.; Morkevicius, N.; Aleliunas, T.; Smuikys, P. LITNET-2020: An Annotated Real-World Network Flow Dataset for Network Intrusion Detection. Electronics 2020, 9, 800. https://doi.org/10.3390/electronics9050800

AMA Style

Damasevicius R, Venckauskas A, Grigaliunas S, Toldinas J, Morkevicius N, Aleliunas T, Smuikys P. LITNET-2020: An Annotated Real-World Network Flow Dataset for Network Intrusion Detection. Electronics. 2020; 9(5):800. https://doi.org/10.3390/electronics9050800

Chicago/Turabian Style

Damasevicius, Robertas, Algimantas Venckauskas, Sarunas Grigaliunas, Jevgenijus Toldinas, Nerijus Morkevicius, Tautvydas Aleliunas, and Paulius Smuikys. 2020. "LITNET-2020: An Annotated Real-World Network Flow Dataset for Network Intrusion Detection" Electronics 9, no. 5: 800. https://doi.org/10.3390/electronics9050800

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop