Traffic Fingerprints for Homogeneous IoT Traffic Based on Packet Payload Transition Patterns

Fan, Mingrui; Gao, Jiaqi; He, Yaru; Shi, Weidong; Lu, Yueming

doi:10.3390/electronics13050930

by Mingrui Fan ¹, Jiaqi Gao ², Yaru He ², Weidong Shi ¹ and Yueming Lu ¹,*

¹ Key Laboratory of Trustworthy Distributed Computing and Service, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
² School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.

Electronics 2024, 13(5), 930; https://doi.org/10.3390/electronics13050930

Submission received: 8 January 2024 / Revised: 24 January 2024 / Accepted: 29 January 2024 / Published: 29 February 2024

(This article belongs to the Special Issue Advances in IoT Security)

Abstract:

Traffic fingerprinting is considered an effective security protection mechanism in IoT scenarios because it can be used to automatically identify accessed devices. However, the results of our replication experiments show that classic traffic fingerprints based on simple network traffic attribute features identify accessed devices significantly less well in real 5G IoT scenarios than was reported for traditional IoT scenarios. The growing homogenization of IoT traffic caused by the application of 5G is believed to be the reason for the poor ability of traditional traffic fingerprints to identify 5G IoT terminals, so an enhanced traffic fingerprint is needed to accommodate homogeneous IoT traffic. In addition, during the replication experiments, we noticed that the overlap solution is a key factor restricting the recognition ability of one-vs-all multi-classifiers, and that the efficiency of existing methods still leaves room for optimization. To address these two issues, we propose an enhanced IoT terminal traffic fingerprint based on packet payload transition patterns to improve device recognition in homogeneous IoT traffic. Additionally, we design an improved overlap solution based on density centers to expedite decision making. According to the experimental results, the proposed traffic fingerprint outperformed the existing traffic fingerprints and achieved a Macro-Average Precision of close to 90% on network traffic from real 5G IoT terminals. The proposed overlap solution based on density centers reduced the decision-making time from hundreds of seconds to tens of seconds while preserving decision-making accuracy.

Keywords:

IoT; overlap solution; packet payload transition; traffic fingerprint

1. Introduction

Due to the wide distribution of IoT terminals and their sensitivity to different cyber threats [1], the effectiveness of static profile-based IoT device recognition methods is limited. IoT device recognition based on traffic fingerprints has therefore attracted much attention from researchers as an important auxiliary means. In recent years, there has been some excellent research in the field of traffic fingerprinting, most of which relies on network traffic features.

However, we found that existing approaches based on classic network traffic features show limitations as new technologies represented by 5G and cloud computing are applied. Generally, traditional features capture hardware implementation-related information from network traffic, which can be used to identify IoT devices. For example, the TCP window size is related to the memory of the device, and the number of bytes sent or received per second is related to the processor. With the application of 5G, however, IoT devices based on different hardware communicate via a unified standard abstract interface, which makes network traffic homogeneous [2]. Network traffic then contains less information about the underlying implementations, so the traditional features do not provide enough differentiation between IoT devices. To demonstrate this view, we reproduced four traffic fingerprints based on classic features [3,4,5,6] on real traffic from 5G IoT terminals provided by the Shenzhen Power Supply Bureau. According to the experimental results, the performance of the same traffic fingerprint was significantly higher on the traditional IoT dataset than on the 5G IoT dataset, which indicates the limitations of traditional methods in homogenized network traffic. In addition, in the experiment with the one-vs-all SVM as the classifier, we found that the overlap phenomenon appeared inevitably and harmed the ability of classifiers to identify the accessed devices based on traffic fingerprints. Although researchers have noticed this problem and proposed a solution based on the maximum average similarity [6], its computational cost grows significantly with the number of training samples, which hinders the classifier from improving its accuracy with more training samples.

Given the limitations of existing IoT traffic fingerprint research in 5G IoT scenarios, we propose an enhanced traffic fingerprint based on packet payload transition patterns, which is suitable for homogeneous IoT traffic. According to the experimental results, when using tree-based classifiers, the Macro-Average Precision of the feature proposed in this study for 5G IoT terminals exceeded the baseline by 10–15%, and the Macro-Average Recall exceeded the baseline by nearly 15%. In addition, to improve the computational efficiency of overlap solutions, we propose an overlap solution based on density centers. The experimental results show that the proposed overlap solution achieved higher decision-making accuracy with lower computational overhead compared to the scheme based on maximum average similarity.

The main contributions of this paper are as follows:

The performance of existing IoT device traffic fingerprints is evaluated on real 5G IoT traffic datasets and compared with their performance on traditional datasets.
A feature space for the IoT terminal network session based on the transition pattern of packet payload is proposed, which improves the quality of IoT traffic fingerprints in homogenized network traffic.
A density center-based overlap solution is proposed, which significantly improves the calculation speed while maintaining the accuracy of decision making.

2. Background and Problem Statement

IoT device identification based on traffic fingerprints has attracted the attention of many researchers, and some classic methods have been proposed. Most of them combine feature engineering with machine learning or deep learning models. In this section, we review the relevant existing methods and briefly evaluate the differences in their performance on traditional IoT traffic and 5G IoT traffic.

Ahmet Aksoy et al. proposed an IoT device fingerprint named SysID in [3]. It can accurately identify the device type based on a single packet, and a genetic algorithm is used to remove redundant features from the system. The IoT Sentinel dataset [7] is used to evaluate the performance of SysID. Compared with the standard implementation of IoT Sentinel, SysID has a higher average accuracy and uses fewer feature dimensions. However, SysID only extracts statistical features from the header of a single packet, and there is less available information. It is difficult to accurately identify the type of access device when continuously monitoring the device.

Salma Abdalla et al. proposed a device fingerprint based on network flow features in their study [4]. The features used in this method include flow-based features and behavior features, and they are not specific to any particular protocol. Many papers attempt to analyze the packet sequence in the uplink session of the access device, but determining the appropriate length of this sequence is challenging. If the sequence is too short, the accuracy of recognition will decrease, and if it is too long, the real-time performance will decrease. The paper proposed that a better balance between the accuracy and real-time performance of the model can be achieved when the packet sequence length is between 20 and 21. In addition, the paper proposed to focus on three header fields: the TCP payload data offset, the TTL, and the TCP window size. The TCP data offset indicates the position in the segment at which the payload delivered to the upper-layer protocol begins. The TCP window size is the amount of data that the receiving end is willing to accept before an acknowledgment is required, which reflects the device’s memory capacity and processing speed.

In [8], Lefoane et al. proposed a method to detect zombie devices in access devices based on traffic fingerprinting. This method selected features based on the Gini impurity score and K-Means, verifying that reducing feature noise improves the model.

Shahid et al. proposed in [5] to focus on four key features in device fingerprints: the size of the first N packets sent, the size of the first N packets received, and the inter-arrival times between the first N packets sent and received. Based on these four types of features, they compared the performance of several machine learning models, including LR, RF, and Adaboost. It should be noted that these four types of features are common statistical flow features and cannot describe deep behavioral patterns present in the flow.

Sandhya Aneja et al. used Convolutional Neural Networks (CNN) in [9] to analyze the sequence of packet inter-arrival times, which was extracted from the access device within a given time period or length. The sequence is converted into a grayscale image, and the CNN model is used to identify the device type associated with that image. In a simulation environment consisting of an iPad and an iPhone, the accuracy of the CNN model reached 86.7%. However, achieving excellent performance with a deep learning model requires a large amount of training data, which can be challenging in scenarios with strict data privacy management requirements.

Some IoT traffic fingerprints using machine learning models have also been proposed in [10,11]. Due to its simplicity, the Support Vector Machine (SVM) [12], designed for binary classification, has been widely used. A mature implementation of the SVM multi-classifier is the one-vs-all method called OvA-SVM [13]. It consists of n binary SVM classifiers, where n represents the number of labels. During the training of the i-th classifier, the samples of the i-th class are used as positive samples, and all other samples are used as negative samples. However, OvA-SVM has a problem: for a given sample, several binary sub-classifiers may return a positive result at the same time, or every binary sub-classifier may return a negative result. When using one-vs-all multi-classifiers to implement IoT device identification, it is crucial to pay attention to these two special situations.
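To make the special situations of a one-vs-all classifier concrete, the following minimal sketch classifies the joint outcome of the n binary sub-classifiers for a single sample. The function name `ova_outcome`, the label names, and the boolean `votes` input are illustrative assumptions; they are not taken from [13] or from any particular SVM library.

```python
from typing import Dict, List, Tuple


def ova_outcome(votes: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Summarise the decisions of the n binary sub-classifiers for one sample.

    `votes` maps each label to the (hypothetical) boolean decision of the
    sub-classifier trained with that label as the positive class.
    """
    positives = [label for label, is_positive in votes.items() if is_positive]
    if len(positives) == 1:
        return "unique", positives       # exactly one sub-classifier fired
    if len(positives) > 1:
        return "overlap", positives      # several sub-classifiers fired
    return "no_positive", positives      # no sub-classifier fired


print(ova_outcome({"type_a": True, "type_b": False, "type_c": False}))
print(ova_outcome({"type_a": True, "type_b": True, "type_c": False}))
print(ova_outcome({"type_a": False, "type_b": False, "type_c": False}))
```

Only the "unique" case yields a label directly; the other two cases need an additional decision rule.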

After analyzing the aforementioned research, we found that the existing methods demonstrate effective IoT device identification capability in IoT experimental environments built on traditional network technologies. However, it remains uncertain whether this capability carries over to the 5G IoT environment with satisfactory performance. Therefore, we replicated four typical IoT device traffic fingerprints on the 5G IoT terminal network traffic provided by the Shenzhen Power Supply Bureau. These fingerprints are referred to as Literature A [6], Literature B [4], Literature C [5], and Literature D [3]. The details of this replication experiment are explained in the evaluation section. For convenience, the conclusion is stated here directly: when using the same classifier, the performance of the four IoT traffic fingerprints in identifying device types in real 5G IoT terminal traffic was significantly lower than that achieved in traditional IoT scenarios. The deepening of traffic homogeneity is considered to be the reason for the performance degradation. As shown in Figure 1, after the introduction of 5G technology, IoT exhibits the characteristics of virtualization and cloud platforms: the cloud platform uniformly schedules and controls wireless access for IoT terminals via the air interface. Different IoT terminals interact with the cloud platform based on unified specifications, resulting in a high degree of traffic homogeneity.

In addition, in the aforementioned experiment, we observed that the increasing uniformity of traffic from 5G IoT terminals leads to a more frequent occurrence of the overlap phenomenon when using one-vs-all multi-classifiers. The overlap solution greatly restricts the performance of device identification. The term “overlap” refers to the phenomenon where multiple sub-classifiers in the one-vs-all multi-classifier indicate a positive result for the sample being tested. In [6], the researchers proposed an overlap solution based on the maximum average similarity. This solution uses the average similarity between the sample being tested and all training samples of a specific type as the degree to which the sample belongs to that type. The label of the sample being tested is the type with the highest degree of belonging. Such an approach is effective in most cases; however, it has limitations. The overlap solution in [6] has limited applicability to datasets composed of massive training data, because its computational overhead increases significantly as the number of training samples grows. Whenever the overlap phenomenon occurs, the cosine similarity between the sample being tested and all the training samples is calculated, which implies that researchers who aim to enhance the accuracy of the model by using more training samples must accept a higher cost for the overlap solution.
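The maximum-average-similarity rule described above can be sketched as follows. This is a minimal illustration written from the description in this section, not the implementation from [6]; the helper names `cosine` and `max_mean_label` and the toy vectors are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_mean_label(
    sample: Sequence[float],
    train_by_label: Dict[str, List[Sequence[float]]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the label whose training samples are, on average, most similar."""
    scores = {
        label: sum(cosine(sample, x) for x in samples) / len(samples)
        for label, samples in train_by_label.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (made up): two labels with two training vectors each.
train = {"a": [[1.0, 0.0], [1.0, 0.1]], "b": [[0.0, 1.0], [0.1, 1.0]]}
label, scores = max_mean_label([0.9, 0.2], train)
print(label, scores)
```

Each call evaluates one cosine similarity per training sample, which is why the cost of this rule grows with the size of the training set.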

Based on the results of the experimental analysis above, it is evident that the deepening of traffic homogeneity significantly reduces the ability of IoT traffic fingerprints based on traditional flow attribute features to identify 5G IoT terminals. In addition, the deepening homogeneity makes one-vs-all multi-classifiers more likely to encounter overlap, while the existing overlap solution based on the maximum average similarity adapts poorly to datasets composed of massive samples. Thus, two target issues are identified for this study: how to provide more distinguishing features for constructing 5G IoT terminal traffic fingerprints, and how to optimize the overlap solution of one-vs-all IoT traffic fingerprint classification models.

3. Proposed Method

Traffic fingerprints based on traditional traffic features have limited IoT device recognition performance in homogeneous traffic such as 5G. Therefore, we propose a new traffic fingerprint based on the packet payload length transition pattern. In the following, we will cover the proposed traffic fingerprint in detail. Additionally, a density-center-based overlap solution is used to optimize the efficiency of the one-vs-all classifier on traffic fingerprint classification tasks. The related details are also covered in this section.

3.1. IoT Traffic Fingerprint Based on Payload Transition Patterns

In research on network traffic fingerprints of IoT devices in traditional scenarios, the extraction of network traffic features mostly begins with attribute information and statistical values derived from that information, such as packet length or inter-arrival time (IAT). Such features are coarse-grained and insufficient to fully describe the latent patterns manifested in network sessions. Take a car entertainment app and a smartphone app, which both request video resources from the same site, as an example. The payload length of a packet containing the requested resources is large, while the packet payload when idle is small. In the sessions of the two apps, the average and variance of packet payload length are very similar, but the latent patterns are clearly distinct. The network access point of the car app changes as the vehicle moves, so the process of requesting video is intermittent. After receiving a small number of packets containing the requested resources, there is a long idle period due to access point switching, as shown in Figure 2. The network connection of the smartphone app is more stable, allowing for a continuous request process in which multiple packets containing the requested resources are received at shorter time intervals. In this example, relying only on the payload length and its statistical values cannot provide a contextual association between packets in the session. However, this contextual association is crucial because it reflects the significant differences between the access devices.

From the above analysis, it is evident that expressing information about specific latent patterns found in the network sessions of IoT devices solely based on network traffic attributes and their statistical value is challenging. However, the latent pattern is exactly the distinctive feature that is less affected by homogenization. Therefore, the key to achieving accurate IoT terminal network traffic fingerprinting is to establish a correlation between the observable attributes of the network session and the latent patterns. In this study, the correlation is constructed from the perspective of the transition of packet payload length.

The network session can be viewed as a sequence composed of multiple packets. Each packet corresponds to a specific “semantics”. The more complex the “semantics”, the longer the payload required. Packets with similar “semantics” may have similar payload length and entropy. Based on this motivation, a feature that reflects the “semantic” context by describing the packet payload length transition is proposed. As shown in Figure 3, the payload length of each packet in a session is recorded to form the payload length sequence. This sequence is then converted to a sequence of payload marks according to Table 1. For example, if a packet has a payload length of 99 bytes, its mark will be 1. Then, a matrix is initialized in which the entry in row i and column j is the number of times the transition from mark i to mark j occurs in this session. Finally, the feature representing the payload length transition is obtained by flattening this matrix row by row.
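The construction above can be sketched in a few lines. Because Table 1 is not reproduced in this text, the bin boundaries in `HYPOTHETICAL_EDGES` are purely an assumption, chosen only so that a 99-byte payload maps to mark 1 as in the example; the function names are likewise illustrative.

```python
from bisect import bisect_right
from typing import List, Sequence

# ASSUMPTION: placeholder bin edges, not the boundaries of Table 1.
# Mark 0: length 0, mark 1: 1-99, mark 2: 100-499, mark 3: 500-999, mark 4: >= 1000.
HYPOTHETICAL_EDGES = [1, 100, 500, 1000]


def to_marks(lengths: Sequence[int], edges: Sequence[int] = HYPOTHETICAL_EDGES) -> List[int]:
    """Map each payload length to the index of the bin it falls into."""
    return [bisect_right(edges, length) for length in lengths]


def transition_feature(lengths: Sequence[int], edges: Sequence[int] = HYPOTHETICAL_EDGES) -> List[int]:
    """Count mark-to-mark transitions in a session and flatten the matrix by row."""
    k = len(edges) + 1                       # number of distinct marks
    matrix = [[0] * k for _ in range(k)]
    marks = to_marks(lengths, edges)
    for src, dst in zip(marks, marks[1:]):   # consecutive packet pairs
        matrix[src][dst] += 1
    return [count for row in matrix for count in row]


session = [0, 99, 99, 1200]                  # made-up payload lengths in bytes
print(to_marks(session))
print(transition_feature(session))
```

A session with m packets contributes m - 1 transitions, and the feature length is fixed at k² regardless of session length, which keeps it compatible with fixed-width classifiers.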

It should be noted that the proposed feature based on packet payload length transition patterns should be used in conjunction with traditional attribute information-based features as a complement to the traditional IoT terminal traffic fingerprints. This will enhance the accuracy of terminal traffic fingerprint identification in homogeneous IoT traffic. Finally, we have determined the network traffic features to adopt, which are shown in Table 2.

3.2. Overlap Solution Based on Density Center

The effectiveness of the one-vs-all classifier in IoT traffic fingerprint identification has been proven. The most straightforward way to improve the performance is by training with more data. However, as the training data increases, the phenomenon of overlap becomes inevitable, which means that multiple sub-classifiers may report a positive result for the same sample being tested. Researchers proposed a method to address overlap [6], the essence of which is to measure the similarity between the sample being tested and each known type. According to this method, the average cosine similarity between the sample being tested and all samples of a certain known type is used as the measure of similarity between the sample being tested and the known type. The more training samples are used, the greater the computational complexity. In addition, we found that the training data from homogeneous IoT traffic is often non-convex, and it is necessary to improve the availability of overlap solutions in this case.

In light of the issues that arise when using maximum average similarity to address overlap, we analyzed the approach and sought to improve it. Within non-convex datasets, training data with the same label may contain multiple sub-concepts, and its sample points will be dispersed around multiple “density centers” in the feature space. These “density centers” can be calculated using density-based clustering algorithms. The higher the similarity between a testing sample and a certain type, the greater the variance of the angles between the vectors in the cluster composed of the testing sample and the “density centers” of that type. Inspired by this, we propose an overlap solution based on the density center to calculate the degree of belonging of the test sample to each known type. The density centers are calculated with the method from [14].

As shown in Algorithm 1, before training the one-vs-all multi-classifier, a density-based clustering algorithm [14] is run to obtain the density centers for each type of training data. When an overlap occurs in the testing phase, the degree of belonging of the testing sample to a specific type is expressed as the variance of the cosine similarity between the vectors from the cluster composed of the testing sample and the density centers of the current type. The larger this variance, the higher the similarity between the test sample and the specified type. The type with the highest degree of belonging is used as the label of the test sample.

Algorithm 1 Overlap Solution Using Density Center

$n$: the number of labels in the training set
$S_{i}$: all samples of the $i$-th label in the training set
$f$: the overlap sample to be tested
$dc_{i}^{j}$: the $j$-th density center of all samples with the $i$-th label, calculated by the density-based clustering algorithm
$BD_{i}$: the belonging degree of the current test sample to the $i$-th label

PREPARATION
for $i$ in $1, \dots, n$:
  $\{dc_{i}^{1}, dc_{i}^{2}, \dots, dc_{i}^{j}, \dots\}$ ← $\mathrm{clusteralgorithm}(S_{i})$

OVERLAP SOLUTION $(f)$
$BDS$ ← $[\,]$
for $i$ in $1, \dots, n$:
  $V_{i}$ ← $\{\langle dc_{i}^{1}, f \rangle, \langle dc_{i}^{2}, f \rangle, \dots, \langle dc_{i}^{j}, f \rangle, \dots\}$
  $VS_{i}$ ← the cosine similarity of vectors within $V_{i}$
  $BD_{i}$ ← $\mathrm{variance}(VS_{i})$
  append $BD_{i}$ to $BDS$
predictedlabel ← $\mathrm{argmax}(BDS)$
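The OVERLAP SOLUTION step of Algorithm 1 can be sketched in Python under one possible reading, which is an assumption rather than the confirmed method: each ⟨dc, f⟩ is taken to be the vector from the test sample f to a density center, and the cosine similarities are computed between every pair of these vectors. The clustering step from [14] is not reproduced; the density centers are assumed to be precomputed, and all names and toy coordinates are illustrative.

```python
import math
from itertools import combinations
from statistics import pvariance
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def belonging_degree(f: Sequence[float], centers: List[Sequence[float]]) -> float:
    """Variance of pairwise cosine similarities between the vectors f -> dc.

    ASSUMPTION: this pairwise, difference-vector reading of V_i and VS_i is
    one interpretation of Algorithm 1, not a confirmed detail.
    """
    vectors = [[c - x for c, x in zip(dc, f)] for dc in centers]
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return pvariance(sims) if sims else 0.0   # fewer than two centers: no spread


def resolve_overlap(
    f: Sequence[float],
    centers_by_label: Dict[str, List[Sequence[float]]],
) -> Tuple[str, Dict[str, float]]:
    """Return the label with the largest belonging degree, plus all degrees."""
    degrees = {label: belonging_degree(f, cs) for label, cs in centers_by_label.items()}
    return max(degrees, key=degrees.get), degrees


# Toy centers (made up): label "a" surrounds the sample, label "b" lies far away.
centers = {
    "a": [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]],
    "b": [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0], [11.0, 11.0]],
}
print(resolve_overlap([0.0, 0.0], centers))
```

Under this reading, a sample surrounded by the centers of a label sees them at widely varying angles (large variance), while a distant sample sees them in nearly the same direction (small variance). The work per overlap depends on the number of density centers, not on the number of training samples.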

4. Evaluation

In this section, real 5G IoT terminal traffic from the Shenzhen Power Supply Bureau is used for evaluation experiments. On this dataset, the proposed IoT terminal traffic fingerprint based on packet payload transition patterns is compared with traditional methods based on traffic attribute features, and the proposed overlap solution based on the density center is also evaluated in detail.

4.1. Experimental Setup and Datasets

The data used came from the uplink traffic of energy metering terminals connected via the 5G network in the wireless access area of the Shenzhen Power Supply Bureau. This dataset contains the uplink session traffic between three types of metering terminals (TTU, LMT, and LVMR) and the master station from 1 March 2021 to 7 March 2021. The specific functions of these metering terminals and the quantity distribution of available samples are shown in Table 3. It should be noted that due to the management requirements of the Shenzhen Power Supply Bureau, we are not allowed to disclose the original .pcap files, but the extracted features can be used for academic and other non-commercial purposes. Interested researchers are welcome to contact us via email.

The number of samples of the LMT terminal is significantly higher than that of the other two types. If the weighted average method is used to calculate the precision and recall of multi-classification results, the result is easily dominated by the majority class. Given the uneven distribution of samples in the 5G power grid terminal uplink traffic dataset, each device type should be treated equally. Therefore, in the experiments of this study, macro-avg precision and macro-avg recall are used to evaluate the proposed traffic fingerprint and overlap solutions.
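The difference between the two averaging schemes can be seen in a small sketch. The label counts below are made up for illustration and are not the sample distribution of Table 3; the function names are assumptions as well.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)}, using 0.0 when a ratio is undefined."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (precision, recall)
    return scores


def macro_average(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Unweighted mean over classes: every device type counts equally."""
    scores = per_class_scores(y_true, y_pred)
    n = len(scores)
    return (sum(p for p, _ in scores.values()) / n,
            sum(r for _, r in scores.values()) / n)


def weighted_average(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Support-weighted mean: classes with more true samples count more."""
    scores = per_class_scores(y_true, y_pred)
    support = Counter(y_true)
    total = len(y_true)
    return (sum(scores[c][0] * support[c] for c in scores) / total,
            sum(scores[c][1] * support[c] for c in scores) / total)


# Made-up imbalanced example: a classifier that always predicts the majority class.
y_true = ["LMT"] * 8 + ["TTU"] + ["LVMR"]
y_pred = ["LMT"] * 10
print(macro_average(y_true, y_pred))
print(weighted_average(y_true, y_pred))
```

In this toy case the weighted recall is 0.8 even though two of the three device types are never recognized, whereas the macro-average recall drops to one third.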

4.2. Evaluation of Typical IoT Traffic Fingerprint Based on Attribute Features

References [3,4,5,6] describe four typical IoT terminal traffic fingerprints that have been proposed in recent years. In this section, they are referred to as Literature A [6], Literature B [4], Literature C [5], and Literature D [3]. These four methods are reproduced on the IoT Sentinel Dataset and on the 5G IoT terminal traffic data from the Shenzhen Power Supply Bureau, and the device identification performance of the same traffic fingerprint in traditional IoT and 5G IoT is compared. To ensure the fairness and objectivity of the experiment, all experiments are conducted with three classifiers: Decision Tree, Random Forest, and Adaboost. Tree-structured classifiers are more adaptable to datasets with unbalanced distributions and are therefore a better fit for the 5G IoT traffic datasets. To ensure credibility, all results are averaged over 30 experiments.

According to the results obtained, the performance of the same traffic fingerprint in the traditional scenario (IoT Sentinel Dataset) is better than that in the 5G scenario (5G IoT Terminal Dataset). For example, as shown in Figure 4, when the traffic fingerprint in Literature A uses the Decision Tree as the classifier, its Macro-Avg Precision and Macro-Avg Recall in the traditional scenario are nearly 20% higher than those in the 5G scenario. Of course, there are also some cases where the performance difference between the two scenarios is relatively small. For example, when using the Adaboost as the classifier, the Macro-Avg Recall of the traffic fingerprint in Literature D in the 5G scenario is very close to that in the traditional scenario, but at this time, its Macro-Avg Precision in 5G scenarios is obviously inferior to that in traditional scenarios, as shown in Figure 5.

Furthermore, the results of the four classical IoT traffic fingerprints in real 5G IoT terminal traffic were analyzed in detail. Among the existing methods, those that can perform relatively well in 5G IoT terminal traffic were selected as the baseline for subsequent experiments. In analyzing 5G IoT terminal traffic, the Decision Tree classifier showed that the traffic fingerprints from Literature A and Literature C outperformed the other two. This trend continued even when using the Random Forest, as shown in Figure 6, although Literature B had a better Macro-Avg Recall in real 5G IoT terminal traffic than Literature C. However, Literature B’s disadvantage in Macro-Avg Precision outweighed this advantage. Adaboost classifiers showed slightly inferior results for the traffic fingerprint from Literature C compared to Literature B. However, this difference was not observed when using the Decision Tree and Random Forest as classifiers. Therefore, the traffic fingerprints from Literature C and Literature A were chosen as the baseline for subsequent experiments to evaluate the proposed traffic fingerprint.

4.3. Evaluation of the Proposed IoT Traffic Fingerprint

According to the experimental results and analysis in the previous subsection, the IoT traffic fingerprints from Literature A [6] and Literature C [5] are more prominent and stable in 5G IoT terminal traffic, so they are used as baselines here and referred to as Feature Space A and Feature Space B, respectively. The features of the traffic fingerprint proposed in this study are referred to as the Proposed Feature Space. The data are split into a training set and a test set in a 7:3 ratio, the 5G IoT terminal network traffic is mapped to each of these three feature spaces, and the same classifier model is used for each feature space. To enhance the credibility of the results, the comparative experiments are conducted under different session sampling lengths, where N represents the first N packets sampled from each 5G IoT terminal session in order to build features. All evaluation indicators are average values calculated over 30 rounds of repeated experiments.

It should be noted that the selection of classifiers for evaluating feature quality is well founded. Tree-structured classifiers are highly practical and more suitable for datasets where the label distribution is not completely balanced, which is consistent to a certain extent with the distribution of the dataset used in this study. Therefore, Decision Tree, Random Forest, and AdaBoost are used as classification models to evaluate the quality of the proposed IoT terminal network traffic features.

There are indeed differences in the performance of the same classifier in different feature spaces, which is clearly reflected in the results of the three selected classifiers. For each selected classifier, the performance curve representing the Proposed Feature is always above the performance curve representing the baselines.

As shown in Figure 7, as the number of packets sampled in each session increases, the Decision Tree achieved relatively stable performance in the Proposed Feature Space, and its Macro-Avg Precision and Macro-Avg Recall are both stable at around 80%. Moreover, the fluctuations of these two indicators are not obvious as N increases. Overall, as the value of N increases, the performance of Decision Tree, Random Forest, and Adaboost in Feature Space B improves slightly, which may be because there are more feature entries in Feature Space B that directly depend on the attribute features of uplink and downlink packets, and a longer sampling length can provide more available information.

As shown in Figure 8, the Macro-Avg Precision of Random Forest in the Proposed Feature Space can be maintained at around 90% under different sampling lengths, and its Macro-Avg Recall can also be stabilized at around 80%, which is significantly higher than that of Decision Tree in the Proposed Feature Space.

The performance of Adaboost in the three feature spaces is shown in Figure 9. The performance curve representing the Proposed Feature Space is also obviously above the performance curves corresponding to the other two feature spaces, as shown by the Decision Tree and Random Forest. It is worth noting that when the value of N increases from 40 to 45, the performance of Adaboost in Feature Space A drops significantly, and its Macro-Avg Precision and Macro-Avg Recall are significantly reduced. This phenomenon is also reflected when using the Decision Tree and Random Forest as classifiers, which may be because the feature entries in Feature Space A have a greater correlation with the statistical values of the packet sequence attribute features, while the uplink packets and downlink packets of the 5G power grid terminal are unevenly distributed, and the zero padding method used to align the sampling sequence introduced errors.
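The possible effect of zero padding mentioned above can be illustrated with a small sketch. The padding routine and the payload lengths are assumptions made for illustration; the exact alignment procedure used in the experiments is not specified in this text.

```python
from statistics import mean
from typing import List, Sequence


def first_n_zero_padded(values: Sequence[float], n: int) -> List[float]:
    """Return the first n values, right-padded with zeros if fewer exist."""
    head = list(values[:n])
    return head + [0.0] * (n - len(head))


short_session = [100.0, 200.0, 300.0]   # made-up payload lengths
padded = first_n_zero_padded(short_session, 5)
print(padded)                 # [100.0, 200.0, 300.0, 0.0, 0.0]
print(mean(short_session))    # 200.0
print(mean(padded))           # 120.0
```

When a session or direction has fewer than N packets, the padded zeros pull statistics such as the mean downward, so features built on those statistics can drift as N grows.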

Based on the experimental results of this section, it can be concluded that the same classifier model achieves a more accurate classification of real 5G IoT terminals in our proposed feature space, which supports the effectiveness of the proposed features based on packet payload transition patterns. It is worth noting that Random Forest performs significantly better than the other classifiers in all three feature spaces (the Proposed Feature Space, Feature Space A, and Feature Space B). This is why we choose Random Forest as the base classifier for building the one-vs-all multi-classifiers used to evaluate the proposed overlap solution.

4.4. Evaluation of Overlap Solution Based on the Density Center

To compare overlap solutions fairly, we evaluate them within the same one-vs-all multi-classifier. According to the experimental results in the previous subsection, Random Forest performs well in multiple feature spaces, so it is used as the base classifier of the one-vs-all model. For convenience, the names of the overlap solutions are abbreviated. The overlap solution from [6] uses the average similarity between the testing sample and all samples of each label as the degree of belonging, so it is denoted MM-OvA-Random Forest (Max–Mean similarity-based overlap solution). The overlap solution proposed in this study uses the intra-cluster vector angle variance of the cluster composed of the testing sample and the density center of each label as the degree of belonging, so it is denoted DC-OvA-Random Forest (Density-Center-based overlap solution).
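The Max-Mean rule can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure, which the text above does not specify, and all function names and toy vectors are hypothetical. It is not the implementation from [6].

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def mean_similarity(sample, members):
    """Average similarity between the sample and ALL training samples of a label."""
    return sum(cosine_similarity(sample, m) for m in members) / len(members)


def resolve_overlap_max_mean(sample, samples_by_label, candidate_labels):
    """Among the labels claimed by several binary classifiers, pick the one
    whose training samples are, on average, most similar to the sample."""
    return max(
        candidate_labels,
        key=lambda label: mean_similarity(sample, samples_by_label[label]),
    )


# Toy data (made up): two labels both claimed the same testing sample.
samples_by_label = {
    "TTU": [(2.0, 0.1), (3.0, 0.0)],
    "LVMR": [(0.0, 1.0), (0.1, 2.0)],
}
chosen = resolve_overlap_max_mean((1.0, 0.0), samples_by_label, ["TTU", "LVMR"])
print(chosen)
```

Note that each call touches every training sample of every candidate label, which is why the cost of this rule grows with the size of the training set.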

The two overlap solutions are compared at different sampling lengths, and the results are shown in Figure 10. DC-OvA-Random Forest performs slightly better than MM-OvA-Random Forest, for example at N = 20, 25, 30, and 35. However, when N = 5, the Macro-Avg Precision of DC-OvA-Random Forest is lower than that of MM-OvA-Random Forest, and when N = 40 and 45, the two methods perform almost identically. Note that the two schemes use the same base sub-classifiers and differ only in the overlap solution. The number of overlaps encountered is therefore the same in both schemes, and the prediction time on the test set reflects the computational overhead of the overlap solution.

The time taken by the two overlap solutions to complete prediction in the three feature spaces is recorded in Table 4. For the same number of overlaps, DC-OvA-Random Forest is much faster in every feature space. This is expected: it involves far fewer vectors in the calculation of belongingness, because each testing sample is compared with the density centers of a label rather than with all samples of that label.
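As a quick check on the size of the gap, the ratios of the prediction times in Table 4 can be computed directly. The snippet below only restates the table values; the variable names are illustrative.

```python
# Prediction times in seconds from Table 4: (MM-OvA-Random Forest, DC-OvA-Random Forest).
prediction_time_s = {
    "Feature Space A": (123.64, 15.89),
    "Feature Space B": (165.78, 23.43),
    "Proposed Feature Space": (203.89, 32.85),
}

# Speedup = time of the Max-Mean solution divided by time of the density-center solution.
speedups = {name: mm / dc for name, (mm, dc) in prediction_time_s.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x faster")
```

The density-center solution is thus roughly six to eight times faster in each feature space.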

In summary, DC-OvA-Random Forest achieves classification performance that is overall no weaker than that of MM-OvA-Random Forest and is faster in all three feature spaces, which shows that using the density center in the belongingness calculation gives a better balance between computing efficiency and performance.

5. Discussion

In this study, we proposed a traffic fingerprint based on packet payload transition patterns for homogeneous IoT traffic. Rather than relying only on the attribute information of the network session as features, it further characterizes the transition patterns in the packet payloads of sessions from different IoT terminals. Compared with traditional traffic fingerprints, the proposed method better differentiates terminal types in scenarios such as 5G IoT, where network traffic is highly homogeneous.

In addition, the existing overlap solution applies poorly to sparse data distributions and has high computational overhead. To address these shortcomings, we proposed a fast sample belongingness calculation method based on the density center, which can serve as an overlap solution for one-vs-all multi-classifiers. The density centers of each known type are calculated before training. When an overlap occurs, the variance of the angles in the cluster composed of the overlapping sample and the density centers of the current type is taken as the sample's degree of belonging to that type. The type with the highest degree of belonging becomes the label of the overlapping sample.
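One possible reading of this rule is sketched below. It assumes that the angles are measured at the overlapping sample, between the vectors pointing to each pair of density centers, in the manner of angle-based outlier scores: a sample surrounded by the centers sees widely varying angles, while a sample far outside sees nearly identical ones. This interpretation, the function names, and the toy centers are assumptions, and the density centers are taken as given rather than computed.

```python
import math
from itertools import combinations
from statistics import pvariance


def angle_between(u, v):
    """Angle in radians between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norms == 0:
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def angle_variance(sample, centers):
    """Variance of the angles, seen from the sample, between every pair of centers."""
    offsets = [[c - s for c, s in zip(center, sample)] for center in centers]
    angles = [angle_between(u, v) for u, v in combinations(offsets, 2)]
    return pvariance(angles) if angles else 0.0


def resolve_overlap_density_center(sample, centers_by_label, candidate_labels):
    """Pick the candidate label with the highest angle variance (assumed
    here to be the degree of belonging)."""
    return max(
        candidate_labels,
        key=lambda label: angle_variance(sample, centers_by_label[label]),
    )


# Toy density centers (made up): the sample lies inside the TTU centers
# and far outside the LVMR centers.
centers_by_label = {
    "TTU": [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)],
    "LVMR": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
}
label = resolve_overlap_density_center((0.0, 0.0), centers_by_label, ["TTU", "LVMR"])
print(label)
```

Under this reading, the cost per overlapping sample depends only on the number of density centers per type, not on the number of training samples.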

According to the experimental results, in the 5G IoT terminal (TTU, LMT, and LVMR) identification task, the tree structure-based classifier using the traffic features from this study reached a Macro-Avg Precision of 92% and a Macro-Avg Recall of 85%, significantly better than the baseline. With the parameter settings unchanged, replacing the input features alone brings a significant improvement in classifier performance. This is consistent with a consensus among machine learning practitioners: the quality of the features determines the upper limit of a classification model, and the classifier can only approach this upper limit. We believe that the experiments in this study demonstrate the effectiveness of the proposed IoT traffic fingerprint based on packet payload transition patterns in homogeneous scenarios such as 5G IoT. When one-vs-all multi-classifiers encounter overlap in the IoT traffic fingerprint identification task, the density center-based overlap solution proposed in this study better balances performance and computational overhead. Taking one-vs-all Random Forest in the feature space from this study as an example, for the same test set size, the density center-based overlap solution reduces the calculation time of the prediction process from 203.89 s to 32.85 s while maintaining recognition accuracy.

Although the network traffic fingerprint proposed in this study performs better in the homogeneous scenario than approaches designed for traditional scenarios, it still has certain limitations. Like most other studies, this study treats IoT traffic fingerprint identification as a closed-set multi-classification problem. This formulation implicitly assumes that the testing process contains no unknown samples, that is, that every possible type of sample has appeared in the training set, which is unrealistic in practice. Therefore, we believe that future IoT traffic fingerprints can be combined with open-set recognition and concept drift detection technologies to improve the perception of unseen samples and to construct a more practical IoT fingerprint model.

Author Contributions

Conceptualization, all authors; methodology, M.F. and J.G.; software, M.F. and W.S.; validation, Y.H. and W.S.; formal analysis, Y.H. and W.S.; investigation, M.F. and W.S.; resources, Y.L.; data curation, W.S.; writing—original draft preparation, M.F.; writing—review and editing, M.F. and Y.L.; visualization, W.S. and Y.H.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Programmes of China No. 2021YFB3101900.

Data Availability Statement

The real IoT traffic data used in Section 4 are from the wireless access area of the Shenzhen Power Supply Bureau and can be obtained by contacting the authors via e-mail.

Acknowledgments

This work was supported by the Key Laboratory of Trustworthy Distributed Computing and Service (BUPT). We thank the Shenzhen Power Supply Bureau for providing network traffic data from real 5G IoT terminals.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Pacheco, J.; Hariri, S. IoT Security Framework for Smart Cyber Infrastructures. In Proceedings of the 2016 IEEE 1st International Workshops on Foundations and Applications of Self* Systems (FAS*W), Augsburg, Germany, 12–16 September 2016.
2. Zhou, Y. Research on the Application of 5G Communication Technology in the Development of Computer Internet of Things. In Proceedings of the International Conference on Computers, Information Processing and Advanced Education (CIPAE), Ottawa, ON, Canada, 26–28 August 2022.
3. Aksoy, A.; Gunes, M.H. Automated IoT Device Identification using Network Traffic. In Proceedings of the ICC 2019 - 2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019.
4. Hamad, S.A.; Zhang, W.E.; Sheng, Q.Z.; Nepal, S. IoT Device Identification via Network-Flow Based Fingerprinting and Learning. In Proceedings of the 2019 18th IEEE International Conference On Trust, Security and Privacy in Computing And Communications/13th IEEE International Conference on Big Data Science And Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, 5–8 August 2019.
5. Shahid, M.R.; Blanc, G.; Zhang, Z.; Debar, H. IoT Devices Recognition Through Network Traffic Analysis. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018.
6. Song, Y.; Huang, Q.; Yang, J.; Fan, M.; Hu, A.; Jiang, Y. IoT device fingerprinting for relieving pressure in the access control. In Proceedings of the ACM Turing Celebration Conference - China (ACM TURC’19), New York, NY, USA, 17 May 2019.
7. Miettinen, M.; Marchal, S.; Hafeez, I.; Asokan, N.; Sadeghi, A.-R.; Tarkoma, S. IoT SENTINEL: Automated Device-Type Identification for Security Enforcement in IoT. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017.
8. Lefoane, M.; Ghafir, I.; Kabir, S.; Awan, I.-U. Unsupervised Learning for Feature Selection: A Proposed Solution for Botnet Detection in 5G Networks. IEEE Trans. Ind. Inform. 2023, 19, 921–929.
9. Jafari, H.; Omotere, O.; Adesina, D.; Wu, H.-H.; Qian, L. IoT Devices Fingerprinting Using Deep Learning. In Proceedings of the MILCOM 2018 - 2018 IEEE Military Communications Conference (MILCOM), Los Angeles, CA, USA, 29–31 October 2018.
10. Formby, D.; Srinivasan, P. Who’s in Control of Your Control System? Device Fingerprinting for Cyber-Physical Systems. In Proceedings of the NDSS’16, San Diego, CA, USA, 21–24 February 2016.
11. Yang, K.; Li, Q.; Sun, L. Towards automatic fingerprinting of IoT devices in the cyberspace. Comput. Netw. 2019, 148, 318–327.
12. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
13. Hsu, C.W.; Lin, C.J. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 2002, 13, 415–425.
14. Zhang, T.; Zhou, M.; Guo, X.; Qi, L.; Abusorrah, A. A Density-Center-Based Automatic Clustering Algorithm for IoT Data Analysis. IEEE Internet Things J. 2022, 9, 24682–24694.

Figure 1. Typical 5G IoT scenario architecture.

Figure 2. Resource request process of car app and mobile app.

Figure 3. IoT terminal network traffic feature based on packet payload length transition patterns. The red arrow represents a payload mark switch, and the number in the red circle represents the number of times that switch occurred.

Figure 4. Performance comparison of the same IoT traffic fingerprints in 5G and non-5G scenarios using Decision Tree as the classifier.

Figure 5. Performance comparison of the same IoT traffic fingerprints in 5G and non-5G scenarios using Adaboost as the classifier.

Figure 6. Performance comparison of the same IoT traffic fingerprints in 5G and non-5G scenarios using Random Forest as the classifier.

Figure 7. IoT terminal identification performance of Decision Tree in different IoT traffic feature spaces.

Figure 8. IoT terminal identification performance of Random Forest in different IoT traffic feature spaces.

Figure 9. IoT terminal identification performance of Adaboost in different IoT traffic feature spaces.

Figure 10. Performance comparison of different overlap solutions.

Table 1. Packet payload length mapping table.

The Range of Packet Payload Length	Packet Payload Length Mark
$[0, 100)$	1
$[100, 200)$	2
$[200, 300)$	3
$[300, 400)$	4
$[400, 500)$	5
$[500, 600)$	6
$[600, 700)$	7
$[700, 800)$	8
$[800, 900)$	9
$[900, 1000)$	10
$[1000, 1100)$	11
$[1100, 1200)$	12
$[1200, 1300)$	13
$[1300, 1400)$	14
$[1400, 1550)$	15
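
The mapping in Table 1 and the transition counting it feeds can be sketched as follows. This is a minimal illustration under stated assumptions: payload lengths are integers in bytes, lengths outside [0, 1550) are rejected, consecutive packets with the same mark are counted on the diagonal, and the matrix holds raw counts without normalization. The function names are hypothetical.

```python
def payload_mark(length):
    """Map a payload length to its mark (1..15) following Table 1."""
    if not 0 <= length < 1550:
        raise ValueError(f"payload length {length} is outside [0, 1550)")
    # Bins are 100 wide, except the last one, [1400, 1550), which maps to 15.
    return min(length // 100 + 1, 15)


def transition_matrix(lengths, size=15):
    """Count transitions between the marks of consecutive packets in a session."""
    matrix = [[0] * size for _ in range(size)]
    marks = [payload_mark(n) for n in lengths]
    for src, dst in zip(marks, marks[1:]):
        matrix[src - 1][dst - 1] += 1  # row = previous mark, column = next mark
    return matrix


# Toy session (made-up payload lengths).
session = [60, 150, 150, 1460, 60]
counts = transition_matrix(session)
print([payload_mark(n) for n in session])
```

Flattening the resulting matrix row by row gives the $15 \times 15$ part of the fingerprint.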

Table 2. Features used for constructing IoT traffic fingerprint.

Feature	Dimension
The representation of how packet payload length changes within a session, which reflects the deep “semantics” pattern	$15 \times 15$
The mean, variance, maximum, minimum, and sum of the packet payload length in the session	5
The payload length of the first N packets in the network session	N
The interval between the first N packets in the network session	$N - 1$
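
Taken together, the rows of Table 2 give a vector of $225 + 5 + N + (N - 1) = 2N + 229$ entries. The sketch below assembles such a vector under several assumptions that the table does not settle: statistics are computed over the whole session with the population variance, sessions shorter than N packets are zero-padded, and intervals are differences between consecutive packet timestamps. The function name and its parameters are hypothetical.

```python
from statistics import mean, pvariance


def build_fingerprint(transition_counts, lengths, timestamps, n):
    """Concatenate the four feature groups of Table 2 into one flat vector.

    transition_counts: flattened 15 x 15 transition matrix (225 values)
    lengths: payload lengths of all packets in the session (non-empty)
    timestamps: arrival times of the same packets, in seconds
    n: number of leading packets to sample
    """
    if len(transition_counts) != 225:
        raise ValueError("expected a flattened 15 x 15 matrix")
    stats = [mean(lengths), pvariance(lengths), max(lengths), min(lengths), sum(lengths)]
    # First n payload lengths, zero-padded for short sessions (assumption).
    first_lengths = (list(lengths[:n]) + [0] * n)[:n]
    # n - 1 intervals between the first n packets, zero-padded likewise.
    times = timestamps[:n]
    intervals = [b - a for a, b in zip(times, times[1:])]
    intervals = (intervals + [0] * (n - 1))[: n - 1]
    return list(transition_counts) + stats + first_lengths + intervals


# Toy session with three packets, sampled with N = 5.
vector = build_fingerprint([0] * 225, [60, 150, 1460], [0.0, 0.5, 0.7], n=5)
print(len(vector))
```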

Table 3. Description of the traffic data from power grid metering terminals.

Label	Description	Number of Flows
LMT	Terminals used for on-site service and management, with functions such as remote meter reading, energy monitoring	10,121
TTU	Monitor and record the operating conditions of distribution transformers	838
LVMR	Receive and forward the commands from the master station to collect and control data from electric energy meters	2890

Table 4. Prediction time of different overlap solutions.

Feature Space	MM-OvA-Random Forest	DC-OvA-Random Forest
Feature Space A	123.64 s	15.89 s
Feature Space B	165.78 s	23.43 s
Proposed Feature Space	203.89 s	32.85 s

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
