1. Introduction
Fuzzing is a widely used security technique for discovering vulnerabilities in network protocols. It sends a series of test files containing random or faulty data to a software system that implements a specific protocol and observes software exceptions to detect vulnerabilities in the protocol.
Currently, there are two main kinds of fuzzing techniques: mutation-based and generation-based fuzzing [1]. The former generates test files by injecting random or faulty data into sample messages (a message is the basic data unit exchanged between processes of an application-layer protocol), while the latter constructs fault-injected messages as test files based on a specific protocol specification. Mutation-based fuzzers, such as FileFuzz (http://www.securiteam.com/tools/5PP051FGUE.html) and SPIKEfile (https://www.ee.oulu.fi/research/ouspg/SPIKEfile), suffer from a serious problem: too many fault-injected messages are required to maintain high test coverage. The number of fault-injected files is ${256}^{L}$, where the exponent L is the length of the sample message in bytes, and it would take a tremendously long time to handle so many fault-injected test files, especially when L is large. In practice, a protocol's software system parses inputs according to their format and treats any file that does not obey the format rules as invalid input, in which case the software system throws an error and quits before it reaches the faulty segment(s). Therefore, many of the fault-injected test files are not necessary for a successful fuzzing test. Generation-based fuzzers, such as PROTOS (https://www.ee.oulu.fi/roles/ouspg/Protos), generate test files by considering the format of input messages. One advantage of such tools is that they greatly reduce the number of test files with almost no sacrifice in test coverage [2]. However, one has to figure out the message formats and configure the generation-based fuzzer accordingly. Currently, message formats are mainly collected or analyzed manually, which is a time-consuming and error-prone process. To address these issues, protocol reverse engineering [3] is introduced to obtain the protocol specification automatically. The protocol specification, including the message format, is a set of rules that describe or model a network protocol. A field-based fault-injected message generation procedure guided by the message format is then applied to create fuzzing test files.
A protocol message is treated as a byte sequence that can be divided into a sequence of fields. A keyword field usually holds a command, operator or state code of the protocol, while a data field is a variable subsequence whose content changes from message to message, such as the value of a communication parameter. Generally, the message format is recovered by identifying all fields in the byte sequences. However, locating field boundaries and identifying the fields in a message is a great challenge, since a priori information about them is usually not available. The byte sequence of a protocol message is assumed to obey an underlying stochastic process in which different fields have their own symbol distributions and the changepoints are the field boundaries. Each changepoint thus marks the end of one field and the start of another. Under these assumptions, field boundary detection is essentially a multi-changepoint detection problem, which can be addressed using changepoint detection [4], a technique widely used in time series analysis. Once the changepoints are localized, messages are divided into field sequences. However, the types of the fields are still uncertain. Thus, a further inference procedure, named position-based occurrence probability test analysis, is proposed to determine the field type (keyword field or data field). First, fields whose occurrence probability is approximately zero are classified as data fields. Then, the remaining ones are further processed in a position-based statistic test. Specifically, a reference position is selected for every field, and each field is tested with a binomial test to determine whether its position equals the reference position with probability 1, given a significance level $\alpha $. The fields passing these tests are chosen as keyword fields, while the remaining ones are considered uncertain fields.
2. Related Work
Recently, security and privacy issues in the Internet of Things have attracted considerable research interest [5,6,7,8,9,10]. In particular, the analysis of applications and protocols in real-time network traffic monitoring is a fundamental and critical building block in network management and security systems for IoT infrastructures [11,12,13,14,15]. In this section, we review recent work on application-layer network protocol vulnerability analysis and detection.
Fuzzing helps protocol vulnerability detection achieve a higher benefit-to-cost ratio with little or no increase in computational complexity. It aims to reveal bugs in protocols that could be exploited by an adversary to launch attacks or activate malicious code. Research on network protocol fuzzing is currently a hot topic in network security. AutoFuzz [16] identified the variable parts of sample messages and fuzzed a protocol implementation by sending messages with invalid symbols or messages. AspFuzz [17] leveraged the accessible protocol specifications in RFCs (Requests for Comments) to generate fault-injected messages as test files. Then, AspFuzz sent both anomalous and reordered messages to discover vulnerabilities. SecFuzz [18] focused on fuzzing security protocol implementations, but it did not consider the specification of the target protocol either. Zhao et al. [19] used a regression finite state machine to infer the state transition diagram of a protocol so as to reveal potential vulnerabilities in wireless protocols.
In recent years, a range of works on protocol reverse engineering [20] have been published. As early as 2005, Marshall A. Beddoe started the Protocol Informatics project [21] and applied bioinformatics alignment algorithms to identify the fields in packets. Cui et al. went further than Beddoe and presented Discoverer [22], which recovers the protocol message format using both sequence alignment and recursive clustering. However, Discoverer needs some a priori information about the delimiters used by the protocol, such as spaces and commas, to help tokenization, i.e., breaking a message into a token sequence. More recently, Tao et al. [23] combined hierarchical clustering, multi-sequence alignment and a Bayesian decision model to determine the field boundaries of binary protocols at bit granularity. Chen et al. [24] introduced a deep learning algorithm to analyze mobile applications. Xiao et al. [25] proposed a method based on heuristic rules to reverse-analyze incomplete flows. In our approach, we make no assumption about the delimiters. We treat the byte sequence of a message as a stochastic process and detect field boundaries according to their statistical properties.
As a parallel approach to understanding unknown protocols, binary analysis-based techniques, such as Polyglot [26], Tupni [27], AutoFormat [28], Prospex [29] and Dispatcher [30], have also drawn much research attention. They are practical in the special scenarios where the binary code is available and executable in a specific sandbox-like environment. Moreover, binary analysis would fail if programs use techniques such as obfuscation to keep themselves from being reverse-engineered.
As in many other security application domains [31,32,33,34,35,36], data mining and machine learning techniques have been widely adopted in the domain of IoT security and IoT traffic analysis. One of the key challenges is the data privacy problem, especially in collaborative and cloud-based learning scenarios. Several recent studies have proposed novel privacy-preserving approaches for addressing the problem [37,38,39,40,41,42].
3. Problem Formulation
Suppose that the alphabet used by protocol messages is defined as $\mathsf{\Sigma}=\left\{0\mathtt{x}00,0\mathtt{x}01,0\mathtt{x}02,\dots ,0\mathtt{x}FF\right\}$. A string $\omega $ is defined as a finite ordered sequence of letters in $\mathsf{\Sigma}$, that is, $\omega ={a}_{1}{a}_{2}\dots {a}_{n}$ (${a}_{1},{a}_{2},\dots ,{a}_{n}\in \mathsf{\Sigma}$). All strings over the alphabet $\mathsf{\Sigma}$ form the set ${\mathsf{\Sigma}}^{\ast}$. As the basic data unit used by an IoT protocol, a protocol message m is essentially a string made up of a sequence of message fields. Thus, we denote a message field as $\varrho \in {\mathsf{\Sigma}}^{\ast}$.
In this paper, a protocol message is assumed to be a byte sequence generated by a hidden statistical process, denoted as $\mathsf{\Theta}$, whose statistical features shift whenever the byte sequence goes from one message field to another. As $\mathsf{\Theta}$ passes from one field (${\varrho}_{i}$) to another (${\varrho}_{j}$), its statistical characteristics change significantly. Thus, a changepoint occurs exactly at the boundary between two different message fields. Based on this observation, the problem of message field identification can be transformed into a changepoint detection problem in the statistical process underlying the protocol message.
Given a string ${\omega}_{o}=\overline{{x}_{1}\dots {x}_{n}}$, the q-length prefix of the last letter (i.e., ${x}_{n}$) in ${\omega}_{o}$ is denoted as $\mathfrak{T}({\omega}_{o},q)$, while the set of such prefixes in ${\omega}_{o}$ whose lengths are no longer than Q is denoted as $\mathcal{T}({\omega}_{o},Q)$. For instance, $\mathfrak{T}(\overline{{x}_{1}\dots {x}_{4}},2)=\overline{{x}_{2}{x}_{3}}$, $\mathfrak{T}(\overline{{x}_{1}\dots {x}_{4}},3)=\overline{{x}_{1}{x}_{2}{x}_{3}}$, and $\mathcal{T}({\omega}_{o},Q)=\{\mathfrak{T}({\omega}_{o},q):1\le q\le \min (Q,n-1),Q\in \mathbb{R}\}$.
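As an illustration, the prefix notation can be sketched in Python as follows. The function names `prefix` and `prefix_set` are hypothetical, and Q is treated as an integer for the purpose of the example.

```python
def prefix(omega, q):
    """Return the q letters that immediately precede the last letter of omega."""
    n = len(omega)
    if not 1 <= q <= n - 1:
        raise ValueError("q must satisfy 1 <= q <= len(omega) - 1")
    return omega[n - 1 - q : n - 1]


def prefix_set(omega, Q):
    """Return all prefixes of the last letter with length at most min(Q, n - 1)."""
    n = len(omega)
    return {prefix(omega, q) for q in range(1, min(Q, n - 1) + 1)}


message = b"\x01\x02\x03\x04"      # plays the role of x1 x2 x3 x4
print(prefix(message, 2))          # the bytes x2 x3
print(prefix(message, 3))          # the bytes x1 x2 x3
print(prefix_set(message, 2))      # both prefixes of length 1 and 2
```

The slice excludes the last letter itself, which matches the two worked instances given in the text.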
The prefix conditional probability of $\overline{{x}_{1}\dots {x}_{n}}$ is defined as
Let $m=\overline{{x}_{1}{x}_{2}{x}_{3}\dots}$ be a Q-order Markov process. Then, the likelihood of ${x}_{n}$ given $\overline{{x}_{1}\dots {x}_{n-1}}$ is
where $Q\in \mathbb{R}$ and $n>Q$.
Suppose that the byte sequence of a protocol message obeys a Q-order Markov process; then Equation (1) can be rewritten as follows.
where ${\omega}_{q}$ is the weight of $P({x}_{n}\mid \mathfrak{T}(\overline{{x}_{1}\dots {x}_{n}},q))$. Essentially, ${\omega}_{q}$ can be regarded as the importance of $\mathfrak{T}(\overline{{x}_{1}\dots {x}_{n}},q)$ for predicting the context of ${x}_{n}$.
The larger q is, the more important $\mathfrak{T}(\overline{{x}_{1}\dots {x}_{n}},q)$ is in predicting the context of ${x}_{n}$. For instance, $P(\text{“e”}\mid \text{“xampl”})$ is much more informative than $P(\text{“e”}\mid \text{“pl”})$ for foreseeing that the context of “e” is “example” instead of “multiple”. As a result, the weight ${\omega}_{q}$ in this paper is defined as
Additionally, $P({x}_{n}\mid \mathfrak{T}(\overline{{x}_{1}\dots {x}_{n}},q))$ is calculated by
where $\nu (\omega )$ is the frequency of $\omega $ in the training dataset $\mathcal{D}$.
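A minimal Python sketch of a weighted prefix conditional probability is given below. It rests on two assumptions that are not taken from the text: each conditional probability is estimated as the count ratio $\nu (\text{prefix}\cdot {x}_{n})/\nu (\text{prefix})$, and the weights grow linearly with q and sum to one. The text only states that a larger q receives a larger weight, so the linear choice is purely illustrative, as are the function names.

```python
from collections import Counter


def count_substrings(data, Q):
    """Count every substring of data with length 1..Q+1 (plays the role of nu)."""
    nu = Counter()
    for i in range(len(data)):
        for length in range(1, Q + 2):
            if i + length <= len(data):
                nu[data[i : i + length]] += 1
    return nu


def prefix_cond_prob(data, n, Q, nu):
    """Weighted estimate of p_n for the letter at 0-based index n.

    Assumptions (not from the source): P(x_n | prefix) is estimated as
    nu(prefix + x_n) / nu(prefix), and the weight of the q-length prefix
    is q / (1 + 2 + ... + q_max).
    """
    q_max = min(Q, n)
    if q_max == 0:
        return nu[data[0:1]] / len(data)   # no prefix: plain letter frequency
    total = q_max * (q_max + 1) / 2
    p = 0.0
    for q in range(1, q_max + 1):
        ctx = data[n - q : n]
        if nu[ctx]:
            p += (q / total) * nu[data[n - q : n + 1]] / nu[ctx]
    return p


data = b"abababab"
nu = count_substrings(data, Q=2)
print(prefix_cond_prob(data, n=2, Q=2, nu=nu))
```

In a strictly periodic string such as the one above, every letter is well predicted by its prefix, so the estimate stays high throughout.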
As shown in Figure 1, the prefix conditional probability of $\overline{{x}_{1}\dots {x}_{n}}$ is very high when ${x}_{n}$ and $\mathfrak{T}({x}_{n},q)$ are located in the same field; otherwise, it is low.
3.1. Min-max Formulation for Field Detection
There are two main formulations of the changepoint detection problem: the Bayesian formulation and the min-max formulation. The Bayesian formulation [43] assumes that the changepoint $\gamma $ obeys a prior distribution that is known in advance, while the min-max formulation [44] supposes that the changepoint as well as its statistical distribution are unknown.
In this paper, the statistical distribution of changepoints in protocol messages is unknown. As a result, the changepoint detection problem is represented in the min-max formulation. Page [45] proposed the cumulative sum (CUSUM) algorithm as an optimal solution to min-max formulated problems. Accordingly, a CUSUM-like algorithm is proposed in this paper to search for multiple changepoints. Since the statistical features of message fields are not known in advance, the likelihood ratio of the post-change probability to the pre-change probability, denoted as $L({X}_{n})$, cannot be calculated directly by $L({X}_{n})={f}_{{\varrho}_{1}}({X}_{n})/{f}_{{\varrho}_{0}}({X}_{n})$. Thus, $L({X}_{n})$ is replaced with a new metric in this paper as
Suppose $\gamma $ is a changepoint in a message and ${x}_{n}$ is the nth letter in the message. We assume that the post-change distribution of ${x}_{n}$ is ${f}_{{\varrho}_{1}}({X}_{n})$, while the pre-change distribution of ${x}_{n}$ is ${f}_{{\varrho}_{0}}({X}_{n})$. Then, the prefix conditional probability of ${x}_{n}$, i.e., ${p}_{n}$, would be much less than ${p}_{n-1}$, which results in a high and positive value of ${C}_{n}$. When $n<\gamma $, if ${x}_{n}$ and ${x}_{n-1}$ are located in the same field, they obey the same distribution, so that $1-{p}_{n}/{p}_{n-1}\le \epsilon $, where $\epsilon $ is a small positive value given as a threshold. If ${x}_{n}$ and ${x}_{n-1}$ are located in different fields, then $n-1$ is also a changepoint, which should be detected before $\gamma $. On the other hand, the value of ${C}_{n}$ is likely to be larger than the given threshold $\epsilon $ when $n>\gamma $. As a result, a detection indicator metric that can be computed recursively for multi-changepoint detection is defined as:
The stopping condition can be set as
where $\upsilon $ is the threshold of the detection indicator.
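The detection idea can be sketched in Python under explicit assumptions. The metric is taken to be ${C}_{n}=1-{p}_{n}/{p}_{n-1}$, as suggested by the inequality in the text, and the indicator is assumed to follow the classic one-sided CUSUM recursion ${u}_{n}=\max (0,{u}_{n-1}+{C}_{n})$, with detection declared once ${u}_{n}\ge \upsilon $. Both the recursion and the function name are illustrative choices rather than the exact definitions of this paper.

```python
def first_changepoint(p, upsilon, u0=0.0):
    """Return the first index n at which the CUSUM-style statistic reaches upsilon.

    p is a list of prefix conditional probabilities (all assumed > 0).
    Returns None when the threshold is never reached.
    """
    u = u0
    for n in range(1, len(p)):
        c_n = 1.0 - p[n] / p[n - 1]    # large and positive when p drops sharply
        u = max(0.0, u + c_n)          # assumed one-sided CUSUM recursion
        if u >= upsilon:
            return n
    return None


probs = [0.9, 0.9, 0.85, 0.1, 0.8, 0.9]
print(first_changepoint(probs, upsilon=0.5))   # 3, the index of the sharp drop
```

Small fluctuations such as 0.9 to 0.85 contribute little to the statistic, whereas the drop to 0.1 pushes it past the threshold in a single step.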
3.2. Multi-Changepoint Detection
Since the problem of message field identification in this paper is actually a multi-changepoint detection problem, the detection procedure presented in Section 3.1 has to be extended to a multi-round procedure, called MultiCUSUM.
A variable ${\chi}_{n}$ indicating the underlying state of ${x}_{n}$ is defined as
Accordingly, the detection statistic is
where ${u}_{0}$ is the initial condition of a new round of the detection procedure, which starts once the previous changepoint has been found.
The stopping time in the kth iteration, denoted as ${\tau}_{k}^{\ast}$, is defined as
where
with ${\mu}_{k}$ as the mean of $\{{C}_{{\tau}_{k-1}^{\ast}+1},\dots ,{C}_{n-1}\}$ and $\rho $ as the coefficient of ${\mu}_{k}$.
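A simplified Python sketch of the multi-round idea follows. It assumes the same metric ${C}_{n}=1-{p}_{n}/{p}_{n-1}$ and a one-sided CUSUM recursion, and it resets the statistic to ${u}_{0}$ after every detection. As a further simplification, a fixed threshold `upsilon` is used in place of the adaptive rule built from $\rho $ and ${\mu}_{k}$. The function name `multi_cusum` is hypothetical and is not the MultiCUSUM procedure itself.

```python
def multi_cusum(p, upsilon, u0=0.0):
    """Return all indices where the restarted CUSUM-style statistic reaches upsilon.

    Simplification: a fixed threshold replaces the adaptive rho * mu_k rule.
    """
    changepoints = []
    u = u0
    for n in range(1, len(p)):
        c_n = 1.0 - p[n] / p[n - 1]
        u = max(0.0, u + c_n)
        if u >= upsilon:
            changepoints.append(n)
            u = u0                     # start a new detection round
    return changepoints


probs = [0.9, 0.9, 0.1, 0.8, 0.9, 0.9, 0.05, 0.7]
print(multi_cusum(probs, upsilon=0.5))   # [2, 6]
```

Resetting the statistic after each detection is what allows the second drop to be found independently of the first.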
3.3. Message Segmenting Algorithm
A message segmenting algorithm, as shown in Algorithm 1, is proposed to segment the protocol message m into a set of message fields. In Algorithm 1, all messages associated with a specific protocol in $\mathcal{D}$ are concatenated one by one, in order of their appearance time, to form the message m. Then, a Q-depth suffix trie $\mathbf{T}$ is built to store the substrings of m with a maximum length of $Q+1$ (line 1). The prefix conditional probability ${p}_{n}$ is calculated according to Equation (3) (line 2) to enable the multi-changepoint detection procedure (MultiCUSUM()). The identified changepoints are put into ${\mathcal{P}}_{1}$ (line 3).
Algorithm 1 Message Segmenting Algorithm
Input: Message $m=\overline{{x}_{1}\dots {x}_{N}}$
Output: Segment set $\mathsf{\Omega}$
1: $\mathbf{T}\leftarrow $ QSufTrie(m); # Create the Q-depth suffix trie
2: $\mathbf{P}\leftarrow $ condProb(m,$\mathbf{T}$); # Compute the conditional probabilities: $\mathbf{P}=\{{p}_{n}:n=1,\dots ,N\}$
3: ${\mathcal{P}}_{1}\leftarrow $ MultiCUSUM(m,$\mathbf{P}$); # Changepoint detection, ${\mathcal{P}}_{1}$ is the changepoint set
4: ${m}^{R}\leftarrow \overline{{x}_{N}{x}_{N-1}\dots {x}_{1}}$; # Reverse the message stream
5: ${\mathbf{T}}^{R}\leftarrow $ QSufTrie(${m}^{R}$);
6: ${\mathbf{P}}^{R}\leftarrow $ condProb(${m}^{R}$,${\mathbf{T}}^{R}$);
7: ${\mathcal{P}}_{2}\leftarrow $ MultiCUSUM(${m}^{R}$,${\mathbf{P}}^{R}$);
8: $\mathcal{P}\leftarrow {\mathcal{P}}_{1}\cup {\mathcal{P}}_{2}$
9: $\mathsf{\Omega}\leftarrow $ MsgSeg(m,$\mathcal{P}$);

In fact, not all changepoints are sensitive enough to the prefix conditional probability $P({x}_{n}\mid \overline{{x}_{1}\dots {x}_{n-1}})$ to be detected by the aforementioned procedure; some are more sensitive to the postfix conditional probability $P({x}_{n}\mid \overline{{x}_{n+1}\dots {x}_{N}})$, which is essentially the prefix conditional probability of ${x}_{n}$ in the string obtained by reversing the original message. Therefore, we reverse the letter order of m (i.e., ${m}^{R}=\overline{{x}_{N}{x}_{N-1}\dots {x}_{1}}$) and perform the same detection procedure on ${m}^{R}$ to search for this type of changepoint (lines 4∼7), putting the results in ${\mathcal{P}}_{2}$.
Finally, the two sets of changepoints are merged by $\mathcal{P}={\mathcal{P}}_{1}\cup {\mathcal{P}}_{2}$ and the message m is divided into segments based on the changepoints in $\mathcal{P}$ (lines 8∼9).
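The forward and reverse passes and the final merge can be sketched in Python as follows. The detector is passed in as a callable and stands in for the trie construction, probability computation and MultiCUSUM steps; `toy_detect` is a hypothetical stand-in used only to make the example runnable. Boundary indices are assumed to be 0-based, with a new field starting at the reported index.

```python
def segment(message, detect):
    """Split message at changepoints found on the message and on its reverse.

    detect(seq) is any callable returning boundary indices b (a new field
    starts at seq[b]); it stands in for lines 1-3 and 5-7 of Algorithm 1.
    """
    n = len(message)
    forward = set(detect(message))
    # A field starting at index j of the reversed message corresponds to the
    # boundary n - j in the original message.
    backward = {n - j for j in detect(message[::-1])}
    boundaries = sorted(b for b in forward | backward if 0 < b < n)
    cuts = [0] + boundaries + [n]
    return [message[a:b] for a, b in zip(cuts, cuts[1:])]


def toy_detect(seq):
    """Hypothetical detector: report a boundary right after each space byte."""
    return [i + 1 for i, byte in enumerate(seq) if byte == 0x20 and i + 1 < len(seq)]


print(segment(b"GET /index.html HTTP/1.1", toy_detect))
```

With this toy detector the forward pass only finds the boundaries after each space, while the reverse pass contributes the boundaries before each space, so the merged set isolates the delimiters as separate segments.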
4. Inferring Message Fields
4.1. Occurrence Probability Analysis
To relieve the burden of the position-based statistic test analysis, a preprocessing step called occurrence probability analysis is applied to filter out the obvious data fields, whose occurrence probability is very low. Given a dataset $\mathcal{D}$ of size M, the occurrence probability of a string $\omega \in \mathsf{\Omega}$ in $\mathcal{D}$, denoted as ${p}_{D}(\omega )$, is defined as the ratio between the number of messages containing $\omega $, denoted as ${\nu}_{m}(\omega )$, and the size of the dataset.
A data field is variable, and the occurrence probability of each value in a data field is always very small, approaching zero. Therefore, data fields can be found by searching for those string segments whose occurrence probabilities are statistically zero. In this paper, the occurrences of message segments are assumed to obey a binomial distribution, and the binomial test is used to test whether the occurrence probability of each message segment is zero.
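The occurrence probability and a binomial test against a near-zero rate can be sketched in Python as follows. The occurrence probability follows the definition above. The test is an assumption: since a hypothesised probability of exactly zero would be rejected by any single occurrence, the sketch tests against a small hypothetical rate `p0` with an exact one-sided tail, and `p0`, `alpha` and the function names are illustrative values and names, not taken from the paper.

```python
from math import comb


def occurrence_probability(segment, messages):
    """p_D(segment): fraction of messages that contain the segment."""
    hits = sum(1 for m in messages if segment in m)
    return hits / len(messages)


def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0), computed exactly."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def looks_like_data_field(segment, messages, p0=0.01, alpha=0.05):
    """Keep H0 'occurrence probability <= p0' unless the tail probability < alpha.

    p0 is a hypothetical stand-in for 'statistically zero'.
    """
    k = sum(1 for m in messages if segment in m)
    return binomial_upper_tail(k, len(messages), p0) >= alpha


messages = [b"GET /item%03d" % i for i in range(150)]
messages += [b"POST /item%03d" % i for i in range(150, 200)]
print(occurrence_probability(b"GET", messages))      # 0.75
print(looks_like_data_field(b"/item007", messages))  # True: seen in one message only
print(looks_like_data_field(b"GET", messages))       # False: far too frequent
```

A segment that appears in a single message out of 200 is compatible with a near-zero rate, whereas a segment present in three quarters of the messages is not.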
Let the hypothesis be
where $\alpha $ is a significance level.
The strings in $\mathcal{F}$ are chosen as data fields according to
4.2. Position-Based Statistic Test Analysis
A keyword field appears frequently in many messages with similar functions, and its position is also relatively stable. That means both frequency and position are important features for inferring keyword fields from the segment set $\mathcal{F}$. As a result, a position-based statistic test is introduced to select keyword fields from $\{\omega :\omega \in (\mathcal{F}-{\mathcal{F}}_{d})\}$ by testing whether the position of a segment is fixed or quasi-fixed in messages.
Specifically, four kinds of positions of $\omega $ are considered in our scheme:
${P}^{\omega ,1}$: the distance between the message head and the position of $\omega $ in the message.
${P}^{\omega ,2}$: the distance between the message tail and the position of $\omega $ in the message.
${P}^{\omega ,3}$: the distance between the head of the line containing $\omega $ and the position of $\omega $ in that line.
${P}^{\omega ,4}$: the distance between the tail of the line containing $\omega $ and the position of $\omega $ in that line.
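The four positions can be computed with a short Python sketch. It makes assumptions that the text leaves open: distances are byte counts, head distances are measured to the start of $\omega $, tail distances are measured from the end of $\omega $, lines are separated by `\r\n`, and only the first occurrence is considered. The function name `positions` is hypothetical.

```python
def positions(message, omega, newline=b"\r\n"):
    """Return (P1, P2, P3, P4) for the first occurrence of omega, or None.

    Assumptions: distances are byte counts; P1/P3 are measured to the start of
    omega and P2/P4 from the end of omega; lines are separated by `newline`.
    """
    start = message.find(omega)
    if start < 0:
        return None
    end = start + len(omega)
    line_start = message.rfind(newline, 0, start)
    line_start = 0 if line_start < 0 else line_start + len(newline)
    line_end = message.find(newline, end)
    line_end = len(message) if line_end < 0 else line_end
    return (start, len(message) - end, start - line_start, line_end - end)


request = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
print(positions(request, b"GET"))    # (0, 44, 0, 21)
print(positions(request, b"Host"))   # (26, 17, 0, 13)
```

In this request both keywords sit at the head of their line, so the third position is 0 for each, while the first position differs.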
Let ${P}^{\omega ,r}=({p}_{1}^{\omega ,r},\dots ,{p}_{n}^{\omega ,r}),r\in \{1,2,3,4\}$ and define the support rate of ${p}_{i}^{\omega ,r}$, denoted as $N({p}_{i}^{\omega ,r})$, as the number of occurrences of ${p}_{i}^{\omega ,r}$ in $\mathcal{D}$. Based on the binomial test (see Section 4.1), the keyword fields are chosen by
given $\alpha $ as the significance level.
Equation (15) infers keywords whose positions are fixed by searching for segments satisfying ${\max}_{i,r}\{N({p}_{i}^{\omega ,r})\}$. It performs well on those $\omega $ that have one dominant position. For instance, “GET” in HTTP messages has one dominant position, i.e., at the head of a request message. However, some other keywords have more than one dominant position, so that there are multiple peaks in $N({p}_{i}^{\omega ,r})$.
To address the multi-peak issue, an algorithm (called MDLPTA) based on the minimum description length (MDL) [46] criterion is introduced to perform the position-based statistic test analysis, as shown in Algorithm 2.
For each $\omega $, k reference positions, ${B}_{k}=\{{b}_{1},\dots ,{b}_{k}\}$, whose support rates are the top k values in $\{N({p}_{j}^{\omega ,r}):{p}_{j}^{\omega ,r}\in {P}^{\omega ,r}\}$, are selected (line 5), and ${P}^{\omega ,r}$ is divided into k clusters, ${C}_{k}=\{c({b}_{1}),\dots ,c({b}_{k})\}$, according to the distance between ${p}_{j}^{\omega ,r}$ and the reference positions ${b}_{m}$, $m=1,2,\dots ,k$ (line 6).
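The selection of reference positions and the clustering step can be sketched in Python as follows. The sketch assumes that the distance is the absolute difference between positions and that ties are broken in favour of the smaller position; the function names are hypothetical stand-ins for TopK and Cluster.

```python
from collections import Counter


def reference_positions(observed, k):
    """Pick the k positions with the highest support (number of occurrences)."""
    support = Counter(observed)
    ranked = sorted(support, key=lambda pos: (-support[pos], pos))
    return ranked[:k]


def cluster_by_reference(observed, refs):
    """Assign every observed position to its nearest reference position."""
    clusters = {b: [] for b in refs}
    for pos in observed:
        nearest = min(refs, key=lambda b: (abs(pos - b), b))
        clusters[nearest].append(pos)
    return clusters


observed = [0, 0, 0, 0, 1, 16, 16, 16, 17, 40]
refs = reference_positions(observed, 2)
print(refs)                                   # [0, 16]
print(cluster_by_reference(observed, refs))   # {0: [0, 0, 0, 0, 1], 16: [16, 16, 16, 17, 40]}
```

The example has two peaks in the support rates, at positions 0 and 16, which is the multi-peak situation that motivates using more than one reference position.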
The entropy of ${C}_{k}$ is calculated through the following equation:
The model complexity of ${C}_{k}$ is $(\log k)/2$ and the total description length of ${C}_{k}$ is calculated in line 7, that is
The kth model in the model set $\mathsf{\Psi}$ is represented as $\{{B}_{k},{C}_{k},{L}_{k}\}$. The optimal model with the minimal description length is selected from $\mathsf{\Psi}$ (line 11). The computational complexity would be very high if all possible models were considered. Meanwhile, a keyword should not have many reference positions. As a result, only the first K models are considered in Algorithm 2 (lines 4∼10).
Algorithm 2 MDLPTA Algorithm
Input: K, $\mathcal{D}$ and $\omega \in (\mathcal{F}-{\mathcal{F}}_{d})$
Output: true if $\omega $ is a keyword field, or false otherwise.
1: $\mathsf{\Psi}\leftarrow \{\}$
2: ${P}^{\omega ,r}\leftarrow $ GetPos($\mathcal{D}$,$\omega $) # Get positions of $\omega $
3: $\mathcal{N}\leftarrow \{N({p}_{j}^{\omega ,r}):{p}_{j}^{\omega ,r}\in {P}^{\omega ,r}\}$
4: for i = 1 to K do
5:   ${B}_{i}\leftarrow $ TopK($\mathcal{N}$,i) # Get reference positions
6:   ${C}_{i}\leftarrow $ Cluster(${P}^{\omega ,r}$,${B}_{i}$) # Cluster ${P}^{\omega ,r}$
7:   ${L}_{i}\leftarrow $ CalDLen(${C}_{i}$) # Compute the description length of ${C}_{i}$
8:   ${\psi}_{i}\leftarrow \{{B}_{i},{C}_{i},{L}_{i}\}$
9:   $\mathsf{\Psi}\leftarrow \mathsf{\Psi}\cup \{{\psi}_{i}\}$
10: end for
11: ${\psi}^{\ast}\leftarrow $ minDLen($\mathsf{\Psi}$) # Get the model with minimal description length
12: for all ${b}_{i}\in {B}^{\ast}$ do
13:   res ← TestPos(${b}_{i}$, ${c}^{\ast}({b}_{i})$) # Check whether ${b}_{i}$ satisfies Eq. (19)
14:   if res == true then
15:     return true
16:   end if
17: end for
18: return false

The optimal model chosen by Algorithm 2 is ${\psi}^{\ast}=\{{B}^{\ast},{C}^{\ast},{L}^{\ast}\}$. For each reference position ${b}_{i}\in {B}^{\ast}$ and its cluster in ${C}^{\ast}=\{{c}^{\ast}({b}_{1}),\dots ,{c}^{\ast}({b}_{{k}^{\ast}})\}$, the following hypothesis is tested via the binomial test:
where ${b}_{i}\in {B}^{\ast}$ and $i=1,\dots ,{k}^{\ast}$.
The set of segments passing the hypothesis test is regarded as the keyword set ${\mathcal{F}}_{k}$, and the remaining ones are uncertain fields.
The inferred message fields are further refined, and some semantic information about them is determined. Specifically, consecutive data (or uncertain) segments are merged into a single data (or uncertain) field. Regular expressions representing specific semantic information, such as IP addresses, file names, URLs, timestamps and so on, are matched against the message fields so that some of their semantics can be inferred.
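The regular-expression matching step can be illustrated with a small Python sketch. The patterns and labels below are hypothetical and deliberately simple; they are not the expressions used in the proposed system and would need tightening for real traffic.

```python
import re

# Hypothetical, deliberately simple patterns; stricter ones are needed in practice.
SEMANTIC_PATTERNS = [
    ("ipv4", re.compile(rb"(?:\d{1,3}\.){3}\d{1,3}")),
    ("url", re.compile(rb"[a-z][a-z0-9+.-]*://[^\s]+", re.IGNORECASE)),
    ("timestamp", re.compile(rb"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")),
    ("filename", re.compile(rb"[\w-]+\.[A-Za-z0-9]{1,5}")),
]


def infer_semantics(field):
    """Return the first label whose pattern matches the whole field, else None."""
    for label, pattern in SEMANTIC_PATTERNS:
        if pattern.fullmatch(field):
            return label
    return None


for field in (b"192.168.0.1", b"http://example.org/a", b"2018-06-01 12:30:00",
              b"index.html", b"GET"):
    print(field, infer_semantics(field))
```

The patterns are tried in order and must match the whole field, so the more specific patterns (IP address, URL, timestamp) are listed before the generic file-name pattern.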
5. Evaluation
In this section, experiments are performed to evaluate the effectiveness of the proposed method. The experiments comprise two parts: message segmentation evaluation and a fuzzing test. The proposed message segmentation approach is implemented in a system called QCDPInfer, whose architecture is shown in Figure 2.
Six typical protocols that are widely used at the application layer (HTTP, FTP, SMTP, POP, DNS and QQ) are selected to test the effectiveness and efficiency of message segmentation. The recall and precision of keyword inference are shown in Table 1 and Table 2. Please note that the ground truth consists of those keywords that occur in the test set. DNS and QQ are not taken into account when evaluating the quality of the keyword set, since both are binary protocols and no concept of keyword is defined in a binary protocol.
By comparison, QCDPInfer has a higher recall rate than Discoverer and PI. In particular, PI’s recall rate is very low: its recall rates for HTTP, FTP, SMTP and POP are all less than 10%.
Discoverer is prone to inferring too many segments as keywords, so its precision is much lower than that of the proposed system. Although PI’s recall rate is very low, its precision for HTTP and FTP is extremely high. However, PI’s precision for the other protocols is still very low. It is worth mentioning that PI infers very few keywords, always fewer than 5 for each protocol considered.
The F-scores of the experimental results are shown in Figure 3. The proposed system has the highest F-score for all six protocols, which means our method performs well in keyword inference.
In the fuzzing test, QCDPInfer is extended with a fuzzing function to implement an automatic fuzzing tool (APREFuzz). APREFuzz is used to identify vulnerabilities in a system under test that is designed to introduce information-centric networking into IoT devices to enable their caching capability. The protocol used by the target system comprises 5 types of messages, responsible for sending interests, distributing data, pushing data, responding with target data and responding with no answer, respectively.
First, message fields are identified using the QCDPInfer system, and the message format is reconstructed.
Second, test files are generated by inserting fault data into one field according to the message format. Please note that, in a real fuzz test, fault data may be inserted into more than one field. However, as a proof-of-concept system, APREFuzz currently considers only scenarios with a single fault-injected field. It would not be difficult to extend the system to inject faults into multiple fields. When inserting the fault data, keyword fields are replaced only with inferred keywords according to the message format, data fields are replaced with random data, and uncertain fields are replaced with either inferred keywords or random data. In our experiments, the uncertain fields are treated the same as data fields.
Finally, the target system is treated as a black box and assumed to be unknown to us. APREFuzz sends test files to the target system file by file and monitors the reaction of the target system by analyzing its responses.
In our experiments, APREFuzz extracted 7 keyword fields and inferred 7 data fields in the sample message. One data field was found to contain only digits. The number of inferred keywords is 12. We use 11 abnormal strings for inserting fault data into the data fields, except for the one containing only digits. For the special field containing only digits, 21 boundary numbers are injected. As a result, the number of fault-injected files generated by APREFuzz is 248 (=$(12+11)\times 7+21\times 1+11\times 6$). On the other hand, FileFuzz generates $393,216$ (=$1.5\times 1024\times {2}^{8}$) fault-injected files by replacing each byte with the values from $0\mathtt{x}00$ to $0\mathtt{x}FF$. When the test files were sent to the target system, APREFuzz observed one exception in which the system failed to respond, while FileFuzz observed none. The exception may indicate a vulnerability that could be leveraged to launch a DoS attack, or other attacks that would ruin the system’s availability. Other tools are needed to analyze the exception in depth and determine its type and impact. However, that work is beyond the scope of this paper and is not presented here.
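The two file counts can be checked with a few lines of Python. The function and parameter names are hypothetical; the grouping of terms follows the count formula above, and the factor $1.5\times 1024$ is assumed to be the size of the sample in bytes.

```python
def count_test_files(keyword_fields, plain_data_fields, numeric_fields,
                     keywords, abnormal_strings, boundary_numbers):
    """Number of single-field fault-injected files under the stated scheme."""
    per_keyword_field = keywords + abnormal_strings
    return (per_keyword_field * keyword_fields
            + boundary_numbers * numeric_fields
            + abnormal_strings * plain_data_fields)


apre_fuzz = count_test_files(keyword_fields=7, plain_data_fields=6, numeric_fields=1,
                             keywords=12, abnormal_strings=11, boundary_numbers=21)
file_fuzz = int(1.5 * 1024) * 2**8   # every byte of the sample, 256 values each

print(apre_fuzz)                     # 248
print(file_fuzz)                     # 393216
print(round(file_fuzz / apre_fuzz))  # 1586
```

Under these figures the format-aware generator needs roughly 1586 times fewer test files than byte-wise replacement.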
6. Conclusions
The proposed method applies a protocol reverse engineering approach to improve IoT protocol fuzzing performance by creating valid and effective test files based on the protocol message format and greatly reducing the number of test files. It uses the statistical attributes of message fields to locate their boundaries by searching for changepoints in the messages, and then reconstructs the message format. A CUSUM-like algorithm is presented to address the problem of multi-changepoint detection. Additional procedures, including an occurrence probability test and a position test, are further employed to classify the message segments into keyword fields, data fields and uncertain fields. The results show that the extracted message formats are useful for generating test files for network protocol fuzzing.
In the future, with sufficient improvement over the current version, APREFuzz could become a practical and powerful tool for automatically generating test files for fuzzing IoT protocols or devices to reveal their hidden vulnerabilities. It would also contribute to strengthening IoT security in an effective and efficient way, and could even serve as a security tool for improving protocol fuzzing in many other types of networks.
Author Contributions
Conceptualization, J.Z.L. and J.C.; methodology, J.Z.L.; software, J.Z.L. and C.S.; validation, J.Z.L., J.C., Y.L. and C.S.; formal analysis, C.S.; investigation, C.S.; resources, J.C.; data curation, Y.L.; writing—original draft preparation, J.Z.L.; writing—review and editing, J.C. and C.S.; visualization, Y.L.; supervision, J.C.; project administration, J.C.; funding acquisition, J.C.
Funding
This research was funded by the National Natural Science Foundation of China (Grant No.: 61702120, 61571141); Natural Science Foundation of Guangdong Province (Grant No.: 2017A030310591, 2014A030313637, 2015A030313672); Department of Education of Guangdong Province (Grant No.: YQ2015105, 2016GCZX006, 2016KQNCX091); Guangdong Provincial Application-oriented Technical Research and Development Special fund project (Grant No.: 2015B010131017); Guangdong Science and Technology Department (Grant No.: 2016A010120010, 2014A010103032, 2017A090905023); Science and Technology Program of Guangzhou (Grant No.: 201604016108).
Acknowledgments
The authors would also like to thank the anonymous reviewers for their valuable comments.
Conflicts of Interest
The authors declare no conflict of interest.
References
 Munea, T.L.; Lim, H.; Shon, T. Network protocol fuzz testing for information systems and applications: A survey and taxonomy. Multimed. Tools Appl. 2016, 75, 14745–14757. [Google Scholar] [CrossRef]
 Kim, H.C.; Choi, Y.H.; Lee, D.H. Efficient file fuzz testing using automated analysis of binary file format. J. Syst. Archit. 2011, 57, 259–268. [Google Scholar] [CrossRef]
 Duchêne, J.; Le Guernic, C.; Alata, E.; Nicomette, V.; Kaâniche, M. State of the art of network protocol reverse engineering tools. J. Comput. Virol. Hacking Tech. 2018, 14, 53–68. [Google Scholar] [CrossRef]
 Aminikhanghahi, S.; Cook, D.J. A survey of methods for time series change point detection. Knowl. Inf. Syst. 2017, 51, 339–367. [Google Scholar] [CrossRef] [PubMed]
 Yan, H.; Li, X.; Wang, Y.; Jia, C. Centralized Duplicate Removal Video Storage System with Privacy Preservation in IoT. Sensors 2018, 18, 1814. [Google Scholar] [CrossRef] [PubMed]
 Yang, Y.; Zheng, X.; Tang, C. Lightweight distributed secure data management system for health internet of things. J. Netw. Comput. Appl. 2017, 89, 26–37. [Google Scholar] [CrossRef]
 Tan, Q.; Gao, Y.; Shi, J.; Wang, X.; Fang, B.; Tian, Z.H. Towards a Comprehensive Insight into the Eclipse Attacks of Tor Hidden Services. IEEE Internet Things J. 2018. [Google Scholar] [CrossRef]
Wang, Z. A privacy-preserving and accountable authentication protocol for IoT end-devices with weaker identity. Future Gener. Comput. Syst. 2018, 82, 342–348.
Luo, E.; Bhuiyan, M.Z.A.; Wang, G.; Rahman, M.A.; Wu, J.; Atiquzzaman, M. PrivacyProtector: Privacy-Protected Patient Data Collection in IoT-Based Healthcare Systems. IEEE Commun. Mag. 2018, 56, 163–168.
Mao, Y.; Li, J.; Chen, M.R.; Liu, J.; Xie, C.; Zhan, Y. Fully secure fuzzy identity-based encryption for secure IoT communications. Comput. Stand. Interfaces 2016, 44, 117–121.
Liu, Q.; Wang, G.; Liu, X.; Peng, T.; Wu, J. Achieving reliable and secure services in cloud computing environments. Comput. Electr. Eng. 2017, 59, 153–164.
Chen, Z.; Peng, L.; Gao, C.; Yang, B.; Chen, Y.; Li, J. Flexible neural trees based early stage identification for IP traffic. Soft Comput. 2017, 21, 2035–2046.
Meng, W.; Tischhauser, E.W.; Wang, Q.; Wang, Y.; Han, J. When Intrusion Detection Meets Blockchain Technology: A Review. IEEE Access 2018, 6, 10179–10188.
Zhou, Z.; Dong, M.; Ota, K.; Wang, G.; Yang, L.T. Energy-Efficient Resource Allocation for D2D Communications Underlaying Cloud-RAN-Based LTE-A Networks. IEEE Internet Things J. 2016, 3, 428–438.
Cai, J.; Wang, Y.; Liu, Y.; Luo, J.Z.; Wei, W.; Xu, X. Enhancing network capacity by weakening community structure in scale-free network. Future Gener. Comput. Syst. 2018, 87, 765–771.
Gorbunov, S.; Rosenbloom, A. AutoFuzz: Automated Network Protocol Fuzzing Framework. Int. J. Comput. Sci. Netw. Secur. 2010, 10, 239–245.
Kitagawa, T.; Hanaoka, M.; Kono, K. AspFuzz: A state-aware protocol fuzzer based on application-layer protocols. In Proceedings of the IEEE Symposium on Computers and Communications, Riccione, Italy, 22–25 June 2010; pp. 202–208.
Tsankov, P.; Dashti, M.T.; Basin, D. SecFuzz: Fuzz-testing security protocols. In Proceedings of the International Workshop on Automation of Software Test, Zurich, Switzerland, 2–3 June 2012; pp. 1–7.
Zhao, J.; Chen, S.; Liang, S.; Cui, B.; Song, X. RFSM-Fuzzing a Smart Fuzzing Algorithm Based on Regression FSM. In Proceedings of the Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, Compiegne, France, 28–30 October 2013; pp. 380–386.
Narayan, J.; Shukla, S.K.; Clancy, T.C. A survey of automatic protocol reverse engineering tools. ACM Comput. Surv. (CSUR) 2016, 48, 40.
Beddoe, M.A. Network Protocol Analysis Using Bioinformatics Algorithms. 2004. Available online: http://www.4tphi.net/~awalters/PI/pi.pdf (accessed on 28 October 2018).
Cui, W.; Kannan, J.; Wang, H.J. Discoverer: Automatic protocol reverse engineering from network traces. In Proceedings of the 16th USENIX Security Symposium, Boston, MA, USA, 6–10 August 2007; USENIX Association: Berkeley, CA, USA, 2007; pp. 1–14.
Tao, S.; Yu, H.; Li, Q. Bit-oriented format extraction approach for automatic binary protocol reverse engineering. IET Commun. 2016, 10, 709–716.
Zhengyang, C.; Bowen, Y.; Yu, Z.; Jianzhong, Z.; Jingdong, X. Automatic Mobile Application Traffic Identification by Convolutional Neural Networks. In Proceedings of the IEEE Trustcom/BigDataSE/SPA, Tianjin, China, 23–26 August 2016; pp. 301–307.
Xiao, M.M.; Zhang, S.L.; Luo, Y.P. Automatic network protocol message format analysis. J. Intell. Fuzzy Syst. 2016, 31, 2271–2279.
Caballero, J.; Yin, H.; Liang, Z.; Song, D. Polyglot: Automatic extraction of protocol message format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 29 October–2 November 2007; ACM: New York, NY, USA, 2007; pp. 317–329.
Cui, W.; Peinado, M.; Chen, K.; Wang, H.J.; Irun-Briz, L. Tupni: Automatic reverse engineering of input formats. In Proceedings of the 15th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 27–31 October 2008; ACM: New York, NY, USA, 2008; pp. 391–402.
Lin, Z.; Jiang, X.; Xu, D.; Zhang, X. Automatic Protocol Format Reverse Engineering through Context-Aware Monitored Execution. NDSS 2008, 8, 1–15.
Comparetti, P.; Wondracek, G.; Kruegel, C.; Kirda, E. Prospex: Protocol Specification Extraction. In Proceedings of the 2009 30th IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 17–30 May 2009; pp. 110–125.
Caballero, J.; Poosankam, P.; Kreibich, C.; Song, D. Dispatcher: Enabling active botnet infiltration using automatic protocol reverse-engineering. In Proceedings of the 16th ACM Conference on Computer and Communications Security, Chicago, IL, USA, 9–13 November 2009; ACM: New York, NY, USA, 2009; pp. 621–634.
Meng, W.; Wang, Y.; Wong, D.S.; Wen, S.; Xiang, Y. TouchWB: Touch behavioral user authentication based on web browsing on smartphones. J. Netw. Comput. Appl. 2018, 117, 1–9.
Li, J.; Sun, L.; Yan, Q.; Li, Z.; Srisa-an, W.; Ye, H. Significant Permission Identification for Machine Learning Based Android Malware Detection. IEEE Trans. Ind. Inform. 2018.
Liu, Y.; Ling, J.; Liu, Z.; Shen, J.; Gao, C. Finger Vein Secure Biometric Template Generation Based on Deep Learning. Soft Comput. 2018, 22, 2257–2265.
Yuan, C.; Li, X.; Wu, Q.; Li, J.; Sun, X. Fingerprint Liveness Detection from Different Fingerprint Materials Using Convolutional Neural Network and Principal Component Analysis. CMC-Comput. Mater. Contin. 2017, 53, 357–371.
Meng, W.; Jiang, L.; Wang, Y.; Li, J.; Zhang, J.; Xiang, Y. JFCGuard: Detecting juice filming charging attack via processor usage analysis on smartphones. Comput. Secur. 2018, 76, 252–264.
Chen, S.; Wang, G.; Yan, G.; Xie, D. Multi-dimensional fuzzy trust evaluation for mobile social networks based on dynamic community structures. Concurr. Comput. Pract. Exp. 2017, 29, e3901.
Li, P.; Li, J.; Huang, Z.; Gao, C.Z.; Chen, W.B.; Chen, K. Privacy-preserving outsourced classification in cloud computing. Clust. Comput. 2017.
Li, P.; Li, J.; Huang, Z.; Li, T.; Gao, C.Z.; Yiu, S.M.; Chen, K. Multi-key privacy-preserving deep learning in cloud computing. Future Gener. Comput. Syst. 2017, 74, 76–85.
Li, J.; Zhang, Y.; Chen, X.; Xiang, Y. Secure attribute-based data sharing for resource-limited users in cloud computing. Comput. Secur. 2018, 72, 1–12.
Gao, C.Z.; Cheng, Q.; Li, X.; Xia, S.B. Cloud-assisted privacy-preserving profile-matching scheme under multiple keys in mobile social network. Clust. Comput. 2018.
Luo, E.; Liu, Q.; Abawajy, J.H.; Wang, G. Privacy-preserving multi-hop profile-matching protocol for proximity mobile social networks. Future Gener. Comput. Syst. 2017, 68, 222–233.
Gao, C.Z.; Cheng, Q.; He, P.; Susilo, W.; Li, J. Privacy-preserving Naive Bayes classifiers secure against the substitution-then-comparison attack. Inf. Sci. 2018, 444, 72–88.
Shiryaev, A. On Optimum Methods in Quickest Detection Problems. Theory Probab. Appl. 1963, 8, 22–46.
Lorden, G. Procedures for Reacting to a Change in Distribution. Ann. Math. Stat. 1971, 42, 1897–1908.
Page, E.S. Continuous Inspection Schemes. Biometrika 1954, 41, 100–115.
Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 1984, 30, 629–636.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).