Abstract

Widely used network protocols play a crucial role in many systems. However, vulnerabilities introduced by the design of a protocol or by its implementation have led to multiple security incidents and substantial losses. Hence, it is important to study protocol fuzzing in order to check the correctness of protocol implementations. The main challenges of protocol fuzzing are the mutation of protocol messages and the deep interactivity of protocol implementations. This paper proposes a model-based grey-box fuzzing approach for protocol implementations, covering both the server-side and the client-side. The proposed method is divided into two phases: automata learning based on the minimally adequate teacher (MAT) framework and grey-box fuzzing guided by the learned model and code coverage. The StateFuzzer tool is presented and evaluated to demonstrate the validity and feasibility of the proposed approach. On the server-side, it achieves code coverage and vulnerability discovery capability similar to or higher than those of AFLNET and StateAFL. When the client is also fuzzed, it achieves on average 1.5X the branch coverage of the default AFL and 1.3X the branch coverage of AFLNET and StateAFL on typical implementations such as OpenSSL, LibreSSL, and Live555. StateFuzzer identifies a new memory corruption bug in Live555 (2021-08-25) and 14 distinct discrepancies based on differential testing.

1. Introduction

With the rapid development of computer networks, more and more applications rely on network communication. As the carriers of network transmissions, protocols play an essential role in cyberspace. Logic errors in protocol design and bugs in implementations lead to vulnerabilities that cause significant harm, for example, the Heartbleed vulnerability of OpenSSL [1], the CCS injection vulnerability [2], and the Server Message Block protocol vulnerability (CVE-2020-0796).

Several automated software testing techniques have been proposed to find vulnerabilities. Compared with symbolic execution and code auditing, fuzzing is one of the most efficient techniques for detecting security vulnerabilities in real-world software because it is easy to use. However, several challenges exist when fuzzing servers, that is, protocol implementations.

In contrast to regular programs, protocol implementations process inputs according to an underlying state model, which determines the processing logic of all interactive messages. American Fuzzy Lop (AFL) [3] and LibFuzzer [4] are popular coverage-based grey-box fuzzing (CGF) tools. They target stateless programs and individual functions without in-depth interaction. Such single-input fuzzing is regarded as stateless fuzzing. Stateful black-box fuzzing (SBF) tools, such as Peach [5] and Boofuzz [6], form another well-known family. In these approaches, the effectiveness depends on the given state machine, which is obtained from the RFC specification or from captured traffic data.

The protocol state fuzzing [7] is another branch of protocol testing. It first learns the state machine of the protocol implementation by black-box testing and then finds the suspicious logic by comparing the learned model and the specification or analyzing the differences between several different versions. However, it can only find the logical errors by manual comparison, while lacking the ability to discover crashes such as buffer overflow vulnerabilities.

Hence, it is important to combine the state machine with grey-box fuzzing. An existing approach, AFLNET [8], dynamically constructs the state machine while fuzzing, with the aim of gradually completing the state model. However, the resulting state model is inaccurate, which leads to the loss of interesting test cases and to redundant test cases. Hence, this paper performs grey-box fuzzing based on a model obtained by active automata learning.

Most protocol fuzzing tools only tackle server programs and ignore client testing, although memory bugs and semantic errors also exist in clients. Fiterau-Brostean [9] applies protocol state fuzzing to the sliding window behavior of TCP; however, the model learned through active learning is an abstract model, which may not cover abnormal behavior. As an example of a client-side flaw, the Secure Copy Protocol (SCP) client cannot verify that the object returned by the SCP server corresponds to what was requested, which allows manipulation by a malicious server or a man-in-the-middle attacker (CVE-2019-6110). Hence, it is important to fuzz the client-side programs of the application layer protocols.

The contributions of this paper are summarized as follows:
(1) A model-based grey-box fuzzing framework, which consists of automata learning and state-aware grey-box fuzzing, is proposed.
(2) StateFuzzer (https://gitee.com/z11panyan/state fuzzer.git), which can fuzz the server-side and client-side implementations of the application layer protocols, is implemented.
(3) The experimental results on the open-source SSL libraries (OpenSSL and LibreSSL) show that StateFuzzer can achieve 1.5X code coverage (on average) compared with the default AFLNWE and 1.3X code coverage compared with AFLNET. In addition, the effects on SMTP and RTSP are compared, and a new memory bug and an undisclosed vulnerability are found.

The remainder of this paper is organized as follows. The existing technologies are classified in Section 2. The motivation is introduced in Section 3. The proposed method is detailed in Section 4. The experimentation and evaluation are presented in Section 5. The related studies are discussed in Section 6. Finally, the conclusion and perspective are drawn in Section 7.

2. Background

In the absence of an established naming convention and classification for the existing technologies, this paper classifies them according to their different emphases and targets.

2.1. Software Testing and Fuzzing

Software testing, the technique of verifying the correctness of a program and identifying its errors against a set of rules, can be divided into specification-based and code-based testing [10]. According to Tretmans [11], an implementation either conforms to the rules or violates them. The rules can accordingly be divided into specification-based and code-based rules. The code-based rules are basic program rules. For instance, use-after-free, buffer overflow, and double-free are strictly forbidden. The specification-based rules are extracted from requests for comments (RFCs).

Fuzzing is one of the most efficient software testing techniques. A test suite consisting of test cases generated from the rules is fed to the implementation in order to discover abnormal behaviors. A test case is a pair of an input and its expected output. For a given input, the rules determine the expected output, while the implementation produces the actual output. A test case conforms to the rules if the two outputs agree. Hence, fuzzing can be described as the search for a test case whose actual output differs from the expected output, in which case the implementation violates the rules. Similarly, fuzzing can be divided into specification-based and code-based. For code-based fuzzing, the expected output can be defined independently of the input, namely that no basic program rule is violated. The generation of the input is critical. For protocol fuzzing, some analysts focus on the generation of a single message, while others pay more attention to stateful fuzzing.

In specification-based fuzzing, the specification is extracted from the RFC. However, it is nontrivial to derive the expected output of an input from the RFC. Hence, protocol state fuzzing and differential testing are proposed to reduce the number of test cases.

Protocol state fuzzing is also referred to as learning-based testing or model-based testing. The protocol state fuzzing consists in first inferring a state machine from the protocol implementation based on active automata learning. The state machine is then checked against the specification. The details are provided in Section 2.2.

Because it is difficult to check whether an implementation conforms to the specification, differential testing is one of the approaches used to find inconsistencies. Given two implementations, if they produce different outputs for the same input, the test case should be analyzed.

The learning-based fuzzing [27] can be considered as the combination of two methods. It first learns a hypothesis model by active automata learning and then tries to find the inputs that reveal nonconformance between another implementation and the hypothesis. Hence, the learning-based fuzzing is considered as differential testing, in which the generator of test cases is based on the state machine.

In summary, the existing approaches are divided into specification-based and code-based fuzzing (cf. Table 1). Note that the related work in Section 6 is reviewed according to this classification.

2.2. Active Automata Learning

Active automata learning is an active method of model learning. It is one of the most efficient approaches for inferring the model of a black-box system. The MAT framework [40] is widely used for active learning. It includes a learner, who only knows the input and output symbols of the system under learning (SUL), and a teacher, who knows all the information about the target system. The learner learns the unknown model by querying the teacher. More precisely, the learner poses a membership query (MQ) by sending a message sequence to the SUL. If the SUL accepts it, the teacher returns “yes”. Otherwise, it returns “no”. The learner then constructs an automaton (a.k.a. hypothesis) from the answers according to the learning algorithm and submits it to the teacher, which is referred to as an equivalence query (EQ). The teacher judges whether the behavior of the automaton matches the target system. If it does, learning terminates. Otherwise, the teacher returns a counterexample, which the learner uses to refine the hypothesis.
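The two query types can be illustrated with a minimal Python sketch. The `Teacher` class, the toy SUL, and the bounded exhaustive search used to approximate the equivalence query are assumptions made for illustration; they are not the LearnLib interfaces, and the membership query returns the output word (as is usual when learning Mealy machines) instead of a yes/no answer.

```python
import itertools
from typing import Callable, List, Optional, Sequence

Word = Sequence[str]


class Teacher:
    """Answers queries about a black-box system under learning (SUL)."""

    def __init__(self, sul: Callable[[Word], List[str]], alphabet: List[str]):
        self.sul = sul
        self.alphabet = alphabet

    def membership_query(self, word: Word) -> List[str]:
        # For Mealy-style learning the answer is the output word.
        return self.sul(word)

    def equivalence_query(self, hypothesis: Callable[[Word], List[str]],
                          max_len: int = 3) -> Optional[List[str]]:
        # Approximation: test every input word up to max_len symbols.
        for length in range(1, max_len + 1):
            for word in itertools.product(self.alphabet, repeat=length):
                if hypothesis(word) != self.sul(word):
                    return list(word)  # counterexample
        return None  # no difference found within the bound


def toy_sul(word: Word) -> List[str]:
    # Hypothetical SUL: PLAY succeeds only after a SETUP has been seen.
    ready, outputs = False, []
    for symbol in word:
        if symbol == "SETUP":
            ready = True
            outputs.append("200")
        else:
            outputs.append("200" if ready else "454")
    return outputs


def naive_hypothesis(word: Word) -> List[str]:
    # One-state hypothesis that answers 200 to every input.
    return ["200"] * len(word)


teacher = Teacher(toy_sul, ["SETUP", "PLAY"])
counterexample = teacher.equivalence_query(naive_hypothesis)
```

Here the naive hypothesis is refuted by the single-symbol word `PLAY`, which the learner would use to split the initial state in the next round.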

The Mealy machine is one of the most common models. It can be defined as a 4-tuple $M = \langle Q, q_0, \delta, \lambda \rangle$ based on $I$ and $O$, where $I$ is the finite set of input symbols, $O$ represents the finite set of output symbols, $Q$ denotes the finite set of states, $q_0 \in Q$ is the initial state, $\delta$ represents the state transition, and $\lambda$ denotes the output function. They can be written as $\delta: Q \times I \rightarrow Q$ and $\lambda: Q \times I \rightarrow O$. In the initial state, the state transition and output function for an input $i \in I$ can be written as $\delta(q_0, i)$ and $\lambda(q_0, i)$, respectively. Protocol implementations can be abstracted into Mealy machines, where the messages sent by the client and server can be simplified as the input and output symbols.
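To make the definition concrete, the following Python sketch encodes a Mealy machine as two lookup tables, one for the state transition and one for the output function. The class name, the state names, and the two-state RTSP-like table are hypothetical and are not taken from the models learned in this paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MealyMachine:
    """Mealy machine given by a transition table and an output table."""
    initial: str
    delta: Dict[Tuple[str, str], str]  # (state, input) -> next state
    lam: Dict[Tuple[str, str], str]    # (state, input) -> output

    def run(self, inputs: List[str]) -> List[str]:
        """Feed an input word from the initial state and collect the outputs."""
        state, outputs = self.initial, []
        for symbol in inputs:
            outputs.append(self.lam[(state, symbol)])
            state = self.delta[(state, symbol)]
        return outputs


# Hypothetical toy model: PLAY is rejected with 454 until SETUP was sent.
toy = MealyMachine(
    initial="s0",
    delta={("s0", "SETUP"): "s1", ("s0", "PLAY"): "s0",
           ("s1", "SETUP"): "s1", ("s1", "PLAY"): "s1"},
    lam={("s0", "SETUP"): "200", ("s0", "PLAY"): "454",
         ("s1", "SETUP"): "200", ("s1", "PLAY"): "200"},
)
```

Calling `toy.run(["PLAY", "SETUP", "PLAY"])` shows that the same input `PLAY` yields different outputs depending on the current state.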

In order to apply this technique to realistic systems with a large number of inputs and outputs, Aarts [41] added a “mapper” component to the MAT framework. The mapper is located between the learner and the SUL and performs abstraction and concretization. More precisely, the learner sends an abstract symbol to the mapper, which converts it into a concrete message based on the input alphabet and sends it to the SUL. In turn, the mapper converts the response back into an abstract symbol and returns it to the learner. Finally, a hypothesis model that is equivalent to the implementation is obtained.

Given the previously mentioned statements, protocol state fuzzing can be divided into two stages. The first stage is the generation of test cases based on active automata learning, where each input is a sequence of symbols. In the second stage, if the output observed on the implementation differs from the output expected by the specification, the test case triggers a semantic bug.

3. Motivation

Based on the previous analysis, this paper pays more attention to stateful fuzzing. A program can be viewed as a state machine with a very large state space and a large input/output alphabet. Thus, it is challenging to explore the whole state space. A fuzzer tries to sample from “interesting” regions of the state space as efficiently as possible. The analysts can abstract the large state set into a smaller set whose states are separated by certain operations, as shown in Figure 1(a). The fuzzer can then explore edge tuples on this state machine [42]. This feature is more prominent in protocol implementations. Therefore, the analysts attempt to fuzz protocol implementations based on the state machine.

The state-of-the-art state-guided protocol fuzzing approach, namely AFLNET, constructs the state machine based only on the server responses, unlike a Mealy machine, which also takes the inputs into account. If a new status code appears in a server response, a new state is added. Figure 1(b) presents the state machine of Live555 obtained from AFLNET, where each graph node represents a state labeled with its status code, and the state with the label “0” is the initial state.

AFLNET relies on the status codes of messages, which leads to the loss of interesting test cases and to redundant test cases. Considering the sequence “200-404-454” as an example, it means that an input sequence whose response sequence is “200-404-454” exists. If a response sequence such as “454-405”, which does not exist in Figure 1(b), is observed, it is considered an interesting transition, and the corresponding input sequence is added to the seed pool. Because the sequence “200-200” already exists, the response sequence “200-200-200” is not considered interesting. However, the input sequence of “200-200” may be “DESCRIBE-SETUP” and that of “200-200-200” may be “DESCRIBE-SETUP-PLAY.” In that case, the sequence “200-200-200” should be regarded as an interesting sequence. In addition, the input sequence of the response sequence “200-200-200” may be “DESCRIBE-SETUP-PLAY” or “DESCRIBE-SETUP-SETUP”, and the two are not distinguished in AFLNET.
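The effect can be reproduced with a small Python sketch. It models the response-only bookkeeping as a set of transitions between consecutive status codes, which is a simplifying assumption about how such a fuzzer records states and is not AFLNET code; the function names are hypothetical.

```python
from typing import Sequence, Set, Tuple


def response_edges(codes: Sequence[str]) -> Set[Tuple[str, str]]:
    """Transitions seen when only response codes are tracked ("0" = initial)."""
    states = ["0", *codes]
    return set(zip(states, states[1:]))


def io_trace(inputs: Sequence[str],
             codes: Sequence[str]) -> Tuple[Tuple[str, str], ...]:
    """Trace of (input, output) pairs, as a Mealy machine would record it."""
    return tuple(zip(inputs, codes))


known_edges = response_edges(["200", "200"])          # e.g. DESCRIBE-SETUP
longer_edges = response_edges(["200", "200", "200"])  # e.g. DESCRIBE-SETUP-PLAY

# With response codes only, the longer sequence adds no new transition.
looks_redundant = longer_edges <= known_edges

# With the inputs included, the two candidate sequences are told apart.
play_trace = io_trace(["DESCRIBE", "SETUP", "PLAY"], ["200", "200", "200"])
setup_trace = io_trace(["DESCRIBE", "SETUP", "SETUP"], ["200", "200", "200"])
distinguished = play_trace != setup_trace
```

Both flags evaluate to true: the longer response sequence looks redundant under response-only tracking, while the input/output traces differ.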

The state machine in Figure 1(c) is the Mealy machine learned by active automata learning. The model considers both the inputs and the outputs, which is coherent with the programmer’s logic. Moreover, the issues of AFLNET are avoided once the inputs are considered. As previously discussed, “DESCRIBE-SETUP-PLAY” is considered an interesting transition. Furthermore, the same output “200” for the different inputs “SETUP” and “PLAY” has different meanings, which can be distinguished by the Mealy machine. Other protocols, such as SMTP, have the same problem.

In summary, the Mealy machine is more expressive than the state machine used by AFLNET. This paper takes advantage of both the coverage and the state transitions in order to guide the fuzzing on top of active automata learning.

4. Methodology

Based on the previous analysis, a model-based grey-box fuzzing framework is proposed (cf. Figure 2). A high-level overview of the proposed method is first provided, followed by a detailed description. The approach is divided into two stages: learning and fuzzing. The goal of the learning stage is to obtain the state machine over the given alphabets. Because protocol messages are too complex to be used directly as symbols, it is necessary to construct a mapper between the learner and the SUL, also referred to as the system under testing (SUT).

At the stage of fuzzing, the test cases are generated based on the state machine and mutation, and the weights of seeds are adjusted by the code coverage. The test cases causing crashes and coverage increases are exported to a specified structure. In addition, a differential checker is provided to identify semantic bugs.

4.1. Learning Phase

The learning component reuses the StateLearner tool. It learns the state machine by calling the interface of LearnLib, which provides different learning algorithms, such as LStar [43] and TTT [44], and equivalence query algorithms, such as the W-method and the modified W-method. The mapper, also referred to as the test harness, is the core of StateLearner. The test harness acts as a stateless client, which converts the symbols of the given alphabet into concrete messages and sends them in order.

Different methods exist for building the mapper for encryption protocols and plaintext protocols. For encryption protocols, the encryption and decryption components should be manually adjusted in order to realize the mapper [7]. As for plaintext protocols, the keywords representing the protocol state can be extracted from the existing data packets, based on protocol reverse analysis. The mapper can be constructed according to the keywords and their corresponding messages, where the keywords are abstract symbols and messages are concrete inputs.

The mapper construction is explained below using the RTSP protocol as an example. The keyword is the first field of the message, where fields are separated by the space character. The alphabet is composed of all the unique keywords, and the concrete inputs are the related messages, as shown in Table 2. The first 3 bytes of the response data are extracted as abstract outputs. In particular, if the response times out, the output is set to “empty”. Once a connection is closed, all outputs returned afterwards are the same (referred to as connection closed).
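A mapper of this kind can be sketched in Python as follows. The function names and the sample messages are hypothetical. The sketch also assumes that the three bytes used as the abstract output are the status code that follows the protocol version in the status line (for example `200` in `RTSP/1.0 200 OK`); this reading is an assumption and not a statement about the StateFuzzer implementation.

```python
from typing import Dict, Iterable, Optional


def abstract_input(message: str) -> str:
    """Keyword = first space-separated field of the request."""
    return message.split(" ", 1)[0]


def abstract_output(response: Optional[bytes], closed: bool = False) -> str:
    """Map a raw response to an abstract output symbol."""
    if closed:
        return "connectionclosed"
    if not response:  # timeout or empty read
        return "empty"
    parts = response.split(b" ", 2)
    if len(parts) < 2:
        return "unknown"  # hypothetical fallback for unparsable data
    return parts[1][:3].decode("ascii", errors="replace")


def build_mapper(messages: Iterable[str]) -> Dict[str, str]:
    """Abstract symbol -> first concrete message observed for that keyword."""
    mapper: Dict[str, str] = {}
    for message in messages:
        mapper.setdefault(abstract_input(message), message)
    return mapper


# Hypothetical captured requests used to populate the alphabet.
captured = [
    "DESCRIBE rtsp://127.0.0.1:8554/test.mkv RTSP/1.0\r\nCSeq: 2\r\n\r\n",
    "SETUP rtsp://127.0.0.1:8554/test.mkv/track1 RTSP/1.0\r\nCSeq: 3\r\n\r\n",
    "SETUP rtsp://127.0.0.1:8554/test.mkv/track2 RTSP/1.0\r\nCSeq: 4\r\n\r\n",
]
mapper = build_mapper(captured)
```

Keeping only the first message per keyword gives one standard concrete input per abstract symbol, which is what the learner needs; other choices are possible.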

The state machine of the RTSP implementation can be inferred based on this mapper. The state machine is represented by a structure array, which contains three elements: a state identification, the outputs of all the symbols in the alphabet, and the target state of all the symbols. Based on the state machine, the output and the target state of each symbol in each state can be obtained.

For different protocols, the Request Sequence Parser component in AFLNET uses protocol-specific information about the message structure to extract status codes. Similarly, the mapper can be constructed as described previously, which gives the approach the same extensibility to other protocols as AFLNET.

4.2. Grey-Box Fuzzing

Due to the complexity of the protocol message structure, the critical points of grey-box fuzzing are the mutation and the scheduling algorithm based on the coverage and the state machine. The primary problem is how to use the state machine. Not only the states but also the state transitions strongly affect the execution of protocol implementations, as stated by Zou et al. [39]. The interesting transitions are obtained from the state machine by breadth-first search, as shown in Algorithm 1. The inputs of the algorithm are the model, its initial state, and the sink state. The output and target state of each symbol in each state can be obtained from the model. The output of any symbol in the sink state is “connectionclosed”. The lists of states, input symbols, and output symbols together form a structure that represents a path. For the current path, the target state and output of each symbol are obtained. If the target state is already in the state list of the path or is the sink state, the extended path is added to the result set. Otherwise, it is pushed into the queue. The output of the algorithm is the set of interesting paths, each of which contains the lists of states, input symbols, and output symbols. Since the mapper stores the correspondence between symbols and messages, a standard message sequence can be obtained for each path.
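The breadth-first search described above can be sketched in Python as follows. The dictionary-based model representation, the function name, and the toy model are assumptions made for illustration; the sketch follows the prose description and is not a transcription of Algorithm 1.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# state -> input symbol -> (output symbol, target state)
Model = Dict[str, Dict[str, Tuple[str, str]]]
# (states, input symbols, output symbols)
Path = Tuple[List[str], List[str], List[str]]


def interesting_paths(model: Model, initial: str, sink: str) -> List[Path]:
    """Collect paths that end by revisiting a state or by reaching the sink."""
    paths: List[Path] = []
    queue: Deque[Path] = deque([([initial], [], [])])
    while queue:
        states, inputs, outputs = queue.popleft()
        for symbol, (output, target) in model[states[-1]].items():
            extended = (states + [target], inputs + [symbol], outputs + [output])
            if target in states or target == sink:
                paths.append(extended)   # path is complete
            else:
                queue.append(extended)   # keep exploring from the new state
    return paths


# Hypothetical three-state model; "sink" stands for the closed connection.
toy_model: Model = {
    "s0": {"SETUP": ("200", "s1"), "PLAY": ("454", "s0")},
    "s1": {"PLAY": ("200", "s1"), "TEARDOWN": ("200", "sink")},
    "sink": {"SETUP": ("connectionclosed", "sink")},
}
paths = interesting_paths(toy_model, "s0", "sink")
```

Because a path is only extended while its states are pairwise distinct, every queued path has at most as many states as the model, so the search terminates.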

Algorithm 1: Breadth-first search for interesting paths on the learned state machine.

The scheduling strategy of fuzzing is based on two assumptions: (i) the “deeper” the network communication, the more likely an error exists; (ii) the more code edges are covered, the more likely an error exists. It is important to mention that different paths reaching the same state may execute different code. Hence, a hierarchical scheduling strategy is designed. The seeds are classified according to the paths, as shown in Figure 3. A seed pool is constructed for each path.

In order to determine the initial weight of every path and every seed, every standard message sequence is mutated as a seed. After a certain number of mutations, the initial scores and seeds are obtained. This is referred to as the initialization phase, as stated in Algorithm 2. The random phase based on the hierarchical scheduling strategy is then executed. When selecting a seed, the fuzzer first chooses a path according to the path weights and then chooses a seed from the pool of that path. In addition, each interesting sequence is added to the seed pool with its weight, and the weight of the seed pool is increased.
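The two-level selection can be sketched in Python as follows. The class and function names are hypothetical, and the update rule (adding the coverage gain to the pool weight and using it as the new seed's weight) is an assumption, since the exact weight formula is not given in the text.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Seed:
    messages: List[bytes]
    weight: float = 1.0


@dataclass
class PathPool:
    """Seed pool attached to one interesting path of the state machine."""
    name: str
    weight: float = 1.0
    seeds: List[Seed] = field(default_factory=list)


def pick_seed(pools: List[PathPool], rng: random.Random) -> Tuple[PathPool, Seed]:
    """First choose a path by weight, then a seed from that path's pool."""
    candidates = [pool for pool in pools if pool.seeds]
    if not candidates:
        raise ValueError("no seeds available")
    pool = rng.choices(candidates, weights=[p.weight for p in candidates])[0]
    seed = rng.choices(pool.seeds, weights=[s.weight for s in pool.seeds])[0]
    return pool, seed


def reward(pool: PathPool, messages: List[bytes], gain: float) -> None:
    """Store an interesting sequence and raise the weight of its pool.

    Assumed rule: gain is a positive number such as the count of new edges.
    """
    pool.seeds.append(Seed(messages, weight=gain))
    pool.weight += gain
```

A fuzzing loop would call `pick_seed`, mutate the returned seed, run the target, and call `reward` whenever new coverage is observed.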

Algorithm 2: Initialization phase of the hierarchical scheduling strategy.

In contrast to fuzzing with a single input, protocol fuzzing requires deep interaction. In other words, the test cases are sequences of multiple messages. Given a seed, that is, a message sequence, the index of the message to be mutated is first selected at random, for example, the second message. The mutated message sequence is the original sequence in which the selected message is replaced by its mutated version. Most of the common mutation strategies, such as bitflip, shuffling, erasing, swapping, inserting, and splicing, are implemented at the message level. The ordered message sequence is then regarded as a single seed in which each message is treated as a byte. At this level, the mutation strategies include splicing between sequences and reordering within a single sequence.
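The two mutation levels can be sketched in Python as follows. Only a single-bit flip is shown at the message level, and the function names are hypothetical; the sketch is meant to show the structure of the mutation and does not reproduce the full operator set of StateFuzzer.

```python
import random
from typing import List


def bitflip(message: bytes, rng: random.Random) -> bytes:
    """Flip one randomly chosen bit of a message."""
    if not message:
        return message
    data = bytearray(message)
    position = rng.randrange(len(data) * 8)
    data[position // 8] ^= 1 << (position % 8)
    return bytes(data)


def mutate_message(seed: List[bytes], rng: random.Random) -> List[bytes]:
    """Message level: mutate one randomly selected message, keep the rest."""
    index = rng.randrange(len(seed))
    return seed[:index] + [bitflip(seed[index], rng)] + seed[index + 1:]


def shuffle_sequence(seed: List[bytes], rng: random.Random) -> List[bytes]:
    """Sequence level: treat each message like a byte and reorder them."""
    mutated = list(seed)
    rng.shuffle(mutated)
    return mutated


def splice_sequences(a: List[bytes], b: List[bytes],
                     rng: random.Random) -> List[bytes]:
    """Sequence level: join a non-empty prefix of a with a non-empty suffix of b."""
    cut_a = rng.randrange(1, len(a) + 1)
    cut_b = rng.randrange(len(b))
    return a[:cut_a] + b[cut_b:]
```

All three sequence functions assume non-empty seeds, which holds for seeds derived from the standard message sequences of the interesting paths.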

4.3. Client Side

Several open-source protocol implementations include both server-side and client-side functions. Client-side vulnerabilities are often overlooked, which leads to hazards. The client-side is tested in a similar way to the server-side. The tool interacts with the client by simulating the server. It only needs to construct different types of response packets and does not need to implement the complete logic of the server-side.

The difference between fuzzing on the server-side and client-side is that the client to be tested should connect to the simulated server actively. The tested code heavily relies on the client’s request. Considering RTSP as an example, different codes are executed when requesting to play different types of files. Hence, as many types of the request as possible should be tested.

At the learning stage, the input and output symbols are swapped relative to the server case. The same three-byte status code may have different meanings. For instance, “200” in RTSP may be the status code of a response to DESCRIBE or to SETUP. When fuzzing the client, the status codes are the input symbols. Hence, they should be marked with their meaning, as shown in Section 5.1.

At the fuzzing stage, the process is the same as that of the server-side, and no further changes are required.

4.4. Differential Checker

Besides memory bugs, semantic bugs in protocol implementations deserve attention. Semantic bugs refer to conflicts between the implementation and the RFC specification. Most fuzzers (such as AFLNET) detect memory bugs based on sanitizers, which are unable to detect semantic vulnerabilities. Because protocol implementations are diverse, differential testing is used to detect the differences among them. The semantic errors are then confirmed by manual analysis.

Based on the idea of differential testing, a differential checker component is designed to help discover inconsistencies between different implementations. It is meaningless to compare the raw responses of the implementations directly because they contain timestamps and random fields. Hence, a response is abstracted as a symbol based on the mapper. Inevitably, some subtle inconsistencies may be lost. However, this avoids the large-scale analysis of the RFC documents and reduces the difficulty and cost of the manual analysis.

Because of these nondeterministic fields, it is crucial to define a reasonable metric that assesses whether two responses are discrepant or in agreement. TLS-diff defines a reduction function that maps a TLS implementation’s response to a reduced representation [25]; for example, the responses to ClientHello are divided into Handshake and Alert based on the reduction function. In this paper, the mapper plays the role of the reduction function.

The detailed algorithm of the testing phase is similar to that of the fuzzing stage. However, the criteria for recording a test case are not the same. The outputs of the differential checker are the test cases that lead to different responses.
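The comparison step can be sketched in Python as follows. The abstraction function shown here extracts an RTSP-style status code and stands in for the mapper; it and the sample responses are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence

Abstractor = Callable[[Optional[bytes]], str]


def status_symbol(response: Optional[bytes]) -> str:
    """Hypothetical abstraction: the status code of an RTSP-style response."""
    if not response:
        return "empty"
    parts = response.split(b" ", 2)
    if len(parts) < 2:
        return "unknown"
    return parts[1][:3].decode("ascii", errors="replace")


def first_discrepancy(responses_a: Sequence[Optional[bytes]],
                      responses_b: Sequence[Optional[bytes]],
                      abstract: Abstractor) -> Optional[int]:
    """Index of the first message whose abstract responses differ, else None."""
    symbols_a: List[str] = [abstract(r) for r in responses_a]
    symbols_b: List[str] = [abstract(r) for r in responses_b]
    for index, (a, b) in enumerate(zip(symbols_a, symbols_b)):
        if a != b:
            return index
    if len(symbols_a) != len(symbols_b):
        return min(len(symbols_a), len(symbols_b))
    return None


# Same test case replayed against two hypothetical implementations.
impl_a = [b"RTSP/1.0 200 OK\r\nDate: Mon\r\n\r\n",
          b"RTSP/1.0 454 Session Not Found\r\n\r\n"]
impl_b = [b"RTSP/1.0 200 OK\r\nDate: Tue\r\n\r\n",
          b"RTSP/1.0 400 Bad Request\r\n\r\n"]
discrepancy = first_discrepancy(impl_a, impl_b, status_symbol)
```

The differing `Date` headers in the first response are ignored after abstraction, while the different status codes of the second response are reported.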

In order to avoid reporting the same root cause many times, a deduplication strategy is used. That is, if the sets of basic blocks triggered by two test cases are the same, the two test cases are considered duplicates [45].
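This deduplication rule amounts to keying each test case by the set of basic blocks it triggers. A minimal Python sketch follows; the function name and the integer block identifiers are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def deduplicate(cases: Iterable[Tuple[str, Iterable[int]]]) -> List[str]:
    """Keep the first test case for every distinct set of basic blocks."""
    seen: Dict[FrozenSet[int], str] = {}
    for case_id, blocks in cases:
        # Order and repetition of blocks are ignored by the frozenset key.
        seen.setdefault(frozenset(blocks), case_id)
    return list(seen.values())


unique = deduplicate([
    ("t1", [1, 2, 3]),
    ("t2", [3, 2, 1]),   # same block set as t1, treated as a duplicate
    ("t3", [1, 2]),
])
```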

5. Experimentation and Evaluation

StateFuzzer is developed in Java. The fuzzing component is common to all protocols, whereas the learning component requires customized development for each type of protocol. The code coverage is obtained based on LLVM. Following LibFuzzer, which relies on the simple built-in code coverage instrumentation (SanitizerCoverage), the source code of the protocol implementations is slightly modified.

The experimental environment and results are detailed in the sequel. All the experiments are performed on an Ubuntu server (16.04 LTS) with 4 CPUs and 8 GB RAM.

5.1. Experimental Setup

We performed experiments on ProFuzzBench, a public benchmark for network fuzzers [38]. The target programs are slightly modified in the previously described manner and compiled with CLANG. The instrumented programs feed back the code coverage to guide the selection and mutation of seeds. After fuzzing, the recorded test cases are sent to the standard program compiled by GCC with the “-ftest-coverage” parameter. The results of the line and branch coverage are generated simultaneously. The following abbreviations are used in the remainder of the text: l_abs (lines absolute), l_per (lines percentage), b_abs (branches absolute), and b_per (branches percentage).

The efficiency is evaluated by comparison with three baseline approaches: a stateless coverage-guided fuzzer (AFLNWE (https://github.com/profuzzbench/aflnwe)) and two grey-box stateful fuzzers (AFLNET and StateAFL). The TLS encryption protocol as well as the SMTP and RTSP plaintext protocols are used. The target implementations are OpenSSL3.0.0 (0437435a), OpenSSL1.0.1f (0d877634), LibreSSL3.2.1, Exim (Version 4.89), and Live555 (0.92). In order to reduce the influence of the initial conditions as much as possible, the same seeds as AFLNET are used. The parameters used for automata learning are “LStar” and “Modified W-method”.

In order to learn the model of the three protocols, the alphabets are defined as follows:
(1) TLS server: ClientHello (RSA and DHE), Certificate (RSA and empty), ClientKeyExchange, ClientCertificateVerify, ChangeCipherSpec, Finished, and ApplicationData (regular and empty)
(2) TLS client: ServerHello (RSA and DHE), Certificate (RSA and empty), CertificateRequest, ServerKeyExchange, ServerHelloDone, ChangeCipherSpec, Finished, and ApplicationData (regular and empty)
(3) RTSP server: OPTIONS, DESCRIBE, SETUP, TEARDOWN, PLAY, and PAUSE
(4) RTSP client: DESCRIBE (200), SETUP (200), TEARDOWN (200), PLAY (200), Session_NotFound (454), Stream_NotFound (404), Method_NotAllowed (405), and BadRequest (400)
(5) SMTP server: HELO, RSET, MAIL, RCPT, DATA, and QUIT

Because the three baseline approaches do not support the fuzzing of client programs, the experimental results are divided into two parts for analysis. First, the effects of fuzzing only the server-side programs are compared. Second, the gains from fuzzing the client-side programs are analyzed. For most of the implementations, the growth in code coverage leveled off after 4 hours of fuzzing. Therefore, each fuzzing tool is run on each target program for 4 hours, and the experiment is repeated 4 times to reduce the influence of randomness on the results. The follow-up figures and tables report the average values obtained with this experimental method.

When analyzing the gains from client-side fuzzing, the server and client are tested for 2 hours. In order to obtain the coverage information, the recorded inputs of fuzzing the server and the client are replayed. Since OpenSSL, LibreSSL, and Live555 are also available as clients, the coverage of the server and client can be put together.

5.2. Fuzzing Performance

Figure 4 presents the average percentage of branches and lines covered by AFLNWE, AFLNET, StateAFL, and StateFuzzer within four hours for four repetitions on OpenSSL3.0.0 (0437435a). It can be deduced from the obtained results that three stateful network fuzzers show an evident increase in branch and line coverage, while the stateless fuzzer has a moderate efficiency. When fuzzing only the server, the effect of our method is slightly better than AFLNET and StateAFL.

Figure 5 shows the line coverage percentage for each implementation and each fuzzer after 4 hours. For TLS (OpenSSL and LibreSSL) and SMTP (Exim), the three stateful fuzzers achieve a significant improvement in code coverage, while the four fuzzers cover similar lines and branches for RTSP (Live555). The reason is that the Live555 server can handle the messages normally even when all of them are sent in one packet. StateAFL eliminates the need for protocol-specific parsing, but its effect is the worst of the three stateful fuzzers. In particular, for the encryption protocol, StateFuzzer has a better effect than the other two stateful tools because the encryption and decryption processes are implemented, which allows deeper interaction with the target program.

Moreover, the lines covered by StateFuzzer and AFLNET are compared in detail. It can be seen from Figure 6 that for OpenSSL3.0.0, Exim4.89, and Live555 (0.92), most of the lines covered by the two methods are the same. Considering OpenSSL3.0.0 as an example, 9503 lines are covered by both methods, while StateFuzzer covers 471 additional lines and AFLNET covers 87 additional lines.

When client fuzzing is introduced, additional paths that server fuzzing cannot reach are covered. In Figure 5, the grey color above the red color represents the additional lines covered by the client fuzzing. The code coverage is considerably increased by fuzzing the client-side program. Since the implementation “Exim” does not have a client-side functionality, the grey color does not exist for it. Table 3 illustrates the average growth rate of the four implementations. For OpenSSL3.0.0, 1.49X branch coverage (on average) is achieved compared with the default AFLNWE, while 1.27X code coverage is achieved compared with AFLNET. The branch coverage increases by 85% and 30% compared with AFLNWE and AFLNET on LibreSSL3.2.1, respectively. For Live555 (ceeb4f46), a 34% increase in branch coverage is observed.

Besides, a comparison between the black-box and grey-box is performed. Table 4 illustrates that the effect of grey-box fuzzing is better than that of the black-box which lacks coverage guidance.

5.3. Memory Bug Discovery

In order to compare the vulnerability discovery capabilities of the four tools, this paper focuses on Live555 (ceeb4f46) and collects the vulnerabilities triggered by sending malformed packets. Four known vulnerabilities meet this requirement: CVE-2020-24027, CVE-2021-38381, CVE-2021-39282, and CVE-2021-38383. The results show that all the fuzzers discover the four known vulnerabilities. In addition, a new crash is detected by StateFuzzer. Manual analysis shows that the vulnerability also exists in the latest version, and it has been submitted to the vendor for patching. The analysis of the root cause shows that the vulnerability depends on a specific order of packets. If the client sends a packet including PLAY and SETUP after sending SETUP and PLAY several times, a heap-use-after-free bug occurs when processing the Matroska file. In addition, a specific time interval between the packets is required, which makes the bug difficult to trigger. Due to the random nature of the mutation, not every fuzzing run triggers the vulnerability. It takes about three hours to trigger it, while it only takes 15 minutes to trigger CVE-2021-38381 and CVE-2021-39282. Moreover, an undisclosed stack buffer overflow, which was patched on 2018.10.17, is found.

Furthermore, for Exim4.89 (SMTP), StateFuzzer is the only fuzzer able to discover CVE-2017-16943 and to provide a new PoC test case.

5.4. Inconsistencies

Large differences exist between the different versions. Considering Exim4.93 and Exim4.94 as an example, HELO is required at the beginning of the normal functional logic in Exim4.94, while it is unnecessary in Exim4.93, which results in different state machines. Essentially, Exim4.93 violates the rule “The first command in a session must be the HELO command” in RFC821. Hence, several relatively recent versions with the same state machine are selected, such as OpenSSL3.0.0 vs. OpenSSL1.1.1 h (f123043) vs. LibreSSL3.2.1, Exim4.93 vs. Exim4.89, and Live555 (0.92) vs. Live555 (1.02). Finally, some meaningful test cases are manually analyzed.

5.4.1. OpenSSL3.0.0 vs. OpenSSL1.1.1h (f123043) vs. LibreSSL3.2.1

There are four discrepancies between OpenSSL and LibreSSL when processing the ClientHello. The two libraries validate the content fields of some extensions differently. For instance, when parsing the renegotiation extension, OpenSSL only parses the length field, while LibreSSL also checks the subsequent bytes. In addition, StateFuzzer can discover deep inconsistencies that require interactions. For example, OpenSSL3.0.0 immediately sends an alert when receiving an invalid Diffie-Hellman client key exchange, while LibreSSL3.2.1 and OpenSSL1.1.1h do not respond immediately. This behavior is not clearly defined in RFC5246, which only stipulates, for the RSA-encrypted premaster secret message, that “in any case, a TLS server MUST NOT generate an alert if processing an RSA-encrypted premaster secret message fails”.

The client-side programs are also analyzed based on differential testing. For instance, there are two version fields in ServerHello. An invalid first version field is ignored by OpenSSL, while it results in an alert in LibreSSL. Similarly, deeper inconsistencies can be detected in the client-side programs. LibreSSL and OpenSSL process a negative serial number in the certificate differently (https://github.com/openssl/openssl/issues/4320): OpenSSL treats it as illegal extra padding, while it passes the verification in LibreSSL.
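The core of such a differential check can be sketched as follows: the same test case is replayed against each implementation, the responses are abstracted, and any test case on which the implementations disagree is reported. This is a minimal sketch rather than the checker implemented in StateFuzzer. The function name `find_discrepancies`, the test case identifiers, and the abstract response labels (`"Alert"`, `"NoResponse"`) are assumptions made for illustration.

```python
from collections import defaultdict


def find_discrepancies(responses):
    """Report test cases on which the implementations disagree.

    `responses` maps a test case id to {implementation name: abstract response}.
    The result maps each disagreeing test case id to
    {abstract response: [implementations that produced it]}.
    """
    discrepancies = {}
    for case_id, per_impl in responses.items():
        groups = defaultdict(list)
        for impl, response in sorted(per_impl.items()):
            groups[response].append(impl)
        if len(groups) > 1:  # at least two distinct behaviours observed
            discrepancies[case_id] = dict(groups)
    return discrepancies


# Abstracted observations; the labels are illustrative only.
observed = {
    "invalid_dh_client_key_exchange": {
        "OpenSSL3.0.0": "Alert",
        "OpenSSL1.1.1h": "NoResponse",
        "LibreSSL3.2.1": "NoResponse",
    },
    "valid_handshake": {
        "OpenSSL3.0.0": "Finished",
        "OpenSSL1.1.1h": "Finished",
        "LibreSSL3.2.1": "Finished",
    },
}
report = find_discrepancies(observed)
```

Grouping by response rather than comparing pairs keeps the report readable when more than two implementations are tested, since each distinct behaviour appears once with the list of implementations that exhibit it.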

5.4.2. Exim4.93 vs. Exim4.89

An inconsistency exists in Exim4.89. Any BDAT command sent after the BDAT LAST is illegal and MUST be replied to with a 503 “Bad sequence of commands” reply code, as described in RFC3030. Exim4.89 ignores this requirement, which results in CVE-2017-16943.
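The RFC3030 rule quoted above can be written as a small oracle that flags replies violating it. This is a minimal sketch and not part of StateFuzzer. It assumes that a MAIL or RSET command starts a new transaction and therefore clears the "BDAT LAST seen" flag, and the helper names `expected_bdat_replies` and `violations` are hypothetical.

```python
def expected_bdat_replies(commands):
    """Return, per command, 503 if the rule demands it, else None.

    None means the rule places no constraint on the reply to that command.
    """
    last_seen = False
    expected = []
    for command in commands:
        words = command.strip().upper().split()
        verb = words[0] if words else ""
        if verb in ("MAIL", "RSET"):
            # Assumption: a new transaction resets the BDAT LAST flag.
            last_seen = False
            expected.append(None)
        elif verb == "BDAT" and last_seen:
            expected.append(503)  # BDAT after BDAT LAST is illegal
        else:
            if verb == "BDAT" and words[-1] == "LAST":
                last_seen = True
            expected.append(None)
    return expected


def violations(commands, reply_codes):
    """Indices of commands whose observed reply code breaks the rule."""
    expected = expected_bdat_replies(commands)
    return [
        index
        for index, (want, got) in enumerate(zip(expected, reply_codes))
        if want is not None and got != want
    ]
```

For the command sequence MAIL, RCPT, `BDAT 5`, `BDAT 0 LAST`, `BDAT 5`, the oracle expects 503 for the final command, so a server that answers 250 there is reported at index 4.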

5.4.3. Live555 (0.92) vs. Live555 (1.02)

Because the known vulnerabilities cause the program to crash, they also cause differences between the two implementations.

To sum up, 14 distinct discrepancies are discovered in the experiments (cf. Table 5). Some discrepancies are caused by implementation errors, and others by the implementers’ different interpretations of the RFCs. The test cases that caused crashes or discrepancies are provided in the supplementary material.

5.5. Discussion

The proposed method combines the advantages of active automata learning and grey-box fuzzing. The experimental results demonstrate that model-based grey-box fuzzing is valuable. It explores more paths within a limited period, owing to the more accurate abstract model obtained by active automata learning and to the fuzzing of the client. Similar tools, such as Boofuzz and Peach, require a lot of effort to manually construct the input and code the state machine. Compared with them, active automata learning improves the level of automation.

This section focuses on the three studies that are also critical reference objects of the presented work. The first method is learning-based fuzzing, which combines automata learning and fuzz testing. Its target is the black-box system, and it focuses on inconsistencies among multiple implementations. However, it does not leverage instrumentation feedback, which would allow the fuzzer to distinguish the inputs that execute new branches from the inputs that do not reach new code.

The other two methods are grey-box fuzzing tools for network protocols built on top of AFL. AFLNET and StateAFL follow a different technical route from the proposed method. Compared with traditional fuzzing tools, AFLNET uses state feedback and coverage feedback to guide the mutation of seeds, while treating the message sequences as the fuzzing input to enable deep interaction with protocol implementations. StateAFL [37] infers the states from memory characteristics by inserting probes on memory allocations and network I/O operations. It increases the degree of automation by eliminating the need for detailed manual analysis of protocols. However, specific modifications are necessary for special memory operations in the source code.

Compared with AFLNET and StateAFL, StateFuzzer requires more understanding and analysis to construct the test harness for each protocol. Once the test harness for a protocol is developed, different implementations of that protocol can be learned and fuzzed. Developing harnesses for more protocols is the focal point of our future work. In addition, because client-side programs have received little research attention, the client programs are fuzzed based on the presented framework, and a differential checker is designed to improve the efficiency of finding semantic bugs.

Besides, the proposed method only focuses on the state machine of the protocol. Because protocol messages are highly structured inputs, it is crucial to pay more attention to how a single message is mutated, which is analogous to format-aware fuzzing. Fioraldi et al. [46] divide the techniques of format-aware fuzzing into grammar-based, where the inputs comply with a language grammar, and chunk-based, where the inputs are represented by a tree hierarchy with C structure-like data chunks forming the individual nodes. Grammar-based techniques suit text protocols, while chunk-based techniques suit binary protocols. In future work, we aim at extending format-aware fuzzing to protocol fuzzing [47].
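The chunk-based idea can be sketched with a small tree of chunks in which a mutation touches one leaf and the enclosing length fields are recomputed on serialization, so the mutated message stays structurally valid. This is an illustrative sketch and not the technique of [46] or of StateFuzzer. The `Chunk` class, the 2-byte big-endian length prefix for inner nodes, and the helper names are assumptions.

```python
import random
import struct
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A node of the chunk tree: a leaf holds bytes, an inner node holds children."""
    name: str
    data: bytes = b""
    children: list = field(default_factory=list)

    def serialize(self):
        """Leaf: raw bytes. Inner node: 2-byte length prefix plus children."""
        if not self.children:
            return self.data
        body = b"".join(child.serialize() for child in self.children)
        return struct.pack(">H", len(body)) + body


def leaves(chunk):
    """Collect all leaf chunks in depth-first order."""
    if not chunk.children:
        return [chunk]
    found = []
    for child in chunk.children:
        found.extend(leaves(child))
    return found


def mutate_one_leaf(root, rng):
    """Replace the bytes of one randomly chosen leaf; return its name."""
    target = rng.choice(leaves(root))
    target.data = bytes(rng.randrange(256) for _ in range(rng.randint(0, 8)))
    return target.name


extension = Chunk("extension", children=[
    Chunk("type", b"\xff\x01"),
    Chunk("body", b"\x00"),
])
before = extension.serialize()
mutate_one_leaf(extension, random.Random(0))
after = extension.serialize()
```

Because the length prefix is derived from the children at serialization time, `after` carries a correct length field regardless of how many bytes the mutation produced. A byte-level mutator would usually break that field and cause the message to be rejected early by the parser.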

6.1. Fuzzing

Fuzzing, which automatically constructs invalid inputs and sends them to the target software, is one of the research hotspots in vulnerability mining. The generation of test cases and the feedback guidance strategy are its two key points. According to the feedback provided by the running program, fuzzing is usually divided into black-box, grey-box, and white-box. Black-box fuzzing does not require any feedback data from the target program. It pays more attention to the mutation methods, such as input structure-based mutation [6] and generation strategies based on deep learning [47]. White-box fuzzing constructs test cases based on the internal logic of the program and maximizes the code coverage by using dynamic symbolic execution [48, 49] and taint analysis [50]. SAGE [51] is one of the representative tools. However, the preconditions and complexity of white-box fuzzing are relatively high. Grey-box fuzzing mainly focuses on code coverage (basic blocks, paths, functions, etc.) and data flow. Popular tools, such as AFL and LibFuzzer, obtain the code coverage at runtime through code instrumentation (i.e., of the source code or the binary).

Researchers have summarized the open challenges and opportunities for fuzzing [52, 53]. The related works for testing on protocols are presented in the sequel.

6.2. Code-Based Fuzzing

Due to the features of protocols, seed construction and stateful fuzzing are the vital points of protocol fuzzing. As for the test cases, Walz et al. [25] propose a tree structure, referred to as the general message tree (GMT), to describe specific TLS messages (ClientHello). A message can be converted into a GMT and back, which allows the TLS messages to be mutated efficiently. In addition, researchers try to generate test cases based on deep learning, such as Seq2Seq [34] and GAN [32]. For the TLS protocol, Somorovsky [27] implements an open-source and extensible TLS fuzzing framework in which all the protocol fields are designed as variables. This method can dynamically construct TLS messages and TLS records. In addition, due to the wide use and insecurity of IoT protocols, more researchers focus on the fuzzing of IoT. For instance, Chen et al. [30] obtain the protocol fields of interest by analyzing the code of the IoT application and mutate the relevant fields to fuzz the IoT device. Similarly, Snipuzz [35] is a black-box fuzzing tool for IoT firmware that infers and mutates message snippets.

Format awareness can boost coverage-guided fuzzing (CGF). In fact, it is worthwhile to draw lessons from the techniques of format-aware fuzzing and apply them to the mutation of protocol messages. Aschermann et al. [54] take advantage of description-based input generation and feedback-based fuzzing to improve the efficiency of generation. Blazytko et al. [55] propose GRIMOIRE, which synthesizes structure while fuzzing. GRIMOIRE reduces an interesting input to the fragments that cause new coverage and generates new inputs by recursive replacement and fragment splicing. Pham et al. [56] construct a virtual structure to represent the input file. The files are decomposed into fragments based on the specification, and the fragments are the internal nodes of the virtual structure. Moreover, they define innovative mutation operators that work on the virtual structure and convert it back into a file after mutation. Gopinath et al. [57] infer the syntax of strongly formatted input using dynamic control flow analysis. The obtained grammars are well-structured and very readable.

With respect to stateful fuzzing, Chen et al. [36] set the forkserver point at the state switching point of the test program. A queue array of state test cases is maintained, and the corresponding test case packets are sent when testing different states. The test program automatically forks according to the received packet and calculates the related parameters, such as the code coverage. However, this method requires a very detailed analysis of the source code. TCP-Fuzz [39] uses a transition-guided fuzzing approach that exploits a novel coverage metric, referred to as branch transition, as program feedback.

6.3. Specification-Based Fuzzing
6.3.1. Protocol State Fuzzing

Cho [12] first applied automata inference to command and control protocols. Researchers then began to test various protocols as targets. Aiming at bridging the gap between active learning and real-world systems, Aarts implements TOMTE on the basis of LearnLib, which formally describes the abstract and concrete behavior of the mapper. Subsequently, Ruiter et al. [7] model the state machine of TLS protocol implementations based on active learning and manually analyze the generated state machine to find logical vulnerabilities. More protocols, such as TCP [9], SSH [13], OpenVPN [14, 15], QUIC [16], IPSec [17], and DTLS [18], are analyzed to compare the learned state machines with the protocol specification.
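Comparing a learned state machine against a reference model can be sketched as a breadth-first search over the product of two Mealy machines, which returns the shortest input sequence on which their outputs differ. This is a minimal sketch, not the procedure of the cited works. The dictionary encoding of a machine, the function name `first_difference`, and the two toy SMTP-like models with their reply codes are assumptions made for illustration.

```python
from collections import deque


def first_difference(machine_a, machine_b, inputs, start_a="s0", start_b="s0"):
    """Shortest input sequence whose final output differs, or None.

    Each machine maps (state, input symbol) to (output, next state) and must
    be complete, i.e. define every input symbol in every state.
    """
    queue = deque([(start_a, start_b, ())])
    seen = {(start_a, start_b)}
    while queue:
        state_a, state_b, path = queue.popleft()
        for symbol in inputs:
            out_a, next_a = machine_a[(state_a, symbol)]
            out_b, next_b = machine_b[(state_b, symbol)]
            if out_a != out_b:
                return path + (symbol,)
            if (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b, path + (symbol,)))
    return None


# Toy model that requires HELO before MAIL.
strict = {
    ("s0", "HELO"): ("250", "s1"),
    ("s0", "MAIL"): ("503", "s0"),
    ("s1", "HELO"): ("250", "s1"),
    ("s1", "MAIL"): ("250", "s1"),
}
# Toy model that accepts MAIL without a preceding HELO.
lenient = {
    ("s0", "HELO"): ("250", "s1"),
    ("s0", "MAIL"): ("250", "s1"),
    ("s1", "HELO"): ("250", "s1"),
    ("s1", "MAIL"): ("250", "s1"),
}
witness = first_difference(strict, lenient, ["HELO", "MAIL"])
```

Here `witness` is `("MAIL",)`: sending MAIL as the first command separates the two toy models. The returned sequence can be replayed against the real implementations to confirm the deviation.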

6.3.2. Differential Testing

The idea of differential testing is applied in different phases. For instance, Brubaker et al. [21, 22, 23] analyze the certificate verification algorithm in SSL/TLS implementations based on differential testing. HVLearn focuses on hostname validation [24], while TLS-diff pays more attention to the TLS handshake process [25]. TCP-Fuzz compares the outputs of multiple TCP stacks for the same inputs, using a differential checker to uncover semantic bugs [39].

7. Conclusion

This paper proposed a novel strategy for stateful protocol fuzzing, termed model-based grey-box fuzzing. A state machine is inferred based on active automata learning, and test cases are generated according to the state machine and seed pool. StateFuzzer is implemented on top of StateLearner, and the method is applied to fuzz implementations of three protocols (TLS, SMTP, and RTSP), namely OpenSSL, LibreSSL, Exim, and Live555. Compared with AFLNET and StateAFL, the proposed method achieves higher line and branch coverage in the same span of time, especially with the introduction of client-side fuzzing. In addition, differential testing is introduced to detect inconsistencies between implementations. Both the server-side and client-side programs can be analyzed based on the differential checking.

In future work, we aim at extending the test harnesses to other protocols, such as FTP and DTLS, in order to further verify the effectiveness of the proposed method and expose previously unknown bugs. For the TLS protocol, this paper only studied the code for TLS 1.2. An intensive study of TLS 1.3 and TLS 1.1 is therefore of interest. Finally, we also expect to support interaction with encrypted data.

Data Availability

The source code data and experimental results used to support the findings of this study have been deposited in the Git repository, https://gitee.com/z11panyan/state-fuzzer.git.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Pan Yan developed the theory and methodology, provided the software, and wrote the original draft. Zhu Yuefei supervised the study. Liang Jiao performed validation. Lin Wei performed investigation.

Acknowledgments

This work was supported by the National Key Research and Development Project of China (2019QY1300). The authors would like to express their gratitude to EditSprings (https://www.editsprings.cn) for the expert linguistic services provided.