Abstract

The coverage of test cases is an important indicator for the security and robustness testing of industrial control protocols, and completing a test with fewer test cases is an important research topic. Taking the Modbus protocol as an example, this paper proposes a weight-division-based method for calculating test case similarity and population dispersion. The method describes the similarity of test cases and the degree of dispersion of individuals in the population more accurately. A genetic algorithm is used to generate and optimize test cases, with individual similarity and population dispersion serving as its fitness functions. Experimental results show that the proposed method increases the population dispersion by 3.45% compared with the conventional method and effectively improves the coverage of test cases.

1. Introduction

Industrial control systems handle data collection, image and sound signal processing, information transmission, and process control throughout the production process. Their safety and reliability during operation determine the stability of the entire system. In recent years, with the rapid popularization of computer networks, traditional industrial control systems have been moving toward interconnection and intelligence, and new concepts such as the Internet of Things, the Industrial Internet of Things, and Industry 4.0 have been proposed. The Internet has injected new vitality into industrial control systems, but it has also brought new challenges [1–3].

In the security architecture of an industrial control system, the protocol is an important guarantee for the secure transmission of information. Attacks against the protocol are among the most common because of their low cost, and the rapid development of networks has made remote attacks possible [4, 5]. Because the protocol is the information transmission medium of the industrial control system, it is necessary to mine its possible vulnerabilities through automated testing to ensure security and stability.

At present, the commonly used vulnerability mining techniques include static analysis, dynamic analysis, binary comparison, and fuzz testing [6–12]. Fuzz testing has the advantages of high automation, low system consumption, a low false-alarm rate, and independence from the source code of the target program [7]. The key step in fuzz testing is test case generation. Traditional fuzz testing often blindly mutates part of the normal test cases; this blind mutation pushes the number of test cases to 100,000 or even millions, yet the test effect is not ideal. Therefore, the design and improvement of test case generation strategies is an active research topic in fuzz testing.

Test case generation algorithms for fuzzers can be divided into three categories: the generation-based method, the mutation-based method, and a combination of the two [13–17]. In current protocol testing, the coding method and the similarity determination of test cases are not entirely reasonable, which affects test coverage and needs to be improved. Therefore, we compare the advantages and disadvantages of the three methods, take into account the packet structure of the tested protocol, and propose a new method based on weight division to calculate the similarity of test cases and their average similarity. The goal is to generate test cases with better coverage and improve test efficiency. Compared with the existing literature, this paper makes the following major contributions:

(i) A new method to determine the similarity of test cases and the concept of population dispersion are proposed, which provides a new idea and method to improve test case coverage in the process of protocol testing.

(ii) Different weights and distance calculation methods are set for different protocol fields, so the similarity can be determined more accurately according to the function and data content of the test case. The change of coding method also solves the problem of inaccurate similarity judgment caused by data mutation.

(iii) A genetic algorithm is used to generate test cases, with the similarity and the population dispersion of the cases serving as its fitness function, so that test case generation is optimized automatically.

The rest of this paper is organized as follows: In Section 2, we discuss the related work. In Section 3, we provide an introduction to Modbus protocol test case design method. Section 4 is about the computing method for test cases average similarity and population dispersion. Section 5 contains simulations and results and evaluates the results based on the requirements, while Section 6 draws conclusions and reviews based on the results.

2. Related Work

2.1. Generation-Based Method

The generation-based method builds a mathematical model from the protocol specification of the test object and then generates test cases automatically. Martins et al. [18] describe a tool called ConData used for test generation for communication protocols specified as extended finite state machines. The strategy for test generation combines different specification-based test methods. Although the values for fields of interactions are automatically generated, human intervention is always needed to determine more suitable values for test case purposes. Banks et al. [19] present SNOOZE, a tool for building flexible, security-oriented network protocol fuzzers. SNOOZE implements a stateful fuzzing approach that can be used to effectively identify security flaws in network protocol implementations. However, SNOOZE is not evaluated using the code coverage metric. Li et al. [20] present an automatic vulnerability discovery method that combines automatic protocol reverse engineering and fuzz testing. The method is a four-step procedure involving packet clustering, multiple sequence alignment, special field recognition, and fuzzer production, which recovers the structure of network packets and then performs fuzz testing. However, the effectiveness of the method depends on the diversity of the sampled packets, so it is necessary to sample the network protocol multiple times and try to ensure that the protocol is used with different parameters each time. Voyiatzis et al. [21] present the design and implementation of MTF, a Modbus/TCP Fuzzer. MTF incorporates a reconnaissance phase in the testing procedure so as to assist in mapping the capabilities of the tested device and to adjust the attack vectors towards more guided and informed testing rather than plain random testing. The disadvantage is that the fuzzer has to be redesigned for different implementations of the Modbus protocol. Liu et al. [22] proposed a heuristic fuzz test case generation method for network protocols based on a heuristic search algorithm and the classification tree idea. Peach and FTP are selected as the verification platform and target protocol, respectively. The test results verified the feasibility and effectiveness of the method. However, the coverage of the test cases depends on the accuracy with which the network protocol classification tree is constructed. Felix et al. [23] introduced a novel fuzzer, Policy Generator (PG). PG utilizes a number of heuristic techniques to improve space coverage over existing fuzzers. The empirical study demonstrates that PG generates superior coverage compared to current generation techniques. However, many of the metrics correlate, and care needs to be taken when interpreting the presented data. In addition, while the experimental framework is believed to describe this evaluation accurately, the analysis cannot be safely generalized beyond the grammatical expression of the generic firewall policy utilized in that article. Liu et al. [24] propose a vulnerability mining method combining protocol reverse analysis and fuzz testing. An improved effective counting method based on a local greedy algorithm is proposed to improve the accuracy of protocol keyword extraction by 65%. Combining it with the lossy counting method to construct a protocol syntax tree reduces the number of spanning tree nodes by 40%. Although the method performs better than the traditional method, it still needs to be improved in terms of operating efficiency and applicability. For example, because it relies on NLP techniques, its performance decreases significantly when extracting keywords in the reverse analysis of pure binary protocols.

The main advantage of the generation-based method is that the same set of test cases can be used directly for the same test objectives, and the generated test cases have high coverage [25, 26]. Its main disadvantage is that understanding the file format or protocol specification and writing the rules take a lot of time and effort. Because different types of target software differ greatly, the rules are difficult to reuse and the scope of application is small [25, 26].

2.2. Mutation-Based Method

The mutation-based method generates a new generation of test cases by applying a mutation strategy to existing input samples. Gu et al. [27] propose a novel message matrix perturbing mode to generate test cases through data mutation for application layer protocols. Additionally, a new statistical keyword extracting technique with a priority recursive splitting pattern is introduced to provide useful information for intelligent data mutation. The work has several limitations. First, the static statistical analysis only finds a balance between extraction performance and computational complexity. Second, keywords with low occurrence frequency cannot be captured by the method. Last, the discrimination between different protocol elements is not explicit enough for intelligent fuzzing. A test case generation technique based on a mutation algorithm over precaptured IPC data is introduced in [28] to improve fuzz testing efficiency. Two high-risk vulnerabilities are detected in Android 5.1.0, and analysis of these vulnerabilities highlights a critical design issue in the system services of the Binder mechanism. The test case generation algorithm still needs to be improved by leveraging program analysis techniques. Lai et al. [29] proposed a vulnerability mining method for industrial control network protocols based on fuzz testing. Protocol feature values were generated from test case variation factors for industrial control network protocols, each of which represented a type of ICS vulnerability feature. Different test cases were generated from Modbus TCP features and variation factors. Through bypass monitoring and the relation between Modbus TCP requests and responses, the difficult problem of determining the validity of test cases was solved. However, the results of the feature learning method for private industrial control protocols are uncertain because they depend on the data set.
If the characteristics of a private protocol need to be analyzed in depth, some manual analysis is required. Cai et al. [30] give a fuzz security test method based on a grammatical model and propose a grammar model for industrial control protocols based on high-order attribute grammar. Building on the model, they propose a fuzz security test algorithm adapted to the characteristics of industrial control protocols and elaborate on the analysis tree structure, test case generation, and mutation strategy. Comparative experiments that simulate Modbus/TCP communication verify that anomalous results can still be found at a lower time cost when fewer test cases are generated. However, the accuracy of a description model that rests on a subjective understanding of the industrial control protocol affects test case coverage. Xu et al. [31] proposed the use of deep learning to assist test case generation. Exploiting the ability of recurrent neural networks to handle character text sequences, the approach learns structural features from sample data, predicts new data that conform to those features, and builds an automatic generation model that is combined with a random mutation algorithm. To make test case generation more targeted and more likely to trigger exceptions, an appropriate deep learning network should be studied to learn auxiliary weight knowledge such as the characteristics of vulnerable points and the oriented distribution of anomalies. A fuzzing test data generation method based on the dynamic construction of mutation strategies was proposed in [32]. The method uses instrumentation feedback to dynamically construct the control mutation strategy and the keyword mutation strategy and to guide the fuzzer to generate test data with high coverage. However, the test effect of this method is not ideal for target programs with large inputs.
The dynamic construction of mutations requires repeated exploration of the test data and the program structure. If the test data is large, the exploration time increases and the efficiency of test data generation drops. Lyu et al. [33] present a novel mutation scheduling scheme, MOPT, which enables mutation-based fuzzers to discover vulnerabilities more efficiently. MOPT utilizes a customized Particle Swarm Optimization (PSO) algorithm to find the optimal selection probability distribution of operators with respect to fuzzing effectiveness and provides a pacemaker fuzzing mode to accelerate the convergence of PSO. Yue et al. [34] present a knowledge-learning evolutionary fuzzer based on AFL, called LearnAFL. LearnAFL does not require any prior knowledge of the application or input format. Based on the authors' format generation theory, LearnAFL learns partial format knowledge of some paths by analyzing the test cases that exercise those paths. LearnAFL then uses this format information to mutate the seeds, which explores deeper paths efficiently and reduces the number of test cases exercising high-frequency paths compared to AFL.

The main advantage of the mutation-based method is that this method does not need to understand the structure and format of the current sample file, so it can be widely used [25, 26]. The main disadvantage of the mutation-based method is that it is highly dependent on the initial samples. Different initial samples will bring different code coverage, test depth, and test effect, so the efficiency is low [25, 26].

2.3. Combination of Two Methods

Hodován et al. [35] present Grammarinator, a general-purpose test generator tool that is able to utilize existing parser grammars as models. Since the model can act both as a parser and as a generator, the tool can provide the capabilities of both generation-based and mutation-based fuzzers: the same grammar that generates new tests can also be used to parse existing test suites and then create new content from their recombination or mutation. The tool is actively used to test various JavaScript engines and has proven its usefulness in hardening real-life projects by revealing more than 100 valid unique issues. Atlidakis et al. [36] introduced Pythia, the first fuzzer that augments grammar-based fuzzing with coverage-guided feedback and a learning-based mutation strategy for stateful REST API fuzzing. Pythia's mutation strategy helps generate grammatically valid test cases, and coverage-guided feedback helps prioritize the test cases that are more likely to find bugs.

A new test case generation method based on the advantages of the above methods is proposed in this paper. Firstly, the characteristics of general transmission message of industrial control protocol are analyzed, test cases are designed based on the construction of description model, and coding method of use cases is designed for genetic algorithm. Secondly, genetic algorithm is used to generate and optimize use cases, which realizes the automatic iteration and update of use case population. Finally, in order to improve test coverage and vulnerability discovery rate, the concept of dangerous point is proposed, and, based on this, a composite fitness function is designed to monitor and adjust the state of use case population.

3. Modbus Protocol Test Cases Design

3.1. Message Feature Analysis and Encoding

Choosing appropriate encoding method of use cases for protocol testing can reduce the time complexity of generating test cases and complete the conversion from encoding files to data packets faster. Figure 1 shows the data fields contained in the data packets of Modbus communication protocol and the byte length of each field [37].

In Modbus protocol packets, the transmission identifier and the protocol identifier are independent of the packets' content, so these two fields need not be considered when constructing test cases [38]. Each test case can therefore be expressed mathematically as in the following equation:

$$T = \{Len, Addr, Func, Data\} \tag{1}$$

where $Len$ is the length field, whose value matches the length of the data contained in the following three fields. $Addr$ is the address identifier, with a value range of 0 to 255. $Func$ is the function code, which is divided into public function codes and user-defined function codes in Modbus, with a value range of 1 to 127. $Data$ is the data field, whose content depends on the function code.
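As an illustration of this field layout, the following minimal Python sketch packs the four test-case fields into a frame. It assumes standard Modbus/TCP framing (2-byte transaction identifier, 2-byte protocol identifier, 2-byte length, 1-byte address identifier, 1-byte function code, all big-endian); the class name `ModbusTestCase` and its attribute names are hypothetical and chosen only for this example.

```python
import struct
from dataclasses import dataclass


@dataclass
class ModbusTestCase:
    """The fields kept for test-case construction (illustrative names)."""
    address: int        # address identifier, 0..255
    function_code: int  # function code, 1..127
    data: bytes         # data field; its layout depends on the function code

    @property
    def length(self) -> int:
        # The length field counts the bytes that follow it:
        # address identifier + function code + data.
        return 2 + len(self.data)

    def to_bytes(self, transaction_id: int = 0, protocol_id: int = 0) -> bytes:
        # Big-endian header: transaction id, protocol id, length (2 bytes each),
        # then the 1-byte address identifier and the 1-byte function code.
        header = struct.pack(">HHHBB", transaction_id, protocol_id,
                             self.length, self.address, self.function_code)
        return header + self.data


# Read Holding Registers (function code 3): start address 0, quantity 2.
case = ModbusTestCase(address=1, function_code=3, data=bytes([0, 0, 0, 2]))
packet = case.to_bytes(transaction_id=1)
print(packet.hex())  # 000100000006010300000002
```

Deriving the length from the other fields keeps generated frames well-formed by default; a fuzzer may of course also want to send deliberately inconsistent length values.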

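When encoding test cases, binary encoding is the most common encoding method, and the Hamming distance can be used to measure the similarity between two test cases, as shown in the following equation:

$$H(X, Y) = \sum_{i=1}^{n} x_i \oplus y_i \tag{2}$$

where $x_i$ and $y_i$ denote the $i$-th characters of the strings $X$ and $Y$, and $n$ is their common length; $x_i \oplus y_i$ tests whether $x_i$ and $y_i$ are the same: when they are the same, $x_i \oplus y_i = 0$; when they are not, $x_i \oplus y_i = 1$.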

However, when the Hamming distance is used to compare the similarity of two test cases, the Hamming cliff problem may occur [39]. Therefore, Gray code is used in this paper, which effectively avoids the Hamming cliff problem and describes the similarity of protocol packets more accurately. Assuming a binary code $B = b_{n-1} b_{n-2} \cdots b_0$ whose corresponding Gray code is $G = g_{n-1} g_{n-2} \cdots g_0$, the two codes satisfy the following equation:

$$g_{n-1} = b_{n-1}, \qquad g_i = b_{i+1} \oplus b_i \quad (0 \le i \le n-2) \tag{3}$$

where $b_i$ and $g_i$ are the $i$-th bits of the binary code and the Gray code, and $\oplus$ is the XOR operation.
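The Hamming cliff can be seen on a small example. The sketch below is illustrative only (the helper names `to_gray` and `hamming` are not from the paper): it converts integers to binary-reflected Gray code and counts differing bits for the adjacent values 7 and 8.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code: each Gray bit is b_(i+1) XOR b_i."""
    return n ^ (n >> 1)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


binary_distance = hamming(7, 8)                  # 0111 vs 1000
gray_distance = hamming(to_gray(7), to_gray(8))  # 0100 vs 1100
print(binary_distance, gray_distance)  # 4 1
```

In plain binary the two neighbouring values differ in all four bits, while their Gray codes differ in exactly one.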

Figure 2 shows the effect of clustering on the same set of data when calculating distance using two different encoding methods. It can be seen from the figure that some data may not be able to find the cluster center accurately when using binary code (Figure 2, left) to calculate the distance, while Gray code (Figure 2, right) can effectively avoid this problem.

In summary, Gray code avoids the Hamming cliff problem in binary coding, so the similarity between two Gray-coded strings can be described by the number of different bits, namely, Hamming distance.

3.2. Method for Calculating Similarity of Test Cases with Weights

In the Modbus protocol, the length of each field of the message sequence is basically fixed, but the length of the data field is dynamic, and the fields differ in function and in their impact on the security of the message. Some fields are also related to each other. Using the Hamming distance directly as the similarity measure between the encoded strings of two test cases is therefore not entirely reasonable.

In order to solve these problems, a weight distance calculation method based on internal classification is proposed in this paper. The weight of different fields is set in different value, and the distances of different fields are calculated according to corresponding functions. The data segment is special, because it is related to other fields, and a unique design is required to calculate the relevant distance. The weight coefficient of each field is determined by Analytic Hierarchy Process (AHP).

Given two test cases $A$ and $B$, the distance between each pair of corresponding functional fields is calculated first, and the field distances are then combined with the field weights to obtain the overall similarity. The final calculation formula is shown in the following equation:

$$Sim(A, B) = \sum_{i=1}^{n} w_i \, d(A_i, B_i) \tag{4}$$

where $n$ is the number of fields and $w_i$ is the weight of field $i$. $A_i$ and $B_i$ are the corresponding fields of the two test cases, and $d(A_i, B_i)$ is the distance between the two corresponding fields. The distance calculation method differs slightly from field to field.

The pairwise comparison matrix determined by Analytic Hierarchy Process is shown in the following equation:

The consistency of the pairwise comparison matrix was checked. If the consistency ratio satisfies $CR < 0.1$, the consistency check is passed. The calculated weight of each field is shown in the following equation:
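For readers unfamiliar with AHP, the following sketch shows how weights and the consistency ratio are typically derived from a pairwise comparison matrix. It is not the paper's computation: the matrix is a hypothetical, perfectly consistent one built from made-up priorities, the function name `ahp_weights` is an assumption, and the random index values are the commonly tabulated ones from Saaty.

```python
def ahp_weights(matrix, iterations=100):
    """Return (weights, consistency_ratio) for a pairwise comparison matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        # Power iteration: multiply by the matrix, then renormalize to sum 1.
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        w = [x / total for x in v]
    # Estimate the principal eigenvalue from the ratios (M w)_i / w_i.
    mw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(mw[i] / w[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    random_index = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # excerpt of Saaty's RI table
    return w, consistency_index / random_index


# Hypothetical priorities for (length, address, function code, data).
priorities = [0.1, 0.2, 0.3, 0.4]
matrix = [[p_i / p_j for p_j in priorities] for p_i in priorities]
weights, cr = ahp_weights(matrix)
print([round(x, 3) for x in weights], cr < 0.1)
```

With a real, slightly inconsistent matrix the same function returns a positive consistency ratio, which is then compared against the 0.1 threshold.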

According to the characteristics of the Modbus test cases, the length of length field, address identifier field, and function code field are fixed, while the length of the data field is dynamically variable and is associated with other fields. Therefore, when calculating the distance between the corresponding fields of the two use cases, two different methods are used to calculate the distance of the fixed-length and variable-length fields. For fixed-length fields, the Hamming distance can be directly used.

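The length of the data field is dynamically variable, so the Hamming distance shows a large deviation when describing the distance between data fields; the Levenshtein distance solves this problem. The Levenshtein distance is the minimum number of edit operations required to convert string $A$ into string $B$, and it describes the difference between two strings of different lengths more accurately. The calculation method is shown in the following equation:

$$lev_{A,B}(i, j) = \begin{cases} \max(i, j), & \min(i, j) = 0 \\ \min \begin{cases} lev_{A,B}(i-1, j) + 1 \\ lev_{A,B}(i, j-1) + 1 \\ lev_{A,B}(i-1, j-1) + 1_{(A_i \neq B_j)} \end{cases}, & \text{otherwise} \end{cases} \tag{7}$$

where $i$ and $j$ are the subscripts into string $A$ and string $B$, $\max$ is the maximum value, $\min$ is the minimum value, and $1_{(A_i \neq B_j)}$ equals 1 when the characters differ and 0 otherwise.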
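A compact way to evaluate this recurrence is row-by-row dynamic programming. The sketch below is a generic implementation (the function name `levenshtein` is chosen for this example) that works on strings as well as on `bytes`, which is how a variable-length data field might be compared.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                    # 3
print(levenshtein(bytes([0, 0, 0, 2]), bytes([0, 0, 2])))  # 1
```

Only two rows of the table are kept at a time, so memory use is linear in the length of the second sequence.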

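Therefore, the similarity calculation equation (4) of the two test cases can be further optimized into the following equation:

$$Sim(A, B) = \sum_{i=1}^{3} w_i \, H(A_i, B_i) + w_4 \, Lev(A_4, B_4) \tag{8}$$

where fields 1 to 3 are the fixed-length length, address identifier, and function code fields, $H$ is the Hamming distance, field 4 is the data field, and $Lev(A_4, B_4)$ is the Levenshtein distance between the two data fields of $A$ and $B$.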
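Putting the pieces together, a weighted similarity of this kind might be sketched as follows. Several points are assumptions made for the example rather than details taken from the paper: the weights are placeholders (the AHP-derived values are not reproduced here), the fixed-length fields are Gray-coded before the Hamming distance is taken, and the Levenshtein distance is computed over the bytes of the data field.

```python
from dataclasses import dataclass


@dataclass
class Case:
    length: int
    address: int
    function_code: int
    data: bytes


def gray(n: int) -> int:
    return n ^ (n >> 1)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def levenshtein(a: bytes, b: bytes) -> int:
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


# Placeholder weights, NOT the values derived in the paper.
WEIGHTS = {"length": 0.1, "address": 0.2, "function_code": 0.4, "data": 0.3}


def similarity(a: Case, b: Case, weights=WEIGHTS) -> float:
    """Weighted distance: Hamming on fixed-length fields, Levenshtein on data."""
    fixed = sum(
        weights[name] * hamming(gray(getattr(a, name)), gray(getattr(b, name)))
        for name in ("length", "address", "function_code")
    )
    return fixed + weights["data"] * levenshtein(a.data, b.data)


a = Case(6, 1, 3, bytes([0, 0, 0, 2]))
b = Case(6, 1, 4, bytes([0, 0, 0, 3]))
print(similarity(a, b))
```

Here the function codes 3 and 4 differ in one Gray-coded bit and the data fields differ by one substitution, so the result is the function-code weight plus the data weight.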

4. Average Similarity and Population Dispersion of Test Cases

In the test case generation process, the iteration is based on the population, so it is necessary to describe first-generation population from the perspective of the whole population. Here, the average similarity of population test cases is designed to describe the population state. The average similarity of test cases refers to the overall degree of dispersion among individuals in a population. When the average similarity of test cases is low, it means that the overall similarity of individuals within the population is too high, and the coverage of test cases is low [40]. At this time, the parameter information in the test cases generation process, such as the mutation probability and the similarity threshold, can be appropriately changed to adjust the distribution of the generated test cases and improve the coverage of the test cases.

When describing the average similarity of test cases of individuals in the entire population, it can be described by the average distance between individuals. This method is feasible to some extent, but each individual needs to calculate the distance between itself and all other individuals. As a result, this method has a lot of repeated calculation and low efficiency. In addition, if an extremely uniform edge distribution occurs, it will also lead to misjudgment. Therefore, the concept of average similarity of test cases is proposed in this paper, and a new calculation method is designed to accurately reflect the distribution of individuals in the population and reduce the amount of calculation.

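Firstly, the values of the individual fields in the population are normalized, which is expressed mathematically in the following equation:

$$f' = \frac{f - f_{\min}}{f_{\max} - f_{\min}} \tag{9}$$

where $f$ is the value of a field, $f_{\max}$ is the maximum value of the field in the population, and $f_{\min}$ is the minimum value of the field in the population.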

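The sum of each field is averaged to calculate the mean center test case, as shown in equation (10), and the calculation method of each field is as in equation (11):

$$\bar{T} = (\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_m) \tag{10}$$

$$\bar{f}_k = \frac{1}{N} \sum_{j=1}^{N} f'_{j,k} \tag{11}$$

where $N$ is the total number of test cases in the population, $k$ is the current field of the test cases, $m$ is the number of fields, and $f'_{j,k}$ is the normalized value of field $k$ in test case $j$.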
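The normalization and mean-center steps can be sketched in a few lines. The example below is a simplified illustration that treats every field as a single number (the variable-length data field would need its own handling), and the function names `normalize` and `mean_center` are assumptions for this sketch.

```python
def normalize(values):
    """Min-max normalize one field across the population to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant field: avoid division by zero
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_center(population):
    """Return (normalized population, mean center test case).

    population is a list of equal-length numeric field vectors.
    """
    n = len(population)
    columns = [normalize(col) for col in zip(*population)]
    center = [sum(col) / n for col in columns]
    normalized = [list(row) for row in zip(*columns)]
    return normalized, center


# Three toy cases with fields (length, address, function code).
population = [[6, 1, 3], [8, 1, 16], [10, 2, 3]]
normalized, center = mean_center(population)
print(center)
```

Each case is then compared against the single center vector, which needs one pass over the population instead of a comparison of every pair of individuals.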

The similarity between the test cases and the central test case can be used to indicate the outlier degree of the test cases, as shown in the following equation:

The time complexity of calculating the average similarity of the test cases is $O(N)$; compared with the $O(N^2)$ time complexity of the general pairwise method, the efficiency improvement is significant when $N$ is large. The dispersion of the whole population can then be described by the following equation:

5. Experimental Evaluation

By designing the encoding method and the similarity calculation method between test cases, combined with the description of the average similarity of test cases in the test cases population, theoretically, it can effectively improve the efficiency of test cases generation and increase the coverage of test cases. In order to verify the correctness of the proposed method, a set of comparative experiments are designed, and genetic algorithm is used as the core algorithm for test case generation. The encoding method, individual similarity, and average similarity of test cases are calculated by the proposed method and the conventional method, respectively, and the test cases generated by the two methods are compared and analyzed.

A genetic algorithm is an intelligent optimization algorithm that is often used to find a global optimum, and the direction of population optimization is adjusted by designing a corresponding fitness function. In the test case generation method designed in this paper, the population converges toward suspicious cases in the historical data. Suspicious test cases are cases that cause anomalies in the test target during the test process; they are called "suspicious points." Taking these cases as the convergence center of the next round of the genetic algorithm can effectively reduce the randomness of test case generation. Based on this, the fitness function of the genetic algorithm designed for the two sets of experiments is shown in the following equation. It is built from two quantities: the similarity between the test case and the suspicious point, calculated with the weighted method of Section 3.2, and the average similarity of the test cases.

The fitness function means that when there are suspicious points in the population, the population converges toward the suspicious cases; when there is no suspicious point, a population with a higher average similarity of test cases is preferred. The other parameters of the genetic algorithm are the mutation probability and the crossover probability.
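The two-branch behaviour described above can be illustrated with a small sketch. The exact form of the paper's fitness function is not reproduced here; the function below is only one plausible reading, in which the distance to the nearest suspicious point is negated so that closer cases score higher, and the function and parameter names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fitness(case: T,
            suspicious_points: Sequence[T],
            distance: Callable[[T, T], float],
            average_similarity: float) -> float:
    """Illustrative composite fitness (an assumption, not the paper's formula)."""
    if suspicious_points:
        # Converge toward suspicious cases: a smaller distance gives a higher score.
        return -min(distance(case, point) for point in suspicious_points)
    # No suspicious point yet: prefer populations with a higher average similarity.
    return average_similarity


def toy_distance(x: int, y: int) -> float:
    return abs(x - y)


print(fitness(5, [2, 9], toy_distance, average_similarity=0.3))  # -3
print(fitness(5, [], toy_distance, average_similarity=0.3))      # 0.3
```

In a full implementation the `distance` argument would be the weighted similarity between Modbus test cases rather than the toy integer distance used here.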

The whole experimental procedure designed is shown in Figure 3. Firstly, the initial test case population for the two experiments is constructed manually, and the initial population is encoded according to the encoding method mentioned above. Secondly, the initial population is input into the test case generation module, and two different fitness function calculation methods are used to generate and optimize the test cases. Finally, the result monitoring module records the operation results.

The script development language of the experiment is Python 3, and Modbus communication simulation software used in the test is Modbus Poll and Modbus Slave. Firstly, Modbus Poll is used to establish data communication with Modbus Slave, Wireshark packet capture tool is used to obtain normal communication messages, and representative data messages are selected to analyze the data characteristics and construct the initial population. Secondly, the initial population is sent to the test cases generation and optimization module to iterate, optimize, and update test cases. Finally, each generation of population is sent to the target for testing. Statistical analysis was performed on the test cases data generated by the two methods. During the experiment, the average similarity of the first 5000 generations of population test cases was calculated. The results are shown in Figure 4.

In the two groups of experiments, the dispersion gradually increased during population iteration and eventually stabilized. Because the same initial population was used, the dispersions of the two groups were the same at the beginning of the experiment. Once both had stabilized, however, the dispersion of the population produced by the improved method was 3.45% higher than that of the conventional method. It is generally believed that the higher the dispersion between individuals within a population, the higher the coverage of test cases [21]. Therefore, the coverage of the test cases generated by the improved method can be considered higher than that of the conventional method, which indicates that the method proposed in this article has certain advantages over the conventional method. Based on the proposed method, we designed a fuzz tester [41].

6. Conclusion

A new test cases similarity determination method and the concept of population dispersion are proposed in this paper, which provides a new idea and method for improving the test cases coverage in the protocol testing process. In the determination of test cases similarity, different weights and distance calculation methods are set according to different protocol fields, which can more accurately determine the similarity according to the function of the test cases and data content, and the change of the encoding method effectively resolves the problem of inaccurate similarity determination caused by data mutation. The genetic algorithm is introduced into the test cases generation algorithm, and the test cases similarity and population dispersion are used as the basis for constructing the fitness function of the genetic algorithm, and the automatic optimization of the test cases generation is realized. The test cases data generated in the experiment shows the effectiveness of the method. Our planned future work is twofold. First, we plan to improve the applicability of the method and apply it to the generation of test cases for other protocols. Second, we plan to optimize the time complexity of the algorithm.

Data Availability

The data used to support the findings of this study have not been made available because the generated test cases were not backed up in time.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by “National Key R&D Program of China” (2018YFB2004200), the open project of Zhejiang Lab “Construction Technology of Local High Security Trusted Execution Environment for Edge Intelligent Controller” (2021KF0AB06), and the National Natural Science Foundation of China “Research on anomaly detection and security awareness method for industrial communication behaviours” (61773368).