Abstract

In hardware Trojan detection, destructive reverse engineering and bypass detection are both important methods. This paper proposes an evolutionary algorithm called Ordered Mixed Feature GEP (OMF-GEP), which attempts to restore the circuit structure using only bypass information. The algorithm was developed from basic GEP through three sets of experiments at successive stages. To solve the problem, the paper extends GEP with mixed features, ordered genes, and superchromosomes. The experimental results show that the algorithm is effective.

1. Introduction

At all stages of the life cycle of integrated circuits (IC), hardware is exposed to security vulnerabilities under the global business model of the semiconductor supply chain. Among current hardware Trojan detection technologies, destructive reverse engineering [1-3] is effective but costly, whereas bypass detection [4-9] is low in cost and is currently a key direction of development.

This paper proposes an evolutionary algorithm called Ordered Mixed Feature GEP (OMF-GEP). The algorithm treats a single circuit component as a node and combines the node's logical and physical features into a mixed feature. It then uses the function regression ability of GEP to find the original circuit.

2. Mixed Features

The logic value of a circuit, or any kind of bypass information such as voltage and current, can be considered a manifestation of one characteristic of the circuit. When a circuit is described by only one such characteristic and no other constraints are added, many different circuits will satisfy that single feature. Two such circuits are shown in Figure 1.

If you look only at the logical values, the two circuits are completely equivalent, and both of which are

Note also that the circuit in Figure 1(b) does not use inputs A and B at all; that is, inputs A and B in the first circuit do not actually affect the output.

This example concerns only the logic value, but detection based on bypass information faces a similar situation: the detection results of any single kind of bypass information are consistent with multiple different circuit structures. Describing a circuit only by logical values, or only by one kind of bypass information, therefore leaves too many isomorphic candidates to confirm the circuit structure.

The essence of hardware Trojan design is to add extra circuitry to a normal circuit while keeping the behavior of the whole circuit on some features (most commonly the logic value) identical to that of the normal circuit, so that the Trojan stays hidden. However, the additional circuitry will inevitably change other circuit features.

In this paper, the logic value of the circuit or any kind of bypass information such as voltage and current is called a feature. The isomorphism of a feature arises because the value observed for that feature is the superposition of the contributions of the individual circuit elements. Multiple different circuit structures can produce similar or even identical values of that feature, so a circuit cannot be identified from the result of a single feature.

When multiple features are detected at the same time, each feature still yields multiple isomorphic circuits, but the superposed values of different features cannot be exactly the same. The sets of isomorphic circuits obtained from different features therefore differ, and the part they have in common is a possible real circuit.
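The filtering effect of several features can be sketched in a few lines. The three candidate circuits below are hypothetical and are not the circuits of Figure 1, and the gate count is only an assumed stand-in for a physical feature such as supply current. The sketch shows that the logic feature alone leaves two candidates, while adding a second feature leaves one.

```python
from itertools import product

# Hypothetical candidates: name -> (logic function, gate count).
# Gate count is an assumed proxy for a physical feature such as current.
CANDIDATES = {
    "c_only":        (lambda a, b, c: c,               0),
    "c_or_c_and_ab": (lambda a, b, c: c | (c & a & b), 3),
    "c_or_a_and_b":  (lambda a, b, c: c | (a & b),     2),
}

def truth_table(fn):
    """Logic outputs for all 8 input combinations."""
    return tuple(fn(a, b, c) for a, b, c in product((0, 1), repeat=3))

def consistent(observed_logic=None, observed_gates=None):
    """Names of candidates that match every feature that was observed."""
    names = set()
    for name, (fn, gates) in CANDIDATES.items():
        if observed_logic is not None and truth_table(fn) != observed_logic:
            continue
        if observed_gates is not None and gates != observed_gates:
            continue
        names.add(name)
    return names

target_fn, target_gates = CANDIDATES["c_or_c_and_ab"]
by_logic = consistent(observed_logic=truth_table(target_fn))
by_both = consistent(truth_table(target_fn), target_gates)
```

Here `by_logic` contains both `c_only` and `c_or_c_and_ab`, because the extra gates never change the output, whereas `by_both` contains only the true circuit.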

Figure 2 illustrates the application of the algorithm to a diode-based And gate circuit.

Its logical meaning is

$Y = A \wedge B$

It also has several physical meanings. Its voltage can be described as

$V_Y = \min(V_A, V_B) + V_D$

Here, $V_Y$ is the voltage at the output position $Y$; $V_A$ and $V_B$ are the voltages at the two input locations A and B; and $V_D$ is the diode conduction voltage, which is typically 0.7 V for silicon diodes.
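The two meanings of the gate can be sketched as two small functions. This is an idealized model under stated assumptions: the output is taken to sit one diode drop above the lower input, the cap at the supply voltage and the value `V_CC = 5.0` are assumptions added here, and all function names are hypothetical.

```python
V_D = 0.7   # diode conduction voltage for silicon, in volts
V_CC = 5.0  # assumed supply voltage, in volts (not given in the text)

def and_logic(a, b):
    """Logical feature of the And gate."""
    return a & b

def and_voltage(v_a, v_b, v_d=V_D, v_cc=V_CC):
    """Voltage feature: lowest input plus one diode drop, capped at the supply."""
    return min(min(v_a, v_b) + v_d, v_cc)
```

With one input at 0 V the output is one diode drop above ground, which still reads as logic 0; with both inputs high the output is limited by the assumed supply.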

Then, the And gate can be expressed as

Then, a measurement of a single-gate circuit with k features, n inputs, and a single output can be expressed as Expression 1:

Among them, is the output values for the adoption of the kth feature. is the ith input value for the adoption of the feature j.

Without losing generality, let us define

Then, Expression 1 can be expressed as Expression 2.

3. Algorithm 1: Single-Output Circuit

3.1. GEP Representation

In recent years, many studies have combined evolutionary algorithms with multisource data, such as adaptive weighted multisource sensor data fusion [10], research on an evolutionary algorithm for symbolic networks [11], the application of the genetic algorithm to multiobjective multicast routing [12], and multiplicity problems in genetic association studies [13]. Zhi and Liu [14] proposed a new GA for mechanical design optimization problems. These studies inspired us to apply evolutionary algorithms to hardware Trojan detection.

Gene expression programming (GEP) [15] is an evolutionary computing algorithm that has performed well in the study of evolvable hardware [16-22]. It handles tree-structured problems very well. A multi-input/single-output tree-structured circuit can be described as a tree with n leaf nodes, which GEP can represent directly. As shown in Figure 3, the 6-input/1-output logic circuit is easily represented as a tree structure in which each logic gate function is replaced by its logic symbol; the effective gene is obtained by reading this tree level by level.
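As an illustration of how a GEP gene maps to a tree, the following minimal sketch decodes a gene breadth-first, as in standard GEP. The symbol set (`A`, `O`, `N` for And, Or, Not, with lowercase letters as inputs) and the example gene are hypothetical and are not the gene of Figure 3.

```python
from collections import deque

ARITY = {"A": 2, "O": 2, "N": 1}  # assumed symbols: And, Or, Not

def decode(gene):
    """Decode a gene breadth-first into nested [symbol, children] lists."""
    symbols = iter(gene)
    root = [next(symbols), []]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Terminals (inputs) have arity 0 and receive no children.
        for _ in range(ARITY.get(node[0], 0)):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root

def evaluate(node, inputs):
    """Evaluate the decoded tree on a dict of input logic values."""
    sym, children = node
    if sym not in ARITY:
        return inputs[sym]
    vals = [evaluate(c, inputs) for c in children]
    if sym == "A":
        return vals[0] & vals[1]
    if sym == "O":
        return vals[0] | vals[1]
    return 1 - vals[0]  # Not

# "OANabcdd" decodes to (a And b) Or (Not c); the trailing "dd" is unused.
tree = decode("OANabcdd")
```

Symbols left over after the tree is complete form the non-coding tail of the gene, which is why the effective gene can be shorter than the stored gene.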

3.2. Algorithm of Mixed Feature GEP

One operator in GEP represents only one kind of calculation, and a GEP individual can represent only one test item. The idea of Algorithm 1 is therefore to merge the multiple test results of a basic circuit into a single compound function expression; that is, one function represents multiple calculations, and evolution searches for a representation close to the original circuit. Specifically, the multiple detection values are included in one function whose input is multiple values and whose output is a vector. The aforementioned gate circuit, for example, is still represented as “And” in the GEP expression tree, but its meaning becomes the following vector calculation:

The is the input value of a detection, and is the result of this detection to the input value.

For example, for this gate circuit, the symbol And means the following:

Among them, represent the logical value (1 or 0) of voltage input of the A or B point, represent the input voltage of the A or B point, and represent the input current of the A or B point.

Thus, when using GEP evolution, a symbolic value can simultaneously represent multiple unrelated items. This algorithm will be called Mixed Feature GEP (MF-GEP).
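A minimal sketch of such a mixed-feature symbol is given below. It carries only two features (logic and voltage) to stay short; a current feature would be added as one more field in the same way. The `Signal` type, the function name `mf_and`, and the idealized diode voltage model are assumptions made for illustration.

```python
from dataclasses import dataclass

V_D = 0.7  # assumed diode conduction voltage, in volts

@dataclass(frozen=True)
class Signal:
    logic: int      # feature 1: logical value
    voltage: float  # feature 2: voltage in volts

def mf_and(x, y):
    """One 'And' symbol that computes every feature at once."""
    return Signal(
        logic=x.logic & y.logic,
        voltage=min(x.voltage, y.voltage) + V_D,
    )

# The same symbol can be nested, exactly as in an expression tree.
out = mf_and(Signal(1, 5.0), Signal(0, 0.0))
nested = mf_and(out, Signal(1, 5.0))
```

Because each node returns the whole feature vector, a single expression tree is scored against the logic data and the voltage data in one evaluation.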

3.3. Experiment Setup

The experiments are limited to simple logic gate circuits; they do not involve flip-flops, clocks, etc., and do not consider timing effects.

Four groups of experiments were designed.

Output ,

The number of features are .

Three features are used: feature 1 is the logical value, feature 2 is the voltage value, and feature 3 is the current value.

Because the groups are compared with each other, all of them use identical parameters, as Table 1 shows.

3.4. Design of Fitness Function

The feature data are logic data, voltage data, and current data, each of which has its own fitness.

Logical data fitness is given by Expression 3.

N is the number of test data, is the logic value calculated according to the test data after decoding, and is the output logic value of the test data. Because it is a logical value, the worst case is that each output decoded by the individual is opposite to the test value, that is, , so is among the range of .

Voltage data fitness is given by Expression 4.

N is the number of test data, is the voltage value calculated according to the test data after decoding, and is the output voltage value of the test data. At worst, each test output value is either the highest level or the lowest level, and each output decoded by the individual is opposite to the test value ; therefore, is among the range of .

Current data fitness is

$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$

which is Expression 5.

Among them,

$\mathrm{SSE} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \quad \mathrm{SST} = \sum_{i=1}^{N} (y_i - \bar{y})^2$

where $y_i$ is the data observation, $\hat{y}_i$ is the estimate of $y_i$ calculated from the decoded expression, and $\bar{y}$ is the average value of the variable $y$. That is, SSE is the sum of squared errors and SST is the total sum of squares. $R^2$ is the square of the multiple correlation coefficient in statistics.
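The current fitness built from SSE and SST can be sketched as follows. The function name and the handling of constant observations are choices made for this sketch.

```python
def r_squared(observed, estimated):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(observed) / len(observed)
    sse = sum((y - e) ** 2 for y, e in zip(observed, estimated))
    sst = sum((y - mean) ** 2 for y in observed)
    if sst == 0:
        # All observations are equal, so the ratio is undefined.
        raise ValueError("SST is zero; R^2 is undefined")
    return 1.0 - sse / sst
```

A perfect estimate gives 1, and an estimate that is no better than always predicting the mean gives 0.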

According to the previous algorithm description, the individual fitness should be a combination of the three, as given by Expression 6 (the individual fitness expression).

are the weight of three features in the final fitness.
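One plausible way to combine per-feature fitness values is a weighted sum, sketched below. Both forms are assumptions: the logic fitness is taken here as the fraction of matching outputs and the combination as a weighted sum, which may differ from Expressions 3 and 6. The weights in the usage lines are also hypothetical.

```python
def logic_fitness(decoded, expected):
    """Assumed form: fraction of test vectors whose logic output matches."""
    matches = sum(1 for d, e in zip(decoded, expected) if d == e)
    return matches / len(expected)

def combined_fitness(scores, weights):
    """Assumed form: weighted sum of the per-feature fitness values."""
    if len(scores) != len(weights):
        raise ValueError("one weight is required per feature")
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical weights that favor the logic feature.
f_logic = logic_fitness([1, 0, 1, 1], [1, 0, 0, 1])
total = combined_fitness([f_logic, 0.5, 0.0], [0.6, 0.2, 0.2])
```

Giving the logic feature the largest weight reflects the requirement that a candidate circuit be logically correct before its voltage or current behavior is meaningful.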

3.5. Experiment

Figure 4 shows a circuit. Its Boolean expression is

Its calculation is

The input value can be seen as the Trojan trigger conditions. When there are more pins, only part of the value can be tested, and you may miss the input . In the following experiment, the input data will not provide the value to trigger the Trojan, and the output of the input value will be determined by the evolved circuit.

Its effective gene is

We designed 4 groups of experiments using different combinations of features. A circuit must first have the correct logic value, and the voltage and current values are meaningful only when the logic value is correct; the logic value is therefore included in every group, and when the individual fitness combines several kinds of data, the logic value is given the larger weight. Table 2 shows the results of the experiments.

The experimental results show the following:
(1) A single feature alone cannot find the Trojan circuit.
(2) Using multiple features can effectively discover Trojan circuits.
(3) Features that are directly correlated with each other do not improve the probability of discovering the Trojan: in experiment 2, the logic value and the voltage are used together and the Trojan cannot be found; in experiment 4, although three features are used, the probability of finding the Trojan is not higher than in experiment 3. The reason is that in digital circuits the logic value is itself expressed by the voltage; for example, a voltage below 3 V is considered 0 and a voltage above 3 V is considered 1. The logic value therefore carries no information beyond the voltage value.

4. Algorithm 2: Multioutput Circuit

4.1. GEP Representation

An n-input/m-output circuit can be described as a forest of m trees, each with 1 to n leaf nodes. The 6-input/2-output circuit in Figure 5 can be decomposed into two tree-structured multi-input/single-output circuits.

The circuit can be divided into two independent multi-input/single-output circuits, as shown in Figure 6.

The corresponding effective genes are

The combination of the two genes represents a 6-input/2-output circuit.

4.2. Algorithm of Ordered Mixed Feature GEP

GEP must be modified as follows to represent this kind of circuit.

4.2.1. Remove the Link Function and Number the Gene

GEP has its own multigene data structure. In formula mining, the basic idea of GEP is polynomial approximation: each independent gene evolves one part of the final polynomial, and a connection function (usually “+”) then joins the parts into a complete polynomial. Of course, if the test data are error-free and enough computation is spent, the final GEP expression need not be an approximation; it can be the very expression from which the test data were generated.

An operator such as “+” is commutative, so the order of its operands does not matter. When such an operator is used as the connection function, the genes in a GEP chromosome have no sequential difference.

GEP also solves problems that have the form of a function; that is, it handles multi-input/single-output problems. Combinational circuits, however, are often multi-input/multi-output problems, which GEP cannot handle with its original algorithm.

To handle the multioutput case of combinational circuits, the GEP data structure is changed as follows:
(1) The connection function that links the multiple genes of a GEP chromosome is removed, so that each gene represents one output and there is no association between genes; a chromosome with k genes then represents a circuit with k outputs.
(2) Genes are numbered by their position in the chromosome, and the position of a gene identifies the output pin it represents. The input values in the GEP are the test values of the input pins, and the decoding result of a gene represents the circuit structure of its output pin. Each gene within a chromosome evolves independently.
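The two changes can be sketched as an ordered chromosome whose genes are evaluated separately, with gene i producing output pin i. The symbol set, the example genes, and the function names are hypothetical; genes are assumed to be valid and are read breadth-first as in standard GEP.

```python
ARITY = {"A": 2, "O": 2, "N": 1}  # assumed symbols: And, Or, Not

def eval_gene(gene, inputs):
    """Evaluate one gene (breadth-first layout) on a dict of input values."""
    # Assign child positions level by level until the tree is complete.
    children, nxt = [], 1
    for i, sym in enumerate(gene):
        if i >= nxt:
            break
        n = ARITY.get(sym, 0)
        children.append(list(range(nxt, nxt + n)))
        nxt += n
    # Children always sit after their parent, so evaluate back to front.
    values = [None] * len(children)
    for i in reversed(range(len(children))):
        sym = gene[i]
        args = [values[j] for j in children[i]]
        if sym == "A":
            values[i] = args[0] & args[1]
        elif sym == "O":
            values[i] = args[0] | args[1]
        elif sym == "N":
            values[i] = 1 - args[0]
        else:
            values[i] = inputs[sym]
    return values[0]

def eval_chromosome(chromosome, inputs):
    """Gene i yields output pin i; no linking function joins the genes."""
    return tuple(eval_gene(g, inputs) for g in chromosome)

# Pin 0 = a And b, pin 1 = (Not b) Or a.
chromosome = ["Aabcc", "ONabb"]
```

Swapping the two genes would swap the two output pins, which is why the order of the genes carries meaning in this representation.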

4.2.2. Record the Fitness of Each Gene

In GEP, a fitness is assigned to each individual and is the basis for all evolutionary operations. It measures how closely the individual as a whole approximates the target and is calculated from the expression tree decoded from the individual.

In this paper, because the genes of an individual are independent of each other, the individual decodes not into a single expression tree but into an expression forest whose trees are ordered. The fitness of an individual depends on each of its genes. A fitness is therefore also assigned to each gene; it represents how similar the gene is to the circuit structure of the corresponding pin. The fitness values of all genes are combined into the individual’s fitness, which indicates how closely the individual approximates the whole circuit structure. In short, this paper sets a fitness for each gene as well as for each individual.

4.3. Materials and Methods

The relationship between chromosome fitness and gene fitness is described by Expression 7, the individual fitness of multigene GEP.

This algorithm is called Ordered Mixed Feature GEP (OMF-GEP).

4.4. Experiment Setup

The setting of experiment 2 is mostly consistent with that of experiment 1. The differences are that the number of genes is increased to 2, which adds the gene recombination operator (with probability 0.01), and that the termination condition of the algorithm is changed to 100,000 iterations.

The fitness of each gene is calculated in the same way as in experiment 1. Expression 7 is used to calculate the fitness of the whole chromosome, with n = 2.

4.5. Experiment

Use the circuit of Figure 5. In this circuit, Boolean expression is

Among them,

During the experiment, we deliberately hide the test cases that allow X to take a value of 1. The setup of the 4 groups of experiments is completely consistent with that of experiment 1. Table 3 shows the results.

The experimental results show that, whatever combination of features is used, the fitness stops growing after reaching a very low value soon after evolution begins; evolution has effectively stopped. The overall trend is shown in Figure 7.

5. Algorithm 3: Superchromosome

In experiment 2, evolution stalls early, while the fitness is still low. The reason is that individual fitness measures how closely the individual approximates the whole circuit, whereas in the modified algorithm each gene evolves alone; that is, the approximation of the circuit is split into separate parts. An individual can therefore contain one gene that already matches its output pin well while its other genes are poor. During evolution, such an individual is treated as a poor individual and has little chance of being inherited by the next generation, so locally leading genes are lost. As a result, the algorithm evolves very inefficiently, or even fails to converge, because the search has degenerated into almost random evolution.
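A small numerical sketch shows how a leading gene can be lost. It assumes that chromosome fitness is the arithmetic mean of the gene fitness values, which may differ from Expression 7, and the two individuals are hypothetical.

```python
def chromosome_fitness(gene_fitness):
    """Assumed aggregate: arithmetic mean of the per-gene fitness values."""
    return sum(gene_fitness) / len(gene_fitness)

# Hypothetical per-gene fitness values for two individuals.
population = {
    "balanced":        [0.60, 0.60],
    "one_strong_gene": [0.95, 0.10],
}

# Selection ranks individuals by whole-chromosome fitness only.
ranked = sorted(
    population,
    key=lambda name: chromosome_fitness(population[name]),
    reverse=True,
)
```

The individual holding the best first gene ranks last, so selection on whole-chromosome fitness tends to discard that gene.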

5.1. Algorithm Description

To solve this problem, this paper introduces the concept of a “superchromosome.” Every ordinary individual in the population is obtained by genetic variation after initialization, whereas the superindividual is constructed artificially. After each round of evolution, a superindividual is built from the recorded fitness of each gene: it has the same structure as an ordinary chromosome, but the gene at each position is the best gene found at that position in the whole population. The superindividual thus gathers the best gene of the latest generation at every gene location and is without doubt the optimal individual in the population. Replacing the worst individuals in the population with this superchromosome before evolution continues effectively prevents locally excellent genes from being eliminated. Figure 8 shows how the superchromosome is composed.
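The construction can be sketched as a per-position selection over the population. The data layout, the function names, and the choice to replace exactly one worst individual are assumptions made for this sketch.

```python
def build_super(population, gene_fitness):
    """Take, at each gene position, the best gene found in the population.

    population[i][g] is gene g of individual i, and gene_fitness[i][g]
    is the recorded fitness of that gene.
    """
    n_genes = len(population[0])
    best = []
    for g in range(n_genes):
        i_best = max(range(len(population)), key=lambda i: gene_fitness[i][g])
        best.append(population[i_best][g])
    return best

def inject_super(population, gene_fitness, individual_fitness):
    """Return a copy of the population with its worst individual replaced."""
    worst = min(range(len(population)), key=lambda i: individual_fitness[i])
    new_population = [list(ind) for ind in population]
    new_population[worst] = build_super(population, gene_fitness)
    return new_population

# Hypothetical population of three individuals with two genes each.
population = [["g0a", "g1a"], ["g0b", "g1b"], ["g0c", "g1c"]]
gene_fitness = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.3]]
individual_fitness = [0.5, 0.5, 0.3]
next_population = inject_super(population, gene_fitness, individual_fitness)
```

In this example the superchromosome combines the first gene of individual 0 with the second gene of individual 1 and takes the place of individual 2.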

The superchromosome was introduced to prevent excellent genes from being eliminated because they share an individual with “low quality” genes, but it brings an additional benefit: individual fitness often reaches a high level soon after evolution starts, as Figure 9 shows. The reason is that the superindividual concentrates the optimal gene at each gene position, so the whole individual achieves very high fitness.

5.2. Experimental Setup

The setting of experiment 3 is the same as that of experiment 2, except that a superindividual is generated in every generation. The fitness function is exactly the same as in experiment 2.

5.3. Experiment

The content of the experiment is the same as that of experiment 2. Table 4 shows the results.

The experimental results show the following:
(1) With superindividuals, the fitness reaches a very high level in a very short time, after which the speed of evolution drops significantly.
(2) With only a single feature, or with features that are directly correlated, the algorithm has difficulty reaching a satisfactory fitness; the reason was analyzed in the results of experiment 1.
(3) Using multiple features that lack direct correlation with each other helps to achieve higher fitness.

6. Conclusion

This paper proposes a GEP algorithm based on mixed features. When multiple features with no direct correlation between each other are used, the abnormal structure in the circuit can be found even if some important parameters are missing from the test cases. For multioutput circuits, if the circuit is simply decomposed into multiple single-output circuits that evolve separately, the algorithm falls into a completely random search and cannot converge. The concept of the superindividual is proposed to solve this problem, so that the algorithm converges smoothly on multi-output circuit structures.

Data Availability

The data used to support the findings of this study can be obtained from https://pan.baidu.com/s/1z29JUVHv8Qx4-uasESnKMw (pwd: 55vd).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the University Research Start-Up Funds of Chengdu University of Information Technology (KYTZ201720), the Open Project of Center for Information in Biomedicine of School of Life Sciences and Technology, University of Electronic Science and Technology of China (SYFD061902K), and Sichuan Science and Technology Program (2019YFG0196).