Abstract

Obfuscation of software and data is one of the subcategories of software security. This article outlines the obfuscation problem and its various methods. It then proposes a hybrid of signal and encryption obfuscation to hide the behaviour of a program and prevent hackers from reconstructing the original code. The usual signal method is strong enough for obfuscation, but it suffers from high complexity because of its many call and return instructions. To solve this problem, a new dispatcher is added to the source code to reconstruct the original control flow graph from the hidden one. The dispatcher code is encrypted to preclude access by hackers. In this paper, the potency, which makes the obfuscation strong, is increased, and the resilience, which makes the obfuscation poor, is decreased. A comparison of the similarity between the obfuscated data and its original code, and against available efficient methods, shows a performance advantage for the proposed hybrid obfuscation algorithm.

1. Introduction

With the rapid expansion of the Internet and its influence on all aspects of social, cultural, scientific, economic, and political exchanges, the most important challenge facing cyberspace is the security threats to these exchanges. Anything that can lead to a dangerous event is a security threat in cyberspace. The origins of security threats fall into two categories, people (human factors) and software, each with its own subcategories. Threats posed by human factors come from five sources: red/black hat hackers, dissatisfied employees, domestic competitors, foreign competitors, and foreign states. Threats posed by software, that is, by applications, can endanger the security of information in two ways: vulnerable applications and malware.

Based on its performance and behaviour, malware can be divided into four groups: viruses, worms, Trojan horses, and botnets. Obfuscation is an invasive technique that a malware writer uses to hide malware. It changes the appearance of the malware source code while preserving its functionality, so that the malware evades antivirus detection and continues its destructive activities. Obfuscation can also be used as a defensive solution to protect software and vital information against security threats. Malware obfuscation is studied in this research because access to obfuscated information and software for research is difficult or even impossible owing to its confidentiality [1].

Different obfuscation methods have been presented (Figure 1). One of them is dead code insertion, which alters the look of the program code. This method is easy to implement, but its disadvantage is that it can be recognized by eliminating the additional commands [25].

Another obfuscation method is changing the names of the registers so that they differ from one generation to the next; it can, however, be recognized by renaming the registers [6, 7]. Command replacement, which generates the separation code, is another type of obfuscation [1, 6]. Another form of obfuscation is code shuffling. In this method, the initial order of commands is broken up and the cost of detection goes up, but it can be recognized by eliminating the unconditional commands [6, 7]. Code integration is also an approach for obfuscation. Its disadvantage is that it is difficult to implement; its advantage is that it makes diagnosis and recovery difficult [4, 5].

Displacement and handling of subroutines is another obfuscation method: its advantage is that it obfuscates the source of the program code by breaking up the order of the subroutines. Its disadvantage is that the code can be detected by changing the subroutines [1, 2].

Another type of obfuscation is encryption. Its advantage is that the main pattern of the program code is hidden. Its disadvantage is that the malware can be identified by decoding the code.

The signal method is an obfuscation method in which the control flow graph opcodes (operation codes) are hidden [8–10]. Its advantage is that it hides the control flow graph of a program and makes the information in that graph difficult to recover [11–13]. Its disadvantage is the high cost of operations due to the high number of call and return instructions [10, 14–16].

This study proposed a hybrid signal and encryption (proposed S&E) method. In the usual signal method, the control flow graph of the program is hidden. However, the cost of the operation is increased in the usual signal method.

In the proposed method, the information of the control flow graph is first hidden by the signal method. Then, the dispatcher of the signals, which reconstructs the original control flow graph from the hidden one, is added as new information to the file and encrypted to preclude access by hackers. The hybrid method does not have the disadvantage of the signal method, whose high cost comes from the organization of signals by the operating system. In other words, to avoid overloading the operating system, the dispatcher is embedded in the program itself. Further explanation is given in the description of the proposed algorithm. In addition, since previous studies do not provide an elaborate formula to calculate the complexity and resilience of obfuscation techniques, new and transparent formulas are presented in this work.

In this article, the outline of the obfuscation problem is considered in Section 2. In Section 3, a hybrid of the signal and encryption methods is presented. The results of the implementation of the algorithm are described in Section 4, and finally, conclusions are presented in the last section.

2. Description of the Problem

Obfuscation is a set of methods that malware writers or software writers can use to turn one program into another with the same behaviour but a different appearance [1]. It involves three objective functions and six variables, which are described in brief. The three objective functions are (a) potency, (b) resilience, and (c) cost. Potency can be considered a useful measure of how well the change hides the purpose of the program; it indicates how effective the obfuscation is against people. Resilience, in contrast to potency, indicates how effective the obfuscation is against machines, that is, automatic tools. Cost measures the time complexity.

Among the six variables affecting the objective functions, the first (program time) is the number of operators and operands of the source code. The second variable (complexity) is the number of conditional statements of the source code, and the third (complexity nesting) is the maximum depth of nested statements in the source code. The fourth (complexity of information) concerns the undefined variables of the program. The fifth (complexity fan-in/out) is the number of called functions. The last variable (complexity data structure) is the number of defined variables in the program [6].

The potency of a change to a program is a function that shows the lack of change in behaviour between the obfuscated program and the source program, and it is affected by the complexity measure function [6].

3. The Proposed Algorithm

In this study, the goal is to implement obfuscation with the signal method and the encryption method in order to increase the complexity level and reduce detectability. At first, the signal obfuscation method is used so that the tree- and graph-like structures of the program become star structures. Normally, an application has a graph-like structure that is created on the basis of its function calls, but when signal (or socket) obfuscation is used, all requests and communications among the functions are done by sending signals. After a signal is created, the operating system organizes it. So, in any communication between functions, a signal must first be sent to the operating system, which then delivers the target signal to a function (or program). This turns the graph structure of the program into a star structure. Although this structure makes the control flow graph of the program unclear (advantage), it increases the cost of the program (disadvantage). So, it is recommended that the dispatcher of this star structure reside in the program itself in order to avoid overloading the operating system. In the next step, this dispatcher part of the program is encrypted to prevent its hacking (obfuscation by the encryption method). The algorithm is named “S&E” after the first letters of the words “Signal” and “Encrypt” because it is a combination of both. In the first step of the S&E procedure, the functions in the source code are run line by line.

3.1. A New Approach to Calculate the Potency Function

The complexity function depends on six parameters whose relationship is not defined precisely in previous studies. Therefore, in this article, three complexity functions are proposed according to the effect level of each of the six variables listed in the previous section on the complexity of the program, and one of them is chosen for implementation. Various functions were studied, and the following three, which showed better performance than the others, were selected:

The variables are divided into two groups, important and more important, according to their effect on the complexity. Three of the variables are placed in the more important group, and the other three are placed in the important group. In (1), the three more important variables are given the same weight, and their sum measures the complexity. In (2), the complexity is measured by a weighted sum of the three more important variables. In (3), the complexity is measured by a sum of the important and more important variables. However, the effect of the important variables can be neglected in comparison with that of the more important ones. Thus, in this study, only the more important variables are considered for the measurement, and the potency is measured by comparing the complexity of the obfuscated program with that of the original program. If the obfuscated program is more complex than the original program, obfuscation is strong and is disturbing for people, although deobfuscation can still be very simple for a machine. The complexity of the application increases according to some used metrics. Thus, the potency can be considered a useful measure of obfuscation to people.

3.2. A New Approach for Calculation of Resilience

To measure the effectiveness of obfuscation against automatic deobfuscators, resilience is introduced. Resilience takes two parameters into account: the programming effort (the amount of time it takes to build an automatic deobfuscator that can remove the obfuscation from the program) and the deobfuscator effort (the run time and memory space the deobfuscator requires to remove the obfuscation). Potency is in contrast to resilience: the aim is to increase the potency by making the application more complex, whereas the aim is to decrease the resilience, because the later the program is identified, the better. The resilience of a change to a program is a function that shows the lack of behaviour change between the obfuscated program and the source program (Martinez [6]). The resilience function is defined by six parameters that are not defined precisely in previous resources. Therefore, in this article, two functions are proposed for the implementation according to the effect level of each variable on the resilience of the program. Resilience is affected, in an automated manner, by the measurement function, the execution time, the required memory, and the amount of time it takes to build a deobfuscator. Two ways are suggested to calculate it as follows.

According to the impact of the decision variables on the measure of resilience, we chose two groups of variables whose impact reduces the resilience. According to what has been said in this article, resilience can be measured as follows:

If the resilience of the obfuscated program is less than that of the original program, the resilience is low and obfuscation is strong.

3.3. Calculating the Cost Function

The program code may require more storage space or more time to finish after it has been changed for obfuscation. This concept is introduced as the cost of the changes. The cost of a change to an application is a function that shows the lack of behaviour change between the obfuscated program and the source program, and it is affected by the measurement function of the run-time complexity. The cost can be classified in the following ways [6]:
(i) It is very high if executing the obfuscated program requires exponentially more resources than the original program.
(ii) It is high if executing the obfuscated program requires $O(n^p)$ more resources than the original program, where $p > 1$.
(iii) It is low if executing the obfuscated program requires $O(n)$ more resources than the original program.
(iv) There is no cost if executing the obfuscated program requires $O(1)$ more resources than the original program.

The quality of changes is a combination of the obfuscation quality of potency, resilience, and cost, which is displayed as follows:

4. Simulation and Results

The proposed hybrid signal and encryption method was tested on a computer with an Intel Core 2 Duo CPU and 1 GB of RAM. The code is written in C++, and standard data are used for testing and comparing the proposed algorithm. The test set includes 30 viruses from the VX Heaven public dataset [17]: 10 viruses from the Second Generation Virus Generator (G2) (published in January 1993) and 20 viruses from the Next Generation Virus Construction Kit (NGVCK).

4.1. How to Evaluate the Data

We use Mishra’s method [18] to compare two pieces of code. Mishra proposed a method that compares two assembly programs by assigning them a score that represents how similar the two programs are. The Mishra method involves the following steps:
(1) Given two assembly programs X and Y, we extract the sequence of opcodes from each, excluding comments, blank lines, labels, and other directives. The result is two opcode sequences of lengths n and m, where n and m are the numbers of opcodes in programs X and Y, respectively. Each opcode is identified by its position in its sequence.
(2) We compare the two opcode sequences by considering all subsequences of three consecutive opcodes. We count as a match every case where all three opcodes are the same, regardless of their order, and mark it at the coordinate (x, y) of a graph.
(3) After comparing the two opcode sequences and marking all the matching coordinates, we obtain a graph plotted on a grid of size n × m. The opcode positions of program X are shown on the x-axis and those of program Y on the y-axis. To reduce noise and random matches, we keep only the line segments whose length is greater than a threshold value (in this study, the threshold is 5).
(4) Because the two opcode sequences are matched sequentially, identical runs of opcodes form line segments parallel to the main diagonal. If a segment falls on the main diagonal, the matching opcodes are at the same places in the two files. A segment off the main diagonal indicates that the matching opcodes appear at different places in the two files.
(5) For each axis, we determine the fraction of opcodes that are covered by one or more segments. The similarity score for the two programs is obtained from these fractions.
The similarity metric calculates the similarity between the original program and the obfuscated program. This metric is 1 when the two programs are similar and 0 when there is no similarity between them. For example, a similarity score of 0.01 shows low similarity and good obfuscation, whereas a score of 0.85 shows high similarity and bad obfuscation.
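The five steps above can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: the function name is hypothetical, the threshold is assumed to be counted in matching three-opcode windows, and the final score is assumed to be the average of the two per-axis fractions, since the text does not state how they are combined.

```python
from collections import defaultdict


def opcode_similarity(x, y, threshold=5):
    """Return a similarity score in [0, 1] for two opcode sequences."""
    n, m = len(x) - 2, len(y) - 2  # number of 3-opcode windows in each program
    if n <= 0 or m <= 0:
        return 0.0

    # Step 2: a window matches when its three opcodes agree in any order,
    # so each window is keyed by its sorted opcodes.
    windows_y = defaultdict(list)
    for j in range(m):
        windows_y[tuple(sorted(y[j:j + 3]))].append(j)
    matches = set()
    for i in range(n):
        for j in windows_y.get(tuple(sorted(x[i:i + 3])), ()):
            matches.add((i, j))

    # Steps 3-4: follow each diagonal run of matches and keep only runs
    # longer than the threshold (length counted in windows: an assumption).
    covered_x, covered_y = set(), set()
    for i, j in matches:
        if (i - 1, j - 1) in matches:
            continue  # not the start of a run
        length = 0
        while (i + length, j + length) in matches:
            length += 1
        if length > threshold:
            # A run of `length` windows spans length + 2 opcodes.
            covered_x.update(range(i, i + length + 2))
            covered_y.update(range(j, j + length + 2))

    # Step 5: fraction of covered opcodes per axis, averaged (an assumption).
    return (len(covered_x) / len(x) + len(covered_y) / len(y)) / 2


# Hypothetical opcode sequences, already extracted as in step (1).
original = ["mov", "add", "sub", "push", "pop", "cmp", "jmp", "call", "ret", "xor"]
unrelated = ["nop"] * 10
same_score = opcode_similarity(original, original)
different_score = opcode_similarity(original, unrelated)
```

With these inputs, comparing the sequence with itself gives a score of 1.0, and comparing it with the unrelated sequence gives 0.0, which matches the two extremes of the metric described in the text.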

4.2. Results and Discussions

The results of the comparison of the obfuscated viruses in the two groups, NGVCK and G2, with their original code, based on the potency, resilience, and cost metrics, are presented in Table 1. The comparison results for the original and obfuscated codes with the Mishra criteria are shown in Table 2. The results in Tables 1 and 2 show that the presented metrics are consistent with the Mishra metric.

To compare the obfuscation algorithm proposed in this paper (S&E) with some of the available efficient methods, the minimum, maximum, and average similarity of the obfuscated viruses to their initial code are shown in Table 3 for the proposed S&E and the other algorithms. The minimum similarity of the obfuscated viruses achieved by the proposed S&E is lower than that of the other algorithms. However, the proposed S&E has not been successful in reducing the average similarity for the NGVCK and G2 viruses in some cases.

The comparison between the proposed algorithm (S&E) and other algorithms covers the sliding window of difference and control flow weight (SWOD-CFW) [19], annotated control flow graph (ACFG) [20], chi-squared distance (CSD) estimator [21], hidden Markov model (HMM) [22], substitution distance (SD) [23], opcode graph similarity [24], opcode histogram [25], opcode sequences [26], and opcode patterns [27]. In general, Table 2 shows the advantage of the proposed S&E in comparison with the other methods.

5. Conclusion

In this study, a hybrid signal and encryption obfuscation method was presented. The proposed algorithm used the signal method to change the tree- and graph-like structure of the program into a star structure and hide the control flow graph of the program. The problem of the signal method is its high number of call and return instructions. This study suggested adding a dispatcher to the program that converts the signal program back to the original control flow graph. In this way, the problem of the signal method was solved. The dispatcher was encrypted to keep it secure from hackers. Furthermore, a new approach was suggested to measure complexity and resilience, and five functions were offered to calculate their values. The results of the comparison of the similarity of the obfuscated data with the initial codes, based on Mishra’s method, show a performance advantage for the proposed hybrid obfuscation algorithm.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.