Abstract

Mixed Boolean-arithmetic (MBA) expressions, which mix bitwise operations (e.g., NOT, AND, and OR) with arithmetic operations (e.g., +, −, and ×), are used as a software obfuscation scheme. In response, multiple methods have been proposed to simplify MBA expressions. Among them, table-based solutions are the most powerful. However, a fundamental limitation of table-based solutions is that the space complexity of the transformation table explodes with the number of variables in the MBA expression. In this study, we propose a novel method that simplifies MBA expressions without any precomputation requirements. First, a bitwise expression is transformed into a unified form, and we provide a mathematical proof that guarantees the correctness of this transformation. Then, arithmetic reduction is performed to further simplify the expression and produce a concise result. We implement the proposed scheme as an open-source tool, named MBA-Flatten, and evaluate it on two comprehensive benchmarks. The evaluation results show that MBA-Flatten is a general and effective MBA simplification method. Furthermore, MBA-Flatten can assist malware analysis and boost SMT solvers’ performance on solving MBA equations.

1. Introduction

A mixed Boolean-arithmetic (MBA) expression [1, 2] is an expression that mixes bitwise operations (e.g., ¬, ∧, ∨, and ⊕) with arithmetic operations (e.g., +, −, and ×). Several formal methods [1, 3] generate a complex MBA expression that is equivalent to a given simple expression. Because it can replace a simple expression with an equivalent representation that is hard to understand, MBA transformation is an advanced software obfuscation scheme [3–5]. MBA obfuscation has been adopted by many academic projects and industrial products to protect software [5–9].

The wide practical application of MBA obfuscation has attracted research on simplifying MBA expressions. Recent studies [10, 11] demonstrate that existing computer algebra software has a very limited effect on MBA simplification. Consequently, multiple methods have been proposed to simplify MBA expressions, including bit blasting [12], pattern matching [13], program synthesis [14–16], deep learning [17, 18], and table-based solutions [5, 11]. Among them, table-based solutions are the state-of-the-art MBA simplification method. However, one strong limitation is that the complexity of creating and storing the precomputed table is $2^{2^t}$, where $t$ is the number of variables in the MBA expression. Thus, producing and storing the tables incurs an overwhelming overhead for any $t \geq 5$.

In this study, we propose a novel scheme to simplify an MBA expression without any precomputation requirements. The key idea is that a transformation procedure can reduce a bitwise expression to a unified form, and a mathematical proof is provided to guarantee the correctness of the transformation. Then, arithmetic reduction is performed to further simplify the expression and generate the final result. We implement the approach as an open-source tool, named MBA-Flatten. To demonstrate the capability of MBA-Flatten, we evaluate it on two comprehensive MBA benchmarks. The evaluation results show that MBA-Flatten solves more MBA expressions than existing tools. Because it relies only on low-cost arithmetic computation, MBA-Flatten is also an efficient MBA simplification tool. In addition, the evaluation demonstrates that MBA-Flatten can assist malware analysis and boost SMT solving on MBA equations.

In summary, this study makes the following key contributions:

(1) We find that a bitwise expression can be transformed into a unified form and provide a mathematical proof to support it. To the best of our knowledge, we are the first to prove the existence of the transformation.

(2) The bitwise expression transformation paves the way for our novel in-place MBA simplification method. Our proposed scheme first replaces the bitwise expressions with the corresponding equivalent form. In this way, arithmetic reduction rules can be seamlessly applied to further produce the simplification result.

(3) We have implemented our idea as an open-source tool, called MBA-Flatten, and evaluated it on two comprehensive MBA benchmarks. The evaluation results demonstrate that MBA-Flatten is a general and effective MBA simplification method.

The remainder of this study is structured as follows. Section 2 shows the background of MBA expression. Section 3 illustrates the proposed scheme that can be used to simplify an MBA expression. The proof of Theorem 1 can be found in Section 4. In Section 5, we describe the experimental evaluation of the proposed approach. Section 6 discusses some limitations of our proposed scheme, and Section 7 concludes this study.

2. Background

In this section, we first introduce the background of MBA expressions and their wide applications. Then, we discuss the existing research on simplifying MBA expressions and point out its limitations, which also serve as the motivation for this study.

2.1. MBA Expression

Zhou et al. [1, 2] propose the concept of a mixed Boolean-arithmetic (MBA) expression based on Boolean-arithmetic algebra, which mixes bitwise operators (e.g., NOT, AND, and OR) with arithmetic operations (e.g., +, −, and ×). MBA expressions are classified as linear MBA, polynomial MBA, and non-polynomial MBA [1, 11]. The formal definitions of linear and polynomial MBA expressions are given below, and the linear MBA expressions are a subset of the polynomial MBA expressions [1]. An MBA expression that fails to satisfy Definition 1 is considered a non-polynomial MBA expression [11].

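Definition 1. (Zhou [1]). A polynomial MBA expression is of the form

$$\sum_{i \in I} a_i \left( \prod_{j \in J_i} e_{i,j}(x_1, \ldots, x_t) \right), \quad (1)$$

where $a_i$ is an integer constant, $e_{i,j}$ is a bitwise expression of variables $x_1, \ldots, x_t$ over $B^n$, $n, t$ are positive integers, and $I, J_i \subset \mathbb{Z}$, $\forall i \in I$.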

Definition 2. (Zhou [1]). A linear MBA expression is a polynomial MBA expression of the form

$$\sum_{i \in I} a_i \, e_i(x_1, \ldots, x_t), \quad (2)$$

where $a_i$ is an integer constant, $e_i$ is a bitwise expression of variables $x_1, \ldots, x_t$ over $B^n$, $n, t$ are positive integers, and $I \subset \mathbb{Z}$.

Zhou et al. [1] design a generator using truth tables to produce infinitely many linear MBA equations. Based on existing linear MBA rules, Liu et al. [3] propose several formal methods to generate an unlimited number of polynomial and non-polynomial MBA expressions. Examples of MBA expressions are shown below. In particular, (3) is a linear MBA expression, (4) is a polynomial MBA expression, and (5) is a non-polynomial MBA expression.

Due to its solid theoretical foundation and simplicity of implementation, MBA obfuscation has been applied in multiple academic tools and industrial products to protect software [5–9]. For example, Cloakware, Irdeto, and Quarkslab apply MBA obfuscation in their commercial products [5, 7]. Tigress [6], an academic C source code obfuscator, encodes simple expressions into complex MBA forms. Blazy et al. [8] develop a C program obfuscator in which formally verified MBA obfuscation rules are integrated. Ma et al. [9] apply MBA expressions to develop a novel dynamic software watermarking scheme. Figure 1 shows how MBA expressions are used to obfuscate software [4]. Figure 1(a) demonstrates that a simple expression is substituted with a complex but equivalent expression. An opaque predicate [19] is shown in Figure 1(b), and the predicate is actually always true.

2.2. MBA Expression Simplification

The wide practical application of MBA obfuscation has encouraged research on simplifying MBA expressions. Eyrolles’ PhD thesis [10] shows that popular symbolic computation software (Maple, SageMath, Wolfram Mathematica, and Z3 [20]) fails to simplify MBA expressions. The root cause is that existing reduction rules cannot reduce expressions that mix bitwise and arithmetic operators [11]. Researchers have developed multiple solutions to simplify MBA expressions, including bit blasting [12], pattern matching [13], program synthesis [14–16], and deep learning [17, 18]. While promising, these simplification methods are still in their infancy: they either suffer from high performance penalties or produce many incorrect simplification results.

To effectively reduce MBA expression, researchers investigate the MBA mechanism and propose table-based solutions. Liu et al. [5] prove a two-way feature in the MBA transformation and design a two-variable transformation table to simplify MBA expression. Xu et al. [11] create multiple semantic-preserving transformation tables, which enumerate all bitwise expressions and the corresponding simplified forms. Using these transformation tables, MBA-Solver can effectively simplify an MBA expression.

So far, table-based solutions are the state-of-the-art MBA simplification methods. However, the space complexity of the transformation table is $2^{2^t}$, where $t$ is the number of variables in the MBA expression. Therefore, table-based solutions do not scale to MBA expressions involving five or more variables. Here, (6) is a linear MBA expression with five variables, and table-based solutions fail to simplify it. Note that multiple methods have been proposed to generate an unlimited number of MBA expressions [1, 3], and thus, an emerging challenge for MBA simplification is the MBA expression with five or more variables.
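To make the scaling problem concrete, the short sketch below counts the distinct truth tables over a given number of one-bit inputs, which is how many entries a table needs if it is to cover every bitwise expression up to equivalence. The function name `table_entries` is ours, and the count illustrates the growth rate only; it is not a description of the exact table layout used by any particular tool.

```python
def table_entries(num_vars: int) -> int:
    """Number of distinct truth tables over num_vars one-bit inputs."""
    rows = 2 ** num_vars   # input combinations in one truth table
    return 2 ** rows       # each row can independently be 0 or 1


if __name__ == "__main__":
    for t in range(1, 6):
        print(f"{t} variables: {table_entries(t)} entries")
```

Moving from four to five variables multiplies the count by 65,536, which is why a precomputed table stops being practical at that point.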

3. The Proposed Scheme

To reduce MBA expressions, we first present a key finding: a bitwise expression can be transformed into a unified form. This finding paves the way for our novel in-place MBA simplification scheme, MBA-Flatten.

3.1. Bitwise Expression Transformation

A bitwise expression is denoted as of variables , . The transformation is defined as follows:where . Equation (7) can be recursively applied to transform a bitwise expression into an arithmetic expression denoted as , which is shown as follows:where and are integers determined by . After replacing all in with , (8) will be reduced as follows:

An instance of the above transformation procedures is shown in Example 1. One interesting observation is that there is a gap between and , because is equal to the expression rather than .

Example 1. For a bitwise expression , we have

Moreover, Theorem 1 shows that the gap between a bitwise expression and the corresponding is actually a constant value, or . In other words, a bitwise expression can be successfully reduced to a unified form, Equation (10). Theorem 1 can be proved by induction on the number of bitwise operators in the bitwise expression . For the detailed proof of the theorem, refer to Section 4 of this study.

Theorem 1. Let be positive integers, be a bitwise expression of variables , , and with the form ofThen, with .
By this theorem, Example 2 shows that a bitwise expression is reduced to .

Example 2. For a bitwise expression , we have

The procedures introduced so far are integrated into Algorithm 1. The algorithm takes a bitwise expression as the input and outputs the transformation result . Algorithm 1 applies only arithmetic computation to transform a bitwise expression, so it does not introduce extra memory cost to maintain the heap or precomputed tables.

(i)Input: a bitwise expression .
(ii)Output: the simplification result of .
(1)Function BitTrans
(2) Recursively apply the transformation to transform into .
(3) Replace all in with to get .
(4).
(5) Return .
(6)End function
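The following is a minimal sketch of the idea behind Algorithm 1, not the MBA-Flatten source code. It rests on assumptions of ours that should be read as such: the one-bit maps (NOT a to 1 − a, AND to a product, OR to a + b − ab, XOR to a + b − 2ab), the representation of a product of variables as a set so that it can be read back as an AND-term, and the handling of the constant term (a constant 1 in one-bit space corresponds to the all-ones word, i.e., −1, on n-bit words). The names `to_poly`, `bit_trans`, and `evaluate` are hypothetical.

```python
import ast
from itertools import product

ONE = {frozenset(): 1}  # the constant polynomial 1


def poly_add(p, q, scale=1):
    """Return p + scale * q for polynomials stored as {variable set: coefficient}."""
    out = dict(p)
    for key, coeff in q.items():
        out[key] = out.get(key, 0) + scale * coeff
    return out


def poly_mul(p, q):
    """Multiply polynomials; merging variable sets makes x*x collapse to x."""
    out = {}
    for s, a in p.items():
        for t, b in q.items():
            out[s | t] = out.get(s | t, 0) + a * b
    return out


def to_poly(node):
    """Map a bitwise AST node to a polynomial that agrees with it on single bits."""
    if isinstance(node, ast.Name):
        return {frozenset([node.id]): 1}
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Invert):
        return poly_add(ONE, to_poly(node.operand), -1)      # 1 - a
    if isinstance(node, ast.BinOp):
        a, b = to_poly(node.left), to_poly(node.right)
        ab = poly_mul(a, b)
        if isinstance(node.op, ast.BitAnd):
            return ab                                         # a*b
        if isinstance(node.op, ast.BitOr):
            return poly_add(poly_add(a, b), ab, -1)           # a + b - a*b
        if isinstance(node.op, ast.BitXor):
            return poly_add(poly_add(a, b), ab, -2)           # a + b - 2*a*b
    raise ValueError("expected a pure bitwise expression")


def bit_trans(src):
    """Return {set of variables (read as an AND-term): coefficient}; set() is the constant."""
    poly = to_poly(ast.parse(src, mode="eval").body)
    const = poly.pop(frozenset(), 0)
    poly[frozenset()] = -const   # assumption: one-bit constant 1 becomes -1 on words
    return {key: c for key, c in poly.items() if c != 0}


def evaluate(terms, env, bits):
    """Evaluate the AND-term form on bits-wide words, modulo 2**bits."""
    mask = (1 << bits) - 1
    total = 0
    for names, coeff in terms.items():
        value = 1
        if names:
            value = mask
            for name in names:
                value &= env[name]
        total += coeff * value
    return total & mask


def check(src, names, bits=4):
    """Exhaustively compare src with its transformed form on small words."""
    mask = (1 << bits) - 1
    terms = bit_trans(src)
    for values in product(range(1 << bits), repeat=len(names)):
        env = dict(zip(names, values))
        if evaluate(terms, env, bits) != eval(src, {}, env) & mask:
            return False
    return True
```

Under these assumptions, `bit_trans("x | y")` yields the terms for x + y − (x ∧ y), and `check` can be used to confirm on small word sizes that the transformed form agrees with the original bitwise expression.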

3.2. Simplifying MBA Expression

As noted above, Algorithm 1 can transform a bitwise expression into a unified form. Using Algorithm 1, we will discuss how to simplify linear, polynomial, and non-polynomial MBA expressions.

We first introduce how to simplify a linear MBA expression. According to Equation (2), a linear MBA expression is essentially a linear combination of bitwise expressions. Using Algorithm 1, the bitwise expressions in (2) are first substituted with the corresponding transformation result. After combining like terms, (2) will be reduced to the following simple form:where is integer, . (13) indicates that a linear MBA expression can be simplified to the concise form including at most terms and is the number of variables in the MBA expression. Example 3 shows that a complex linear MBA expression can be reduced to a simple result .
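The "substitute, then combine like terms" step for a linear MBA expression can be sketched as follows. This is an illustration under our own assumptions rather than the tool's implementation: each bitwise term is given as a Python function, its coefficients over AND-terms are recovered from its one-bit truth table by inclusion-exclusion, and a one-bit constant 1 is counted as −1 on machine words. The helper names `and_basis` and `simplify_linear`, and the sample expression, are hypothetical.

```python
from itertools import combinations


def and_basis(f, names):
    """Integer coefficients of bitwise function f over AND-terms of its variables."""
    coeffs = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            # Inclusion-exclusion over the one-bit truth table of f.
            total = 0
            for k in range(size + 1):
                for ones in combinations(subset, k):
                    bits = {n: int(n in ones) for n in names}
                    total += (-1) ** (size - k) * (f(**bits) & 1)
            coeffs[frozenset(subset)] = total
    # Assumption: a constant 1 in one-bit space is the all-ones word, i.e. -1.
    coeffs[frozenset()] = -coeffs[frozenset()]
    return coeffs


def simplify_linear(terms, names):
    """terms: list of (integer coefficient, bitwise function). Combine like terms."""
    combined = {}
    for a, f in terms:
        for key, c in and_basis(f, names).items():
            combined[key] = combined.get(key, 0) + a * c
    return {key: c for key, c in combined.items() if c != 0}


if __name__ == "__main__":
    names = ("x", "y")
    # (x ^ y) + 2*(x & y): the AND-terms cancel and x + y remains.
    mba = [(1, lambda x, y: x ^ y), (2, lambda x, y: x & y)]
    print(simplify_linear(mba, names))
```

Because every bitwise term is rewritten over the same set of AND-terms, combining like terms is plain integer addition of coefficients, and the result has at most one term per subset of variables.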

Example 3. For the MBA expression in Figure 1(a), we have

Enlightened by the above simplification procedure, using Algorithm 1, Equation (1) will be transformed to an equivalent form shown as follows:where are integers, , and , . The following example shows how to simplify a polynomial MBA expression. First, every bitwise expression is substituted with the equivalent form; e.g., is replaced with . Then, arithmetic reduction rules are applied to produce the simplification result . Note that a linear MBA expression is also polynomial, so the polynomial MBA simplification method can reduce a linear MBA expression.

Example 4. For the MBA expression in Figure 1(b), we have

For a non-polynomial MBA expression, we notice that it includes multiple sub-expressions obfuscated by polynomial MBA rules. This finding inspires us to use the polynomial MBA simplification procedure to reduce a non-polynomial MBA expression. In particular, we first simplify the inner sub-expression (a polynomial MBA expression), and the simplification result of the inner sub-expression is treated as a temporary variable to expose further reduction opportunities. An instance is shown in Example 5. During the simplification procedure, the inner polynomial MBA expressions are reduced to the simplified form, such as , which is reduced to . By replacing with an intermediate variable , the expression can be further reduced to . At the last step, all temporary variables are substituted back to produce the final result .

Example 5. For the non-polynomial MBA expression , we have

3.3. Algorithm and Implementation

The MBA simplification scheme we have described above is illustrated in Algorithm 2. The algorithm takes an MBA expression as input and outputs its concise form. First, it checks whether the MBA expression is a polynomial MBA or not. For polynomial MBA, the algorithm applies Algorithm 1 to simplify the bitwise expressions. Then, an arithmetic reduction is performed to return the simplification result. For non-polynomial MBA, the algorithm applies the polynomial MBA simplification procedure to recursively reduce each inner sub-expression (polynomial MBA) and replace it with the simplified result. At last, the algorithm performs the arithmetic reduction to generate the final result. Note that Algorithm 2 applies Algorithm 1 and arithmetic computation to simplify an MBA expression, so it does not introduce any additional tables or manage extra heap memory.

(i)Input: an MBA expression .
(ii)Output: the simplification result of .
(1)Function MBA-Flatten
(2)If is a polynomial MBA expression then
(3)Return PolySim .
(4)Else
(5)For inner sub-expression is a polynomial MBA expression do
(6) PolySim .
(7)Replace with .
(8)Replace with temp variable .
(9)End for
(10)Replace all with .
(11)Arithmetic reduction on .
(12)Return .
(13)End if
(14)End function
(15)Function PolySim
(16)For every bitwise expression do
(17) BitTrans .
(18)Replace with in .
(19)End for
(20)Arithmetic reduction on
(21)Return .
(22)End function
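The first branch of Algorithm 2 needs a test for whether the input is a polynomial MBA expression. A rough sketch of such a test over the Python AST is given below. It assumes, as our reading of Definition 1, that an expression is polynomial when every bitwise operator is applied only to variables or to other bitwise sub-expressions; the treatment of constants inside bitwise operators (rejected here) is a simplification of ours, and `is_polynomial_mba` is a hypothetical name.

```python
import ast

BITWISE_OPS = (ast.BitAnd, ast.BitOr, ast.BitXor, ast.Invert)


def is_bitwise(node):
    """True for an AST node whose top-level operator is bitwise."""
    return isinstance(node, (ast.BinOp, ast.UnaryOp)) and isinstance(node.op, BITWISE_OPS)


def is_polynomial_mba(src):
    """True if no bitwise operator has an arithmetic sub-expression (or constant) below it."""
    tree = ast.parse(src, mode="eval")
    for node in ast.walk(tree):
        if not is_bitwise(node):
            continue
        for child in ast.walk(node):
            # Only expression nodes matter; operator and context nodes are skipped.
            if isinstance(child, ast.expr) and not (
                is_bitwise(child) or isinstance(child, ast.Name)
            ):
                return False
    return True


if __name__ == "__main__":
    print(is_polynomial_mba("(x & y) * (x | y) + 3 * (x ^ y)"))
    print(is_polynomial_mba("((x + y) & z) - x"))
```

When the check fails, the inner sub-expressions that do pass it are the candidates that Algorithm 2 simplifies first and then replaces with temporary variables.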

We implement Algorithm 2 as an open-source tool, named MBA-Flatten. It accepts a complex MBA expression as the input and outputs the corresponding simplification result. An overview of MBA-Flatten’s architecture is shown in Figure 2. The whole framework is written in around 1,800 lines of Python code. The parser and AST traversal components are coded based on the Python AST library. Moreover, we leverage the Python SymPy library for arithmetic reduction.

Inside MBA-Flatten, the main program consists of three major components. First, a parser receives the MBA expression and translates it to abstract syntax tree (AST) for the remaining process. Then, MBA-Flatten reduces the expression to a concise form. For polynomial MBA expression, the program uses the transformation procedure to reduce a bitwise expression, and a math reduction module is adopted to further simplify the expression. The math reduction module also includes the optimization function to generate an optimal result for some expressions; e.g., can be further reduced to . For non-polynomial MBA expression, MBA-Flatten traverses the AST bottom-up and simplifies every inner subtree (polynomial MBA expression). After reducing each sub-expression, the simplified expression is replaced with the temporary variable. At last, arithmetic reduction rules are further performed to reduce the expression and return the final simplification result. MBA-Flatten also includes utilities for measuring the complexity metrics of MBA expressions, such as counting the number of nodes in the directed acyclic graph (DAG) representation of an MBA expression, and we will discuss the complexity measurement of MBA expressions further in Section 5.1.

4. Proof of Theorem 1

To prove Theorem 1, we first present that the transformation is well defined. The definitions of value and form equivalence between two MBA expressions are shown as follows.

Definition 3. Suppose two MBA expressions of variables . if for all if and are of the same form
The maps in Equation (7) are identical in one-bit space. In other words, the bitwise expression is equivalent to with , which is shown as follows:Proposition 1 shows that the transformation is well defined, and one instance is shown in Example 6.

Proposition 1. Let be the bitwise expression of variables . Given two bitwise expressions, if, then.

Proof. induces . According to Equation (18), there is . Note the uniqueness of , and then, . Since , we have .

Example 6. For the bitwise expressions , , and . We haveThus, .
Next, we present the concept of the signature vector shown as follows. The signature vector of a linear MBA expression is a vector with dimensions, where is the number of variables in the expression.

Definition 4. (Xu [11]). Letbe a linear MBA expression, whereis integers andis bitwise expressions. Let M be theBoolean matrix representing the truth table of, The signature vectoris the product of the MBA truth table matrixand the coefficient vector.Table 1 shows the truth table of multiple 2-variable bitwise expressions, and the column with all “1” is encoded as “−1” [1, 11]. Using Table 1, Example 7 presents the procedure of calculating the signature vector for expression . The signature vector of a bitwise expression is actually to treat its corresponding truth table as a column vector, such as .

Example 7. For a linear MBA expression , using Table 1, we haveThen, we introduce the following lemma.

Lemma 1. (Xu [11]). Given two linear MBA expressionsand,, if and only if.
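As a small illustration of Definition 4 and Lemma 1, the sketch below computes the signature vector of a linear MBA expression as the product of its one-bit truth table matrix and its coefficient vector, and then compares two expressions. The representation of an expression as a list of (coefficient, function) pairs and the name `signature` are our own choices for the example.

```python
from itertools import product


def signature(terms, num_vars):
    """Signature vector: one entry per truth table row, summing coefficient * truth value."""
    vector = []
    for bits in product((0, 1), repeat=num_vars):
        # f(*bits) & 1 is the truth table entry of the bitwise expression on this row.
        vector.append(sum(a * (f(*bits) & 1) for a, f in terms))
    return vector


if __name__ == "__main__":
    lhs = [(1, lambda x, y: x ^ y), (2, lambda x, y: x & y)]   # (x ^ y) + 2*(x & y)
    rhs = [(1, lambda x, y: x), (1, lambda x, y: y)]           # x + y
    print(signature(lhs, 2), signature(rhs, 2))
```

By Lemma 1, the two expressions are equal exactly when the printed vectors coincide, so the comparison touches only the rows of the truth table rather than all n-bit inputs.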
Using Proposition 1 and Lemma 1, Theorem 1 can be proved as below.

Proof. Let be the th element of , . Note that Equation (11) is a linear MBA expression, , and or .
We prove using mathematical induction on the number of bitwise operators in the expression of variables .
Base step: the basis is the bitwise expression with a single bitwise operator, which is one of the following four cases:where .

Case 1. Suppose , we have , and then, ; thus .If , then and If , then and Therefore, .

Case 2. Suppose , we have , and then, ; thus, . It is plainly correct that .

Case 3. Suppose , we have , and then, ; thus .If and , then and If and , then and If and , then and If and , then and Therefore, .

Case 4. Suppose , proven as above.
The above four cases lead to or and that implies . By Lemma 1, holds where variables .
Induction step: assume holds with bitwise operators in . Performing one more bitwise operator to , the new expression is one of the following forms:where . Due to the commutative law of bitwise operators and the following equations:we only need to show that holds on the following four cases with bitwise operators:Assume with , and we get and the following inductive hypothesis:

Case 5. Suppose ; from the inductive hypothesis (Equation (26)), we haveThen, ; thus,According to (27), we get If , then and If , then and Therefore, .

Case 6. Suppose ; from the inductive hypothesis (Equation (26)), we haveThen, ; thus,If , then andIf , then andTherefore, .

Case 7. Suppose ; from the inductive hypothesis (Equation (26)), we haveThen, ; thus,According to (27) and (31), we get If , then and If , then and Therefore, .

Case 8. Suppose , proven as above.
The above four cases lead to or and that implies . By Lemma 1, holds where variables .
Assume with ; from the similar discussion as above, we have with variables .
As discussed above, the induction is completed. Thus, we have with variables and or determined by .

5. Experimental Results

In this section, a set of experiments is conducted to evaluate the MBA simplification scheme, MBA-Flatten. We first run MBA-Flatten and existing peer tools on two comprehensive MBA benchmarks. The Z3 SMT solver [20] is used to check whether the simplified result is equivalent to the original MBA expression. The corresponding simplification results are discussed in Sections 5.2–5.4. As reported in Sections 5.5 and 5.6, MBA-Flatten can assist humans in analyzing software. Finally, Section 5.7 studies MBA-Flatten’s performance data, such as running time and memory footprint.

5.1. Experimental Setup
5.1.1. Peer Tools for Comparison

We collect and check existing state-of-the-art MBA simplification tools: MBA-Blast [5] and MBA-Solver [11]. MBA-Blast is a Python tool for simplifying MBA expressions via a two-variable transformation table. MBA-Solver produces multiple precomputed transformation tables, which enumerate all bitwise expressions and corresponding concise forms. Then, MBA-Solver uses these tables to simplify an MBA expression. For a more thorough evaluation, we also check other MBA simplification tools: GraphMR [18], SSPAM [13], and Syntia [14]. GraphMR is a neural network-based solution to reduce an MBA expression. SSPAM (symbolic simplification with pattern matching) is a pattern matching method that detects and reduces MBA expressions by multiple known MBA rules. Syntia is a program synthesis framework for approximating the semantics of expressions. It uses a set of input-output samples from the expression, learns the semantics of the samples, and synthesizes a simpler expression that is equal to the original expression.

5.1.2. Benchmarks

To fully expose the capability of diverse methods for simplifying MBA expressions, a large number of MBA expressions is required for evaluation. Therefore, we consider two comprehensive MBA benchmarks: Dataset 1 [14] and Dataset 2 [11]. Dataset 1 comprises 500 MBA samples generated by Tigress [6] with up to three variables. Dataset 2 collects 3,000 MBA equations with up to four variables: 2,000 polynomial MBA expressions (of which 1,000 are linear) and 1,000 non-polynomial MBA expressions. Every sample in the datasets is a 2-tuple: (, ). is the complex MBA expression, and is the related equivalent simple form. Several samples from the benchmarks are shown in Table 2.

5.1.3. MBA Complexity Metrics

We use the following metrics to measure MBA complexity: number of DAG nodes and MBA alternation. For example, the expression , whose DAG representation is shown in Figure 3, has 8 nodes and an MBA alternation (a red arrow means one MBA alternation) of 2. The larger a metric’s value, the more complex an MBA expression. We expect the metrics’ values to be reduced after simplification.

(1) Number of DAG Nodes. An MBA expression is transformed into a directed acyclic graph (DAG) representation in which the nodes are operators, variables, and constants. The number of nodes in the DAG is defined as a complexity metric for an MBA expression.

(2) MBA Alternation. The MBA complexity mainly comes from mixing bitwise operations and arithmetic operations. We adopt “MBA alternation” to measure the number of edges linking different types of operations in the DAG representation of an MBA expression.
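The two metrics can be computed with a short traversal, sketched below. This is our own approximation and not MBA-Flatten's measurement utility: identical sub-expressions are merged by using their `ast.dump` string as the node identity, and an alternation is counted for each distinct DAG edge that connects an arithmetic operator with a bitwise operator. The sample expression is ours, not the one shown in Figure 3.

```python
import ast

BITWISE = (ast.BitAnd, ast.BitOr, ast.BitXor, ast.Invert)
ARITH = (ast.Add, ast.Sub, ast.Mult, ast.USub)


def kind(node):
    """Classify an operator node as 'bitwise' or 'arith'; leaves return None."""
    op = getattr(node, "op", None)
    if isinstance(op, BITWISE):
        return "bitwise"
    if isinstance(op, ARITH):
        return "arith"
    return None


def dag_metrics(src):
    """Return (number of DAG nodes, MBA alternation) for an expression string."""
    root = ast.parse(src, mode="eval").body
    nodes, edges = set(), set()
    stack = [root]
    while stack:
        node = stack.pop()
        key = ast.dump(node)          # identical sub-expressions share one DAG node
        if key in nodes:
            continue
        nodes.add(key)
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.expr):
                edges.add((key, ast.dump(child), kind(node), kind(child)))
                stack.append(child)
    alternation = sum(1 for _, _, p, c in edges if p and c and p != c)
    return len(nodes), alternation


if __name__ == "__main__":
    print(dag_metrics("(x & y) + 2 * (x | y)"))
```

For the sample expression, the shared variables x and y are counted once each, and the two alternations are the edges from `+` to `&` and from `*` to `|`.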

5.1.4. Machine Configuration

All of our experiments are performed on a server with Intel Core i9 3.00 GHz CPU, 64 GB DDR4 RAM, 2 TB SSD Hard Drive, and running Ubuntu 20.04 OS.

5.2. Simplification on Dataset 1

In the first experiment, we run MBA-Flatten and other peer tools on Dataset 1. The evaluation result in Table 3 shows that only MBA-Flatten successfully produces verifiable simplification outputs for all MBA expressions with negligible overhead (within 0.1 seconds).

We first study correctness, i.e., whether an expression before and after simplification is semantically equivalent. The Z3 solver [20] is adopted to check whether the output of a simplification tool is equivalent to the input. The solver may not return a result due to the MBA expression’s complexity, so we set 1 hour as a practical threshold for this and the following experiments.

Table 3 presents the number of MBA expressions that can be reduced by simplification tools. GraphMR is trained on the linear MBA dataset, so it can only simplify 137 of 500 MBA expressions. SSPAM outputs 168 wrong simplification results because of the limited number of MBA rules in the pattern library. Syntia uses stochastic program synthesis to generate a simple expression, which successfully synthesizes 369 simplification results. MBA-Blast performs well on simplifying 2-variable MBA expressions rather than three or more variables, and therefore, it generates 416 simplification results. MBA-Solver can successfully simplify the majority of the MBA expressions (454 of 500), but it cannot process several special cases, e.g., the non-polynomial MBA expression including sub-expression . In contrast to MBA-Solver, MBA-Flatten can successfully simplify all 500 MBA samples, and it reduces to the expression .

Next, we investigate the effectiveness that reflects how much complexity is reduced by the simplification methods. Table 4 reports the expression complexity before and after simplification. Two quantitative metrics are used to measure expression complexity: the number of DAG nodes and MBA alternation. Table 4 shows that all simplification tools (except SSPAM) can considerably reduce the complexity measurement of the solved MBA expressions. SSPAM cannot effectively reduce a complex MBA expression to a simpler form due to the limited known MBA rules used in the software.

5.3. Simplification on Dataset 2

As the second experiment, we run MBA-Flatten and other baseline tools on Dataset 2. As shown in Table 5, MBA-Flatten can successfully simplify 2,943 of 3,000 MBA expressions, and its average processing time is less than 0.2 seconds.

Because the MBA expressions in Dataset 2 are more complex and diverse than those in Dataset 1, this experiment exposes more detailed findings. GraphMR and Syntia have a limited effect on complex MBA expressions: each correctly simplifies fewer than 450 MBA samples. SSPAM cannot generate a simpler expression, so nearly 2/3 (1,975 of 3,000) of the simplified results cannot be checked by the Z3 solver within the time threshold. Compared with MBA-Blast (1,763 simplified samples), MBA-Solver can reduce more MBA expressions with three or four variables, and it successfully simplifies 2,899 MBA samples. MBA-Flatten can reduce 2,943 MBA samples, but it fails to simplify several special cases. One exception is the non-polynomial MBA expression . Table 6 reports that all solutions (except SSPAM) can generate a simpler equivalent expression. Overall, MBA-Flatten demonstrates its advanced capability by successfully simplifying 98.1% of the MBA samples.

Furthermore, we compare the average solving time of the simplification tools on the two benchmarks. From Tables 3 and 5, the simplification time of GraphMR and Syntia barely increases, but SSPAM takes much more time when it simplifies a more complex MBA expression. MBA-Blast takes less than 0.1 seconds to simplify a two-variable MBA expression. Compared with MBA-Solver, MBA-Flatten takes slightly more time to simplify an MBA expression. The main reason is that MBA-Solver looks up the simplification result of a bitwise expression directly in the transformation tables, rather than reducing it through multiple simplification procedures.

5.4. Case Study

The evaluation results in Tables 3 and 5 show that MBA-Solver and MBA-Flatten are the most powerful MBA simplification tools. In this case study, we compare the strengths and weaknesses of MBA-Solver and MBA-Flatten.

We manually check the MBA expressions solved by MBA-Solver or MBA-Flatten, and one interesting observation is that MBA-Flatten can reduce all polynomial MBA expressions in the datasets, as MBA-Solver does. Does this scenario mean that MBA-Solver and MBA-Flatten can be substituted for each other? The answer is relevant to the number of variables in an MBA expression: as described in Section 2.2, MBA-Solver can successfully simplify a polynomial MBA expression with up to four variables; compared with MBA-Flatten, MBA-Solver is more efficient when it reduces an MBA expression. However, MBA-Flatten can simplify a polynomial MBA expression with an arbitrary number of variables, and the form of simplification result is shown in Equation (15). The following example shows how to apply MBA-Flatten to simplify Equation (6), which is an MBA expression with five variables.

Example 8. For Equation (6), we have

and thus,

The other observation is that MBA-Flatten can simplify all non-polynomial MBA expressions solved by MBA-Solver, but not vice versa. This is because MBA-Solver treats the common sub-expression as an intermediate variable, rather than a sub-expression itself. Therefore, MBA-Flatten can simplify more special cases that cannot be simplified by MBA-Solver. Moreover, MBA-Flatten can reduce a non-polynomial MBA expression with five or more variables.
From the description above, MBA-Flatten is a general MBA simplification method.

5.5. MBA-Powered Malware Deobfuscation

MBA expressions are commonly used to obfuscate code, and malware developers also adopt them to complicate their programs. Liu et al. [5] report that MBA expressions are used in a ransomware sample to protect the encryption key, and they also observe that MBA rules are integrated into the software obfuscator VMProtect, which is widely used by malware developers.

In this experiment, we demonstrate that MBA-Flatten can assist in reverse-engineering the malware obfuscated by MBA expressions. We collect all MBA expressions used in malware from existing work [5]. Then, MBA-Flatten is applied to simplify the expressions, and the Z3 solver is used to check the correctness of the simplified result. The evaluation result shows that MBA-Flatten can successfully simplify all MBA expressions collected from existing malware samples. One simplification procedure is shown as follows, and MBA-Flatten produces the final result .

Furthermore, we replace the MBA expressions used in malware with new MBA expressions involving five or more variables and produce 130 variants, such as the above expression , which is replaced with Equation (6). We apply MBA-Blast and MBA-Flatten to simplify the new MBA expressions. Unfortunately, MBA-Blast fails to simplify them. In contrast, MBA-Flatten can successfully simplify all new MBA expressions. Therefore, this experiment shows that MBA-Flatten can simplify the MBA expressions used in existing malware and the complex MBA expression with five or more variables.

5.6. Boosting SMT Solving MBA Equations

Satisfiability modulo theories (SMT) solvers have been widely applied in diverse software engineering areas, such as software analysis [21, 22], symbolic execution [23, 24], and test generation [25]. Existing work [10, 11] has shown that SMT solvers have difficulty solving MBA equations. However, the MBA simplification method MBA-Solver can be used to boost SMT solvers’ performance on solving MBA equations [11].

In this experiment, we report that MBA-Flatten (denoted as MF) can assist SMT solvers in solving MBA equations. We consider the benchmark from [11] and test three popular SMT solvers: Boolector [26], STP [27], and Z3 [20]. This benchmark is the same as Dataset 2 in this study, and MBA-Solver (denoted as MS) is used as the baseline. MBA-Flatten and MBA-Solver are used to simplify all MBA equations in the benchmark, and then, the simplification results are passed to the three SMT solvers.

The evaluation result is shown in Table 7; the solving time threshold is set to 1 hour. Before simplification, the three SMT solvers can solve only a small portion of the MBA equations within the time threshold (Boolector 496 (16.5%), STP 98 (3.3%), and Z3 84 (2.8%)), whereas after simplification, all three solvers can solve over 96% of the MBA equations. Compared with MBA-Solver, all SMT solvers can solve more MBA equations after MBA-Flatten’s simplification. This is because MBA-Flatten successfully simplifies more MBA expressions than MBA-Solver, as shown in Table 5. After MBA-Flatten’s simplification, all SMT solvers can solve 2,943 of 3,000 MBA equations, which means that the difference between the solvers’ performance on MBA expressions becomes insignificant. These results indicate that MBA-Flatten is a generic method for boosting SMT solvers’ performance on MBA expressions.

5.7. Performance

This section reports MBA-Flatten’s performance data. Table 8 shows the time and memory cost of MBA-Flatten when processing MBA expressions of different complexity, measured by the number of nodes. For each complexity level, 100 different MBA expressions are generated for the test. Because some of the timings are small, we repeat every test 100 times. MBA-Flatten is efficient because it only performs low-cost arithmetic computation. Our implementation uses the Python SymPy library to perform the arithmetic reduction efficiently. Overall, MBA-Flatten is an efficient tool for simplifying MBA expressions.
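The repeat-and-average measurement described above can be sketched with the Python standard library. The helper below is a minimal illustration, not the paper's measurement harness: `measure` and `toy_simplify` are hypothetical names, and the placeholder function only strips spaces instead of simplifying anything.

```python
import time
import tracemalloc


def measure(func, arg, repeats=100):
    # Average wall-clock time over `repeats` runs, since single runs are too short.
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    avg_seconds = (time.perf_counter() - start) / repeats

    # Peak Python-level memory allocated during one additional run.
    tracemalloc.start()
    func(arg)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return avg_seconds, peak_bytes


def toy_simplify(expr):
    # Hypothetical placeholder for the simplifier under test.
    return expr.replace(" ", "")


avg, peak = measure(toy_simplify, "(x ^ y) + 2 * (x & y)")
```

Memory is traced in a separate run so that the tracing overhead does not distort the timing.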

6. Discussion

MBA-Flatten has demonstrated the feasibility of automatically reducing MBA expressions. However, we also note some potential enhancements for future work.

As introduced in Section 5.3, MBA-Flatten cannot simplify the non-polynomial MBA expression . We further investigate how to reduce it, and the simplification procedure is shown below. During the simplification procedure, the sub-expression is treated as an intermediate variable rather than the expression . However, it is hard for an automatic tool to precisely detect and identify such a sub-expression, for example, the sub-expression . To mitigate this problem, one possible solution is to integrate multiple heuristic rules into MBA-Flatten. With such rules, MBA-Flatten could explore diverse reduction directions to generate a simpler result.

An adversary may attack MBA-Flatten by combining MBA obfuscation with other obfuscation techniques to generate an expression that does not satisfy the MBA definition in this study. Because MBA-Flatten is designed for simplifying MBA expressions, it may correctly handle certain MBA sub-expressions but cannot simplify the remaining non-MBA part. It would be interesting to further investigate whether MBA-Flatten can interact with other analysis techniques (e.g., symbolic execution) to produce a better result.

7. Conclusion

Existing work performs well on simplifying MBA expressions with very few variables. However, state-of-the-art methods struggle to simplify multivariable MBA expressions. We address this challenge using an in-place simplification method. A transformation procedure is proposed to transform a bitwise expression into a unified form, and we provide a mathematical proof to guarantee the correctness of this transformation. Arithmetic reduction is then used to further simplify the expression and produce a simplified result. Our large-scale experiments show that MBA-Flatten is a general and effective MBA simplification method. Furthermore, MBA-Flatten not only advances automated malware analysis but also boosts SMT solving of MBA equations.

Data Availability

The data and code presented in this study are available at https://tinyurl.com/y5l948pu.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

Acknowledgments

The authors would like to thank team members from Anhui Province Key Laboratory of High Performance Computing for their valuable suggestions. The authors appreciate Jingyao Ke for proofreading the paper. This work was supported by the Core Electronic Devices, High-End Generic Chips and Basic Software of National Science and Technology Major Projects of China, under Grant no. 2012ZX01034-001-001.