Abstract

Smartphone usage has increased continuously in recent years. Android devices in particular are widely used in daily life, which makes them an attractive target for attackers. Malware analysis for the Android platform is therefore in urgent demand. Static analysis and dynamic analysis are the two classical approaches, but both have drawbacks. Motivated by this, we present Demadroid, a framework for detecting Android malware. We collect dynamic information to build an Object Reference Graph and propose the λ-VF2 algorithm for graph matching. Extensive experiments show that Demadroid can efficiently identify the malicious features of malware. Furthermore, the system effectively resists obfuscation attacks and detects variants of known malware, which makes it suitable for practical use.

1. Introduction

Android is a mobile operating system developed by Google, based on the Linux kernel, and designed primarily for touchscreen mobile devices such as smartphones and tablets [1]. On top of the kernel sit middleware, libraries, and APIs written in C, and the kernel level is independent of other resources [2].

With the popularity of smartphones, the number of Android users has risen dramatically [3]. However, this popularity also attracts malware authors, and malware has become an urgent threat to users [4]. According to the type of threat, malicious apps can be divided into at least six categories: abuse of value-added services software, advertising fraud software, data theft software, malicious downloading software, malicious decoding software, and spyware. Research from the security company Trend Micro shows that premium service abuse is the most common type. For example, text messages are sent from infected phones without the permission of users [5]. Android has become the hardest-hit platform. However, Google engineers have argued that the malware and virus threat on Android is being exaggerated by security companies for commercial reasons. A survey published by F-Secure showed that only 0.5% of the Android malware reported had come from the Google Play store [6].

In addition, malware comes from a wide range of sources. Unlike PC viruses, Android malicious attacks have their own features, and the various types of malicious code cover almost every level of the system. The proportion of the various malware types is shown in Figure 1 [7].

Motivated by this, many Android malware detection methods have been proposed; they fall into the following two types [8].

The first type is static analysis. Static methods analyze the executable file directly instead of running it. For example, DroidDet [9] statically detects malware by utilizing the rotation forest model. However, this work cannot resist obfuscation attacks.

The second type is dynamic analysis. Unlike static methods, dynamic methods extract the malicious features at runtime, which improves the effectiveness of detection and makes the analysis more robust. However, dynamic analysis is not applicable in some cases, because developing tools that allow the dynamic analysis of malware is very challenging, and such techniques require extensive resources and often do not scale well enough to be used in practice [10]. Shabtai A et al. [11] propose a dynamic technique, sandboxing, which is built with a kernel LKM (Loadable Kernel Module). They analyze the system calls from the kernel to create the log file. However, the modification to the kernel level makes operation unstable, and the user interaction is only simulated by automatic tools rather than produced by real operation [12].

To address these problems, we propose a more effective dynamic technique to detect Android malware. It is a new technique for establishing dynamic birthmarks. We extract the reference relationships between objects allocated in heap memory and then establish the ORG (Object Reference Graph) to build the ORGB (Object Reference Graph Birthmark) as the feature. In addition, we propose the λ-VF2 algorithm to perform subgraph isomorphism matching.

Compared with the existing dynamic birthmark methods, we utilize the information in the heap, which can also be used to solve the problem of code plagiarism. In summary, the main contributions of this paper are listed as follows.

(i) We establish the ORG by extracting all the referential relationships between objects allocated in heap memory.
(ii) With the analysis of the program classes, we extract the feature classes to build the ORGB as the birthmark of malware.
(iii) Based on the VF2 algorithm, we propose the λ-VF2 algorithm to reduce the false negative rate and the false positive rate.
(iv) We propose an Android malware detection system, Demadroid, which resists obfuscation attacks.

To demonstrate the effectiveness of the proposed approaches, we conduct extensive experiments. Experimental results show that the proposed system and algorithm perform well.

The rest of the paper is organized as follows. In Section 2, we discuss the related work, and we give the details of our algorithm in Section 3. Section 4 presents the framework of Demadroid. The evaluation of Demadroid is depicted in Section 5. In Section 6, we summarize the whole work.

2. Related Work

Several approaches have been proposed recently to detect malware on Android. Generally, they are divided into static analysis and dynamic analysis.

Static analysis inspects an app without executing it. Julia is a static analysis tool for Java bytecode on the Android platform, but it cannot parse the classes generated by the XML file mapping. Payet É et al. [10] improve it to analyze the bytecode of the Dalvik Virtual Machine (DVM). Kui Luo et al. [13] propose a bytecode conversion tool for privacy-stealing malware, which converts programs into DVM bytecode and analyzes Android programs. The work in [14] uses the existing tools dex2jar and FindBugs, traversing the flowchart of Android programs to obtain the functional dependencies between Intent objects. These works are based on existing tools, which have many limitations. Batyuk L et al. [15] present a disassembly method: they disassemble the malicious code of an Android app, locate the malicious part, and modify it to separate the malicious code. This method is effective for untreated apps but cannot deal with obfuscated code. Based on sensitive data access, Di Cerbo F et al. [16] study privacy-stealing code. They analyze the permissions requested by a program and compare them with predefined features to determine whether the program is malicious. One important problem in this work is that Android does not place permission restrictions on the use of APIs, so the method cannot identify malicious code that exploits Android vulnerabilities. In short, the main drawback of static methods is their weak robustness: code obfuscation, junk code, and other antidetection techniques can easily evade detection.

Dynamic analysis can resist code obfuscation attacks but is more expensive than static methods. Isohara T et al. [17] use a kernel-level monitoring method to record the system calls of an Android program. This method can effectively analyze the record of system calls, but it is only used for monitoring information theft. Based on this, Schmidt A D et al. [18] present further research and divide the monitoring into the Android application layer, the system application layer, and the system kernel layer. However, there are no valid experimental tests to verify the feasibility of the work. Crowdroid [19] is a classifier based on anomaly detection. The system uses the existing Strace program to monitor system calls and create record files. After being uploaded to the server, the files are classified by the K-Means algorithm. However, the amount of data and the network traffic of the system are relatively large, and the approach also raises a data security problem: attackers can easily fabricate the key information and interfere with the result. Shabtai A et al. [11] mention a dynamic analysis technique, sandboxing, which is a new direction for Android malicious code detection. However, the current sandbox technology is incomplete. Myles et al. [20] use the control flow of apps to identify malicious behaviors. Experiments show that control-flow analysis is more effective than static birthmark analysis in dealing with attacks that exploit program semantics.

3. VF2 Algorithm

3.1. Isomorphic Patterns of Graphs

In the past decades, graph matching has been one of the main research topics in computer science. In general, graph matching algorithms fall into two families: exact-matching algorithms and inexact-matching algorithms. Exact-matching algorithms require strict consistency between the two candidate graphs. The most stringent pattern of exact matching is graph isomorphism, which requires the mapping of nodes and edges between both graphs to be a bijection [21]. A weaker pattern of exact matching is subgraph isomorphism, which requires an isomorphism between one graph and a subgraph of the other [22].

Inexact-matching algorithms, which are also called fault-tolerant matching, relax these constraints to tolerate errors and noise. Monomorphism relaxes subgraph isomorphism by dropping the requirement that edge preservation hold in both directions. It requires that every node of the first graph map to a distinct node of the second graph and that every edge of the first graph have a corresponding edge in the second, so the second graph may contain redundant edges and nodes. A still weaker pattern is homomorphism, a many-to-one mapping that does not require every node of the first graph to be mapped to a different node of the second graph. Another approach matches a subgraph of the first graph against a subgraph of the second; the result is not unique, so the goal is usually the largest such match, which is called the maximum common subgraph (MCS).
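
To make the differences between these patterns concrete, the following minimal sketch checks a given node mapping against each definition. It assumes that a digraph is represented as a set of `(source, target)` edge pairs and that the mapping is a dictionary from pattern nodes to target nodes; the function names are illustrative and do not come from the cited works.

```python
def preserves_edges(pattern_edges, target_edges, mapping):
    """Every pattern edge must map onto an edge of the target graph."""
    return all((mapping[u], mapping[v]) in target_edges for u, v in pattern_edges)


def is_injective(mapping):
    """No two pattern nodes may share the same image."""
    return len(set(mapping.values())) == len(mapping)


def is_homomorphism(pattern_edges, target_edges, mapping):
    # Many-to-one mappings are allowed; only edge preservation is required.
    return preserves_edges(pattern_edges, target_edges, mapping)


def is_monomorphism(pattern_edges, target_edges, mapping):
    # One-to-one mapping; the target may contain extra edges.
    return is_injective(mapping) and preserves_edges(pattern_edges, target_edges, mapping)


def is_subgraph_isomorphism(pattern_edges, target_edges, mapping):
    # One-to-one mapping where edges correspond in both directions.
    if not is_monomorphism(pattern_edges, target_edges, mapping):
        return False
    nodes = list(mapping)
    return all(((mapping[u], mapping[v]) in target_edges) == ((u, v) in pattern_edges)
               for u in nodes for v in nodes)


pattern = {("a", "b")}
target = {(1, 2), (2, 1)}
one_to_one = {"a": 1, "b": 2}
chain = {("a", "b"), ("b", "c")}
folded = {"a": 1, "b": 2, "c": 1}
```

With these inputs, `one_to_one` is a monomorphism but not a subgraph isomorphism, because the target contains the extra edge `(2, 1)`. The mapping `folded` is a homomorphism of `chain` but not a monomorphism, because two pattern nodes share one image.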

3.2. Analysis of the Subgraph Isomorphism Matching Algorithm

All the isomorphic patterns are NP-complete problems except graph isomorphism. Whether graph isomorphism is an NP-complete problem has not been proved so far [23]. At present, polynomial time algorithms exist only for special types of graphs, and there is no polynomial time algorithm for general graphs. For this reason, the time complexity of exact-matching algorithms is exponential in the worst case. In practical problems, however, the time cost is usually acceptable, because the graphs encountered in practice are rarely the worst case and the attributes of the nodes and edges can greatly reduce the search time.

Graph isomorphism matching is a classic problem in graph theory, and different scenarios call for different algorithms. In practice, the data used to build a graph is inevitably disturbed, which is why graph isomorphism is rarely used. Subgraph isomorphism and monomorphism are the commonly used patterns because they are more effective in dealing with practical problems. Many algorithms have been developed for these two problems. At present, exact-matching algorithms are more effective for basic graphs and for searching for the MCS.

3.2.1. Ullmann Algorithm

One of the most important graph matching algorithms is the Ullmann algorithm [24], which was proposed in 1976. It can solve isomorphism problems such as graph isomorphism, subgraph isomorphism, and monomorphism. The algorithm also provides a way to deal with maximum matching, so it can be used to solve the MCS problem as well.

To reduce the number of unfruitful matching branches, the Ullmann algorithm uses a predictive equation to control the backtracking process, which significantly reduces the size of the search space and improves the performance of the algorithm.

3.2.2. Ghahraman Algorithm

Ghahraman proposed another backtracking-based monomorphism algorithm in 1980 [25]. To reduce the search space, that work uses a technique similar to the association graph. The matching search is carried out on the NetGraph matrix, which is generated from the Cartesian product of the node sets of the two graphs being matched. A monomorphism between the two graphs corresponds to a subgraph of the NetGraph. The author gives two necessary conditions for a partial matching to lead to a result.

One of the main disadvantages is that storing the NetGraph requires a matrix of size at least $N^2 \times N^2$, in which $N$ represents the number of nodes. Therefore, this algorithm is more suitable for graphs with a small number of nodes.

3.2.3. Nauty Algorithm

The Nauty algorithm [26] is the most famous tree search algorithm that is not based on backtracking. It only deals with the graph isomorphism problem and is recognized as the fastest algorithm for it. Using results from group theory, it builds the automorphism group of each input graph. From the automorphism group it derives a standard (canonical) labeling, which guarantees that each equivalence class of the automorphism group introduces a unique node order. The isomorphism test between two graphs is then equivalent to comparing the adjacency matrices of their standard labelings.

Comparing two adjacency matrices takes $O(N^2)$ time, whereas building the standard labeling can take exponential time in the worst case. In most cases, the time performance is acceptable. Because the standard labeling of each graph can be computed independently, the algorithm is well suited to matching a graph against a large library.

3.2.4. VF and VF2 Algorithm

The VF algorithm proposed by Cordella et al. [27] applies to both isomorphism and subgraph isomorphism. Cordella et al. define a heuristic search that analyzes the nodes adjacent to the nodes already matched. This heuristic is significantly better than the Ullmann algorithm and other algorithms in many cases.

Cordella et al. improved the algorithm in 2001; the result is called the VF2 algorithm [28]. The improvement reduces the space complexity from $O(N^2)$ to $O(N)$, in which $N$ denotes the number of nodes. In this way, the algorithm can be applied to the matching of large graphs.

The VF2 algorithm is also used in many other related fields. For example, Jonathan Crussell et al. propose DNADroid [29], a tool which uses VF2 algorithm to detect cloned apps. In this work, VF2 algorithm is used to compute subgraph isomorphism. The experiment proves that VF2 algorithm is suitable for graphs containing a variety of node types.

3.3. Comparison of Subgraph Isomorphism Matching Algorithms

In this section, we analyze the classical algorithms mentioned in Section 3.2 and select the proper algorithm as the foundation of our matching process. The main types of graph include the bounded valence graph, the two-dimensional grid graph (2D mesh graph), and the randomly connected graph. Foggia et al. analyze the above algorithms experimentally [30]. The ORG in our work is similar to a randomly connected graph and quite different from the other two kinds. Therefore, we only discuss the case of randomly connected graphs. Foggia et al. use a control group with different densities of nodes and edges. The experimental results show that the VF2 algorithm and the Nauty algorithm are better than the Ullmann algorithm in dealing with randomly connected graphs. VF2 performs better than the VF algorithm across different densities. Compared with the Nauty algorithm, the VF2 algorithm is better at matching sparse graphs, whereas the Nauty algorithm is more applicable to dense graphs.

In this paper, we match the subgraphs between object reference dependency graphs, in which nodes represent classes, and directed edges represent references between classes. According to the analysis of samples, the number of nodes in ORG is within 100. Therefore, the algorithm used in this paper is based on VF2 algorithm.

3.4. Review of VF2 Algorithm

The VF2 algorithm is applicable to isomorphism, subgraph isomorphism, and monomorphism because it does not impose restrictions on the topology of the matched graphs. The algorithm adopts the concept of state space representation (from now on SSR) in the matching process and proposes five feasibility rules to prune the search space. Compared with the VF algorithm, the most significant improvements are the strategy for traversing the search tree and the data structures, which allow the algorithm to match graphs with thousands of nodes.

The primary idea of the VF2 algorithm is as follows. Given the digraphs $G_1 = (N_1, B_1)$ and $G_2 = (N_2, B_2)$, shown in Figure 2, we are looking for the isomorphic mapping between them. The mapping $M$ is expressed as a set of pairs $(n, m)$, in which $n$ denotes a node of $G_1$ and $m$ denotes a node of $G_2$. The process of finding the mapping is described by the SSR. Each state $s$ in the matching process is a partial mapping $M(s)$, which is a subset of $M$. $G_1(s)$ denotes the subgraph of $G_1$ associated with $M(s)$, and $G_2(s)$ denotes the subgraph of $G_2$ matched by $M(s)$. $M_1(s)$ and $M_2(s)$, respectively, represent the sets of vertices in $G_1(s)$ and $G_2(s)$. $B_1(s)$ and $B_2(s)$, respectively, denote the edge sets of $G_1(s)$ and $G_2(s)$. Given an intermediate state $s_p$, the partial mapping $M(s_p)$ is the set of node pairs matched so far.

There are multiple states in the matching process, and a state $s$ is converted to another state by adding a new pair of nodes. By adding different pairs of nodes, $s$ is converted to various states. The states are therefore organized in a tree structure in which a parent node represents the original state and a child node represents the new state. In Figure 2, $s_p$ converts to $s_q$ after a node pair $(n, m)$ is added. Figure 3(a) shows that this node pair is just one of many possible ones. Therefore, we need to select the appropriate state by backtracking through the search tree. In Figure 3(b), after the pair is added, $G_1(s_p)$ and $G_2(s_p)$ are successfully converted to $G_1(s_q)$ and $G_2(s_q)$.

In the matching process, $M$ is obtained by searching the SSR. The VF2 algorithm proposes five feasibility rules to reduce the time complexity by pruning the search space. According to these rules, the child nodes that do not satisfy them are removed. The remaining set of nodes is called the candidate set $P(s)$, which is traversed in depth-first order. The pseudocode of the VF2 algorithm is shown in Algorithm 1.

Input: $G_1$, $G_2$, state $s$; the initial state is $s_0$, and $M(s_0)$ is empty
Output: the isomorphic mapping $M$
PROCEDURE VF2Match(s)
  IF $M(s)$ covers all the nodes of $G_2$ THEN
    Successful match
  ELSE
    Find $P(s)$, the set of possible pairs for $M(s)$
    FOREACH h in $P(s)$
      IF all rules are satisfied for h added to $M(s)$ THEN
        s' = put h into $M(s)$
        CALL VF2Match(s')
      ENDIF
    ENDFOREACH
    Restore data
  ENDIF
END PROCEDURE VF2Match

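The following definitions are given:

(1) $T_1^{out}(s)$: the set of vertices of $G_1$ that are descendants of vertices in $M_1(s)$ but are not contained in $M_1(s)$.
(2) $T_2^{out}(s)$: the set of vertices of $G_2$ that are descendants of vertices in $M_2(s)$ but are not contained in $M_2(s)$.
(3) $T_1^{in}(s)$: the set of vertices of $G_1$ that are antecedents of vertices in $M_1(s)$ but are not contained in $M_1(s)$.
(4) $T_2^{in}(s)$: the set of vertices of $G_2$ that are antecedents of vertices in $M_2(s)$ but are not contained in $M_2(s)$.

The steps of selecting $P(s)$ are as follows:

(1) If $T_1^{out}(s)$ and $T_2^{out}(s)$ are not empty sets, then $P(s) = T_1^{out}(s) \times \{\min T_2^{out}(s)\}$.
(2) If $T_1^{out}(s)$ and $T_2^{out}(s)$ are both empty sets and $T_1^{in}(s)$ and $T_2^{in}(s)$ are not empty sets, then $P(s) = T_1^{in}(s) \times \{\min T_2^{in}(s)\}$.
(3) If $T_1^{out}(s)$, $T_2^{out}(s)$, $T_1^{in}(s)$, and $T_2^{in}(s)$ are all empty sets, then $P(s) = (N_1 - M_1(s)) \times \{\min (N_2 - M_2(s))\}$.
(4) In all other conditions, the state $s$ is pruned.

As described above, if only one of $T_1^{out}(s)$ and $T_2^{out}(s)$, or only one of $T_1^{in}(s)$ and $T_2^{in}(s)$, is an empty set, state $s$ is pruned. For state $s$, the algorithm checks all the candidate node pairs with the feasibility function $F(s, n, m)$, in which $s$ denotes the current state, $n$ denotes a vertex of $G_1$, and $m$ represents a vertex of $G_2$. The return value of $F(s, n, m)$ reflects whether the given node pair is feasible. If it is not feasible, the corresponding path is pruned.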
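
The selection steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that each digraph is stored as a dictionary mapping every node to the set of its successors, that nodes are comparable so that a minimum exists, and that the partial mapping is a dictionary from nodes of the first graph to nodes of the second. The names `terminal_sets` and `candidate_pairs` are hypothetical.

```python
def terminal_sets(succ, matched):
    """Return (T_out, T_in): unmatched successors and predecessors of matched nodes."""
    t_out = {v for u in matched for v in succ[u]} - matched
    t_in = {u for u, targets in succ.items() if targets & matched} - matched
    return t_out, t_in


def candidate_pairs(succ1, succ2, mapping):
    """Compute the candidate set P(s) for the partial mapping {G1 node: G2 node}."""
    m1, m2 = set(mapping), set(mapping.values())
    out1, in1 = terminal_sets(succ1, m1)
    out2, in2 = terminal_sets(succ2, m2)
    if out1 and out2:                               # step (1)
        return [(n, min(out2)) for n in sorted(out1)]
    if not out1 and not out2 and in1 and in2:       # step (2)
        return [(n, min(in2)) for n in sorted(in1)]
    if not (out1 or out2 or in1 or in2):            # step (3)
        rest1, rest2 = set(succ1) - m1, set(succ2) - m2
        return [(n, min(rest2)) for n in sorted(rest1)] if rest2 else []
    return []                                       # step (4): prune the state


g1 = {1: {2}, 2: {3}, 3: set()}
g2 = {10: {20}, 20: set()}
```

For example, `candidate_pairs(g1, g2, {1: 10})` pairs the only unmatched successor in each graph, whereas `candidate_pairs(g1, g2, {3: 10})` returns an empty list because only one of the two successor sets is empty.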

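The feasibility rules are divided into syntactic and semantic rules. The syntactic rules express the topological structure of the graph, and the semantic ones express the properties of the vertices and edges. In this work, we consider only the syntactic rules because the edges and vertices of the ORG carry no properties. Therefore, $F(s, n, m)$ is defined as follows:

$$F(s, n, m) = R_{pred} \wedge R_{succ} \wedge R_{in} \wedge R_{out} \wedge R_{new}$$

Five syntactic feasibility rules are defined in $F(s, n, m)$, in which $R_{pred}$ and $R_{succ}$ check the consistency of the new partial mapping. After the candidate node pair is added, $R_{in}$, $R_{out}$, and $R_{new}$ are used to prune the search space.

$\mathrm{Pred}(G, n)$ denotes the set of the antecedent nodes of $n$ in graph $G$, and $\mathrm{Succ}(G, n)$ denotes the set of the descendent nodes of $n$ in graph $G$. The algorithm defines $T_1(s) = T_1^{in}(s) \cup T_1^{out}(s)$ and $T_2(s) = T_2^{in}(s) \cup T_2^{out}(s)$. $\tilde{N}_1(s)$ and $\tilde{N}_2(s)$ are defined as $\tilde{N}_1(s) = N_1 - M_1(s) - T_1(s)$ and $\tilde{N}_2(s) = N_2 - M_2(s) - T_2(s)$.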

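Rule 1 ($R_{pred}$).

$$R_{pred}(s, n, m) \iff \big(\forall n' \in M_1(s) \cap \mathrm{Pred}(G_1, n)\ \exists m' \in \mathrm{Pred}(G_2, m) : (n', m') \in M(s)\big) \wedge \big(\forall m' \in M_2(s) \cap \mathrm{Pred}(G_2, m)\ \exists n' \in \mathrm{Pred}(G_1, n) : (n', m') \in M(s)\big)$$

Rule 2 ($R_{succ}$).

$$R_{succ}(s, n, m) \iff \big(\forall n' \in M_1(s) \cap \mathrm{Succ}(G_1, n)\ \exists m' \in \mathrm{Succ}(G_2, m) : (n', m') \in M(s)\big) \wedge \big(\forall m' \in M_2(s) \cap \mathrm{Succ}(G_2, m)\ \exists n' \in \mathrm{Succ}(G_1, n) : (n', m') \in M(s)\big)$$

Rule 3 ($R_{in}$).

$$R_{in}(s, n, m) \iff \big(\mathrm{Card}(\mathrm{Succ}(G_1, n) \cap T_1^{in}(s)) \geq \mathrm{Card}(\mathrm{Succ}(G_2, m) \cap T_2^{in}(s))\big) \wedge \big(\mathrm{Card}(\mathrm{Pred}(G_1, n) \cap T_1^{in}(s)) \geq \mathrm{Card}(\mathrm{Pred}(G_2, m) \cap T_2^{in}(s))\big)$$

Rule 4 ($R_{out}$).

$$R_{out}(s, n, m) \iff \big(\mathrm{Card}(\mathrm{Succ}(G_1, n) \cap T_1^{out}(s)) \geq \mathrm{Card}(\mathrm{Succ}(G_2, m) \cap T_2^{out}(s))\big) \wedge \big(\mathrm{Card}(\mathrm{Pred}(G_1, n) \cap T_1^{out}(s)) \geq \mathrm{Card}(\mathrm{Pred}(G_2, m) \cap T_2^{out}(s))\big)$$

Rule 5 ($R_{new}$).

$$R_{new}(s, n, m) \iff \big(\mathrm{Card}(\tilde{N}_1(s) \cap \mathrm{Pred}(G_1, n)) \geq \mathrm{Card}(\tilde{N}_2(s) \cap \mathrm{Pred}(G_2, m))\big) \wedge \big(\mathrm{Card}(\tilde{N}_1(s) \cap \mathrm{Succ}(G_1, n)) \geq \mathrm{Card}(\tilde{N}_2(s) \cap \mathrm{Succ}(G_2, m))\big)$$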

The above five rules are applied to the subgraph isomorphism pattern. In addition, for the isomorphism pattern, “$\geq$” in $R_{in}$, $R_{out}$, and $R_{new}$ is replaced by “=”. If the newly added node pair satisfies the five feasibility rules, the algorithm adds it and continues the search.
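
The following sketch shows one way to evaluate the five syntactic rules for the subgraph isomorphism pattern. It is an illustration under stated assumptions, not the authors' code: each digraph is a dictionary mapping every node to the set of its successors, the partial mapping is a dictionary from nodes of the first (larger) graph to nodes of the second, and the helper names are hypothetical. The predecessor maps are rebuilt on every call for brevity, whereas a real implementation would cache them.

```python
def predecessors(succ):
    """Invert a successor map {node: set(successors)} into a predecessor map."""
    pred = {node: set() for node in succ}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)
    return pred


def terminal_sets(succ, pred, matched):
    """Return (T_in, T_out) for the matched node set of one graph."""
    t_out = set().union(*(succ[u] for u in matched)) - matched
    t_in = set().union(*(pred[u] for u in matched)) - matched
    return t_in, t_out


def feasible(succ1, succ2, mapping, n, m):
    """Evaluate the five rules for adding the pair (n, m) to the mapping."""
    pred1, pred2 = predecessors(succ1), predecessors(succ2)
    m1, m2 = set(mapping), set(mapping.values())
    inverse = {v: k for k, v in mapping.items()}

    def consistent(adj1, adj2):
        # Rules 1 and 2: already matched neighbours of n and m must correspond.
        return (all(mapping[x] in adj2[m] for x in adj1[n] & m1)
                and all(inverse[y] in adj1[n] for y in adj2[m] & m2))

    in1, out1 = terminal_sets(succ1, pred1, m1)
    in2, out2 = terminal_sets(succ2, pred2, m2)
    rest1 = set(succ1) - m1 - in1 - out1
    rest2 = set(succ2) - m2 - in2 - out2

    def lookahead(set1, set2):
        # Rules 3 to 5: the first graph must offer at least as many neighbours.
        return (len(succ1[n] & set1) >= len(succ2[m] & set2)
                and len(pred1[n] & set1) >= len(pred2[m] & set2))

    return (consistent(pred1, pred2) and consistent(succ1, succ2)
            and lookahead(in1, in2) and lookahead(out1, out2)
            and lookahead(rest1, rest2))


g1 = {1: {2}, 2: {3}, 3: set()}
g2 = {10: {20}, 20: set()}
```

With node 1 already matched to node 10, the pair (2, 20) is feasible because the edge from 1 to 2 corresponds to the edge from 10 to 20. The pair (3, 20) is rejected by Rule 1, since node 3 is not a successor of node 1.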

3.5. The Implementation of λ-VF2 Algorithm

In this section, we propose the λ-VF2 algorithm, tailored to the Android environment, to detect subgraph isomorphism between the ORG and the ORGB. According to Section 3.4, the VF2 algorithm targets isomorphism and subgraph isomorphism. For the ORG, however, it is still difficult to match the subgraph with the original graph even under subgraph isomorphism. The reason is that an app injected with malicious code may not run long enough, so the references created at runtime are incomplete. Therefore, the algorithm needs to relax the matching condition: it finishes as soon as the ratio of matched vertices reaches a proper threshold.

The threshold λ is an input of the algorithm and is chosen by the user. Once the ratio of matched vertices is greater than or equal to λ, the algorithm terminates and returns success. The pseudocode of the λ-VF2 algorithm is shown in Algorithm 2.

Input: $G_1$, $G_2$, state $s$; the initial state is $s_0$, and $M(s_0)$ is empty; λ: precision control parameter
Output: the isomorphic mapping $M$
PROCEDURE VF2Match(s)
  IF $|M(s)| / |N_2| \geq \lambda$ THEN
    Successful match
  ELSE
    Find $P(s)$, the set of possible pairs for $M(s)$
    FOREACH h in $P(s)$
      IF all rules are satisfied for h added to $M(s)$ THEN
        s' = put h into $M(s)$
        CALL VF2Match(s')
      ENDIF
    ENDFOREACH
    Restore data
  ENDIF
END PROCEDURE VF2Match
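
The relaxed termination test can be illustrated with the following simplified backtracking matcher. It is a sketch under stated assumptions, not the λ-VF2 implementation itself: it keeps only the consistency check between matched neighbours, omits the look-ahead pruning rules, and pairs every unmatched node of the first graph with the smallest unmatched node of the second graph instead of computing the candidate set from terminal sets. Graphs are dictionaries mapping each node to its set of successors, and the names `consistent` and `lambda_match` are hypothetical.

```python
def consistent(succ1, succ2, mapping, n, m):
    """Edges between (n, m) and every matched pair must agree in both graphs."""
    if (n in succ1[n]) != (m in succ2[m]):          # self-loops must agree
        return False
    for x, y in mapping.items():
        if (x in succ1[n]) != (y in succ2[m]):      # edges n -> x and m -> y
            return False
        if (n in succ1[x]) != (m in succ2[y]):      # edges x -> n and y -> m
            return False
    return True


def lambda_match(succ1, succ2, lam, mapping=None):
    """Return a mapping {G1 node: G2 node} covering a fraction lam of G2, or None."""
    mapping = {} if mapping is None else mapping
    if len(mapping) >= lam * len(succ2):            # relaxed success condition
        return dict(mapping)
    remaining2 = sorted(set(succ2) - set(mapping.values()))
    if not remaining2:
        return None
    m = remaining2[0]
    for n in sorted(set(succ1) - set(mapping)):
        if consistent(succ1, succ2, mapping, n, m):
            mapping[n] = m
            result = lambda_match(succ1, succ2, lam, mapping)
            if result is not None:
                return result
            del mapping[n]                          # restore data on backtrack
    return None


# Hypothetical ORG (path 1 -> 2 -> 3 plus an isolated node) and ORGB (path of four nodes).
org = {1: {2}, 2: {3}, 3: set(), 4: set()}
orgb = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
```

In this example the birthmark cannot be fully embedded, so `lambda_match(org, orgb, 1.0)` fails, while a threshold of 0.75 accepts the partial match of three of the four birthmark nodes. This mirrors the motivation above: references that were never created at runtime no longer prevent a match.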

3.6. Performance Analysis

The time and space complexity of the λ-VF2 algorithm is positively correlated with λ. As an input parameter, λ is independent of the algorithm. In this section, λ is taken as 1, which is the worst case.

3.6.1. Time Complexity

Our algorithm is a graph SSR-based isomorphism algorithm. The time complexity consists of two parts: the time of traversing and the processing time for each state.

(i) Traversing Time. At best, each state has only one satisfied candidate node; namely, there is no need for backtracking. The total number of states to traverse is then $N$, the number of nodes in the given graph. In the worst case, no state is pruned. At the $i$th level of the search tree, there are $N!/(N-i)!$ nodes, so the total number of tree nodes is

$$\sum_{i=0}^{N} \frac{N!}{(N-i)!} = N! \sum_{j=0}^{N} \frac{1}{j!}.$$

The summation $\sum_{j=0}^{N} 1/j!$ is bounded by a constant (it is less than $e$). Thus, the total number of states is $\Theta(N!)$.

(ii) Processing Time of Each State. The processing time for each state consists of three parts: the calculation time $T_P$ of the candidate set $P(s)$, the calculation time $T_F$ of the feasibility function $F(s, n, m)$, and the calculation time $T_S$ of the new state. The total time of every single state is $T = T_P + T_F + T_S$.

$T_P$: the processing time for each pair in the candidate set is constant, and the maximum size of the set is $N$. Therefore, $T_P$ is $O(N)$.

$T_F$: in the evaluation of $F(s, n, m)$, each edge costs constant time, and the number of edges is largest when the node is connected to every remaining node. Thus, $T_F = O(b)$.

$T_S$: the calculation time of the new state includes the time to update $M(s)$, $T_1^{in}(s)$, $T_1^{out}(s)$, $T_2^{in}(s)$, and $T_2^{out}(s)$, of which $M(s)$ costs constant time. The other four sets require iterating over the edges of the newly added node, which is $O(b)$ in the worst case.

Here $b$ is the number of edges that a node is connected to. In a directed graph of $N$ vertices, the number of edges connected to one given vertex grows at most linearly with $N$. Therefore, $b = O(N)$ in the worst case.

In summary, $T = T_P + T_F + T_S = O(N)$.

Final Time Complexity. According to the above analysis, the time complexity of the VF2 algorithm is the product of the two parts. In the best case, it is $\Theta(N^2)$; in the worst case, it is $\Theta(N! \cdot N)$.

3.6.2. Space Complexity

The VF2 algorithm adopts a shared data structure. Thus, the storage space required by each state is constant. The search process traverses the search tree in depth-first order, and the maximum depth of the tree does not exceed $N$. Therefore, the space complexity is $O(N)$.

4. Framework of Demadroid

Demadroid mainly includes two parts: Android client and PC server. Android client is responsible for extracting data and passing it to the server side, and PC server is responsible for the malware detection.

4.1. Design and Implementation of Android Client

The main function of the Android module is to extract the object reference information from a process. We construct MalwareDetection to analyze the running process (except the system process) and export the dynamic information file for further analysis.

The main components of MalwareDetection include front-end interface, active process finder, shell command executor, Convertor, and AHAT. The extraction flow is shown in Figure 4.

In general, existing malicious code is embedded in a normal apk. After installation, the malicious code starts with the host app and shares the process resources in memory. Objects are created in the process, and they hold references to one another. The information we need includes the objects created by the injected process and the references between them. We extract this information on the Android client because the raw memory file is too large to transfer. For example, a lightweight app such as “calculator” generates a memory file of 10 MB, and many processes run in memory at the same time. Therefore, it is necessary to extract only the useful information to reduce the network burden when uploading to the PC server.

There are three steps in the extraction process. The first step is the acquisition of raw heap information. The second step is to convert the raw memory file format. The third step is to analyze the dynamic information.

4.1.1. The Acquisition of Heap Memory Information Files

The Android SDK provides feature-rich memory monitoring tools, such as the dumpheap tool for heap data monitoring, which is supported by Android 2.3 or later. To facilitate the analysis, we use AVD to virtualize Android 4.0.3 and successfully extract the heap data of the test process with dumpheap. The data extraction environment is shown in Table 1.

The dumpheap command is in the format of “am dumpheap PID path”. We integrate dumpheap into Android program. In the extraction process, we first use the adb tools to obtain the equipment information. After the execution of this command, the heap data of process is saved in files. In this way, a complete file of raw heap information is obtained. This file is binary and cannot be read directly from the contents. Therefore, the format of the binary file needs to be converted.
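
As an illustration of the command format, the sketch below builds the corresponding invocations from a host PC. It assumes that the `adb` tool is on the path and that a device or emulator is attached; the function names and the example paths are hypothetical. Demadroid itself runs the command inside the Android app through its shell command executor, so this host-side variant only shows the shape of the call.

```python
import subprocess


def dumpheap_commands(pid, remote_path, local_path):
    """Build the adb invocations that dump a process heap and copy the file."""
    return [
        ["adb", "shell", "am", "dumpheap", str(pid), remote_path],
        ["adb", "pull", remote_path, local_path],
    ]


def dump_heap(pid, remote_path, local_path):
    """Run the commands in order; raises CalledProcessError if one fails."""
    for command in dumpheap_commands(pid, remote_path, local_path):
        subprocess.run(command, check=True)


commands = dumpheap_commands(1234, "/sdcard/dumpheap/calc.hprof", "calc.hprof")
```

Keeping the command construction separate from its execution makes the format easy to inspect without a connected device.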

For example, we start the “calculator” application in the virtual device. With the obtained process ID number, we export the memory raw data of the Calculator process by dumpheap.

4.1.2. The Format Conversion of the Memory Information File

The raw memory data is binary and it cannot be analyzed directly. We develop Convertor to convert it into an available format.

The analysis tool we propose, AHAT, is based on JHAT, which is used in the PC environment. The version of the binary memory file generated by dumpheap is 1.0.3, while the version that JHAT can analyze is 1.0.2, so the file format needs to be converted from 1.0.3 to 1.0.2 on the Android platform.

The function of Convertor is similar to that of the hprof-conv tool of the Android SDK, which is used in the PC environment. The first step is to analyze the two versions. The binary file format produced by dumpheap is shown in Box 1. The format of the binary file is fixed: it begins with a version string, such as “JAVA PROFILE 1.0.2”, followed by 4 bytes of ID information and then 8 bytes of file creation date information. After the creation date comes the memory data, which is the body of the binary file.

The memory data consists of units. Each of these units stores the information of a Java object. The format of a unit is shown in Box 2. The data structure includes a 1-byte type field, a 4-byte timestamp field, a 4-byte data length field n, and finally the n-byte object information field.
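
The layout described above can be read with a few lines of code. The sketch below is only an illustration, not the Convertor or AHAT parser: it assumes that the version string is terminated by a zero byte and that all integers are big-endian, as in the standard HPROF binary format, and the function name `read_units` is hypothetical.

```python
import io
import struct


def read_units(stream):
    """Read the file header and every unit from a binary heap file."""
    version = bytearray()
    while (byte := stream.read(1)) not in (b"", b"\x00"):
        version += byte                               # version string up to the zero byte
    id_info, created = struct.unpack(">IQ", stream.read(12))
    units = []
    while True:
        head = stream.read(9)                         # 1-byte type, 4-byte time, 4-byte length
        if len(head) < 9:
            break
        unit_type, timestamp, length = struct.unpack(">BII", head)
        units.append((unit_type, timestamp, stream.read(length)))
    return version.decode("ascii"), id_info, created, units


# A tiny synthetic file: header plus one unit of type 1 with a 3-byte body.
sample = (b"JAVA PROFILE 1.0.3\x00" + struct.pack(">IQ", 4, 1000)
          + struct.pack(">BII", 1, 0, 3) + b"abc")
parsed = read_units(io.BytesIO(sample))
```

Each unit is returned as a `(type, timestamp, body)` tuple, so a later step can decide per type whether the unit is kept.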

The main difference between the two versions is the number of types in the Detail Info field. In the old version, there are thirteen types in the Detail Info field. In the new version, nine new types are added, which are shown in Table 2.

The types shown in Table 2 make the information unanalyzable by JHAT. The solution is to remove the new types, which are irrelevant to our work.

We store the unit types as members of the Convertor class, which uses them during analysis to determine whether a given type is useful. Finally, the file is reorganized in the format of version 1.0.2.

4.1.3. Extraction of Object and Reference Information

We develop AHAT, a tool used to analyze binary files in Android which is similar to JHAT in PC environment. AHAT mainly consists of four parts: Model, Parser, Util, and external call interface. The relationship between the four modules is shown in Figure 5.

Model. It defines the types (data structures) of all involved objects, and the objects of these data structures constitute a model. There are 29 classes corresponding to object types of Java, the most important of which is the Snapshot, the largest unit of the memory snapshot model.

Parser. It is used for reading binary files, analyzing data, and using it with model objects to build a model. Parser consists of 7 classes; the main class is HprofReader, used for heap binary parsing.

Util. It is a common toolkit.

External Call Interface. It is responsible for invoking each module so that AHAT works properly. Because the Activity class handles the interaction with the user on Android, the main classes are MainActivity and QueryClassInfo; the latter is used to get the referential relationships between classes.

Following the work process of JHAT, there are four steps in the implementation of AHAT:

(1) Create: AHAT first creates a snapshot in preparation for storing data.
(2) Read: the HprofReader class parses the binary file to obtain the necessary information and builds the Snapshot object.
(3) Resolve: the Snapshot object uses the object information to initialize the data structure, which includes the reference relationships between classes.
(4) Query: based on the constructed model, we query the class references and write them to files.

4.1.4. Important Data Structures and Methods

(1) Snapshot class: It represents a snapshot of the Java objects in the JVM, which contains the dynamic object information as well as the references between objects. The data structures involved are defined in the Model module.
(2) HprofReader class: It parses the binary file to extract the memory information of each unit and uses it to build a Snapshot object. After this, we initialize the data structure and calculate the specific information of each object, such as package name, class name, class ID, class member variables, and reference relations between classes. This process is the key to dynamic information extraction.
(3) QueryClassInfo class: It extracts the references between the classes of the Snapshot object. The variable referrersStat in the process function is a HashMap that stores the classes referencing this class, and the variable refereesStat stores the classes that this class references. All the classes in memory are obtained by the function getClasses of Snapshot.
(4) PlatformClasses class: When object references are obtained, the function getClasses returns thousands of classes, most of which are platform-supplied classes, such as the Java standard API classes and the API classes provided by the Android system. These classes are irrelevant to our work. Moreover, their presence can obscure the references between the key classes. Therefore, we remove such irrelevant classes (shown in Table 3) with the function PlatformClasses.
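
The filtering performed in step (4) can be sketched as a prefix test on class names. The prefix list below is an assumption made for illustration and is not the content of Table 3; the function names are likewise hypothetical. References are assumed to be stored as a dictionary from a class name to the set of class names it references.

```python
# Hypothetical package prefixes of platform-supplied classes (not Table 3).
PLATFORM_PREFIXES = ("java.", "javax.", "android.", "dalvik.", "com.android.")


def is_platform_class(name):
    """True if the class belongs to one of the assumed platform packages."""
    return name.startswith(PLATFORM_PREFIXES)


def drop_platform_classes(references):
    """Remove platform classes both as referencing and as referenced classes."""
    return {
        cls: {ref for ref in refs if not is_platform_class(ref)}
        for cls, refs in references.items()
        if not is_platform_class(cls)
    }


raw = {
    "com.example.Main": {"java.lang.String", "com.example.Helper"},
    "android.app.Activity": {"com.example.Main"},
}
filtered = drop_platform_classes(raw)
```

After filtering, only the application classes and the references between them remain, which keeps the resulting graph small.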

4.1.5. Results of AHAT

AHAT requires Android 4.0 or later. We test it on Google’s Galaxy Nexus 3, whose environment is shown in Table 4.

The analysis process includes the reading of dumpheap files, binary data parsing, class reference relationship analysis, and the creation of result files. The result is stored in the dumpheap folder of the SD card.

4.2. Design and Implementation of Server Side

The PC server consists of three parts: the establishment of ORG, the establishment of ORGB, and graph matching. The architecture of the PC server is shown in Figure 6. After an ORG is created, it is sent to the detection module, where it is matched against ORGB with the λ-VF2 algorithm.

4.2.1. The Establishment of ORG

ORG is a digraph built from the information obtained on the Android client. Its nodes represent classes and its edges represent the references between classes; system classes are excluded. The flow chart of the ORG establishment module is shown in Figure 7.

Nodes are identified in the program by numeric IDs, so the class names in the file need to be converted to IDs. We therefore create an index file that assigns an ID to each class. During parsing, each class name is looked up in the index file and added to ORG as a node. When the parser encounters the string “Referrers by type”, it adds the referencing class and establishes a directed edge from that node to the referenced node. When it encounters “Referees by type”, it reads the referenced class and adds it to ORG with the corresponding directed edge.
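The parsing procedure above can be sketched in Python. Only the two marker strings “Referrers by type” and “Referees by type” come from the text; the rest of the file layout (a hypothetical "Class: <name>" line opening each record, followed by one class name per line under each marker) is an assumption made for illustration, and an in-memory dictionary stands in for the index file.

```python
def build_org(lines):
    """Build an ORG as (index, edges) from an assumed text layout.

    Assumed layout (hypothetical): "Class: <name>" starts a record; the
    markers "Referrers by type" and "Referees by type" are each followed
    by one class name per line.
    """
    index = {}     # class name -> numeric node ID (stands in for the index file)
    edges = set()  # directed edges as (from_id, to_id)

    def node_id(name):
        # Assign the next free ID the first time a class name is seen.
        if name not in index:
            index[name] = len(index)
        return index[name]

    current = None  # ID of the class whose record is being parsed
    mode = None     # "referrers", "referees", or None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Class: "):
            current = node_id(line[len("Class: "):])
            mode = None
        elif line == "Referrers by type":
            mode = "referrers"
        elif line == "Referees by type":
            mode = "referees"
        elif current is not None and mode == "referrers":
            edges.add((node_id(line), current))  # referencing class -> current class
        elif current is not None and mode == "referees":
            edges.add((current, node_id(line)))  # current class -> referenced class
    return index, edges
```

Storing edges in a set means a reference reported twice, once as a referee of one class and once as a referrer of the other, yields a single directed edge.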

4.2.2. The Establishment of ORGB

ORGB is a digraph used to express the feature of malicious code. ORGB only collects the classes of malicious code as nodes, and the class list of malicious code is obtained by manual analysis. The flow chart of ORGB establishment module is shown in Figure 8.

4.2.3. Detection Module

In this part, we propose the λ-VF2 algorithm. When the value of λ is 1, the λ-VF2 algorithm degrades to the original VF2 algorithm. In the experiments, different values of λ lead to different results. The flow of the detection module is shown in Figure 9.

The program first reads the value of λ and selects an ORG, and then matches the selected ORG against every ORGB in the malware library. The matching process terminates at the first successful match. For the convenience of the experiment, ORG and ORGB are stored in binary files without node or edge attributes.
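The detection flow can be sketched as a loop over the malware library that stops at the first match. The Python below is an illustration under stated assumptions, not the λ-VF2 implementation: it uses a brute-force search instead of VF2 state-space pruning, and it assumes that λ is the minimum fraction of ORGB edges that must be preserved by an injective node mapping (so that λ = 1 requires every edge to match). The function names `matches` and `detect` are hypothetical.

```python
from itertools import permutations


def matches(org_nodes, org_edges, orgb_nodes, orgb_edges, lam):
    """Brute-force tolerant matching (assumed reading of lambda).

    Returns True if some injective mapping of ORGB nodes into ORG nodes
    preserves at least lam * |ORGB edges| of the ORGB edges.
    """
    needed = lam * len(orgb_edges)
    orgb_nodes = list(orgb_nodes)
    for image in permutations(org_nodes, len(orgb_nodes)):
        mapping = dict(zip(orgb_nodes, image))
        kept = sum((mapping[u], mapping[v]) in org_edges for u, v in orgb_edges)
        if kept >= needed:
            return True
    return False


def detect(org, library, lam):
    """Return the name of the first ORGB in the library that matches, else None."""
    org_nodes, org_edges = org
    for name, (orgb_nodes, orgb_edges) in library.items():
        if matches(org_nodes, org_edges, orgb_nodes, orgb_edges, lam):
            return name  # stop at the first successful match
    return None
```

The brute-force search is exponential in the number of ORGB nodes and is only suitable for very small graphs; it is meant to show the role of λ and the early termination, not to replace VF2.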

5. Experiments

5.1. Setup

In our experiments, we run the Android apps and extract original data by the tools we developed. We construct ORG and test it with the malware dataset.

(i) Android Setup. We extract memory data on a real device. Table 5 shows the experiment environment.

(ii) PC Setup. The ORG is sent to PC server. The environment of PC server is shown in Table 6.

5.2. Datasets

We use two kinds of datasets in our experiments: simulative malicious samples and real malware samples.

5.2.1. Simulative Samples

Each simulative sample is constructed manually and consists of two packages, one malicious and one benign. Within a given category of simulative malware, different samples contain different benign packages but the same malicious packages. The advantage of simulative samples is that we can control the scale and behavior of the malware. In the experiments, we construct 10 simulative samples whose malicious code is basically the same. To test the different effects of VF2 isomorphism, VF2 monomorphism, λ-VF2 isomorphism, and λ-VF2 monomorphism, we adjust the malicious code to simulate attacks. Table 7 shows the number of simulative code samples of each type.

(1) Origin samples: the malicious code is the same, and the benign parts are different.

(2) Extra Reference samples: these samples simulate malware that tries to avoid detection by adding disturbing references. The classes in the malicious code are identical to the original ones, but several new meaningless references are added between classes.

(3) Extra Class samples: new classes are added to the original malicious code to simulate malicious variants.

(4) Class Replacement samples: starting from the variants of the simulative malicious code, some classes are deleted and some classes are added.

5.2.2. Real Malicious Samples

We also collect two kinds of real malware, which are shown in Table 8.

To extract the ORGB of the given category of malware as the dynamic feature, we select some samples from each category randomly and then analyze them manually.

The APK file is generated by packaging with the dx tool. We use the JD decompiler to recover the source code and obtain the classes. By comparing samples, we identify the malicious classes. Classes of malicious code are generally stored in independent packages, which makes it possible to identify malicious categories manually. Figure 10 shows the file structures of two APKs that contain ADRD malicious code. Both APKs contain the malicious package “xxx.yyy”. In this way, we obtain the class list of ADRD.

5.3. Experimental Results on Simulative Samples
5.3.1. Simulation Sample Test Results and Analysis

In our experiments, we first construct ORGB from the Origin samples. Then, we construct the complete ORG of each of the 10 samples. Finally, we match ORGB against each ORG with the four kinds of VF2 algorithm. Experimental results are shown in Table 9, where λ is 0.8.

As Table 9 shows, all algorithms completely detect the original malicious code when new classes are added for interference. The reason is that the new classes only appear as new nodes in ORG, so ORGB is still a subgraph of ORG. This indicates that our method is effective against variants that add new classes.

The VF2 subgraph isomorphism algorithm is unable to detect the Extra Reference attack. The reason is that the added meaningless references introduce new edges in ORG, whereas subgraph isomorphism requires a complete match of edges; namely, each new edge would have to appear in both ORG and ORGB.
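The difference between the two matching notions can be illustrated with a small Python sketch. It uses a brute-force search rather than VF2, and the graphs are toy examples constructed for illustration (a three-class pattern and a target with one extra reference), not data from the experiments; the function name `find_embedding` is hypothetical.

```python
from itertools import permutations


def find_embedding(pattern_nodes, pattern_edges, target_nodes, target_edges, induced):
    """Return True if the pattern embeds in the target.

    induced=True  -> subgraph isomorphism: the mapped nodes must have
                     exactly the pattern's edges between them.
    induced=False -> monomorphism: every pattern edge must exist in the
                     target; extra target edges are ignored.
    """
    pattern_nodes = list(pattern_nodes)
    for image in permutations(target_nodes, len(pattern_nodes)):
        m = dict(zip(pattern_nodes, image))
        mapped = {(m[u], m[v]) for u, v in pattern_edges}
        if not mapped <= target_edges:
            continue  # some pattern edge is missing in the target
        if induced:
            chosen = set(image)
            between = {(u, v) for u, v in target_edges if u in chosen and v in chosen}
            if between != mapped:
                continue  # target has extra edges among the mapped nodes
        return True
    return False


# Toy pattern A -> B -> C, standing in for an ORGB.
pattern_nodes = ["A", "B", "C"]
pattern_edges = {("A", "B"), ("B", "C")}
# Toy target with one extra reference 3 -> 1 added.
extra_ref_edges = {(1, 2), (2, 3), (3, 1)}

iso_found = find_embedding(pattern_nodes, pattern_edges, [1, 2, 3], extra_ref_edges, induced=True)
mono_found = find_embedding(pattern_nodes, pattern_edges, [1, 2, 3], extra_ref_edges, induced=False)
```

In this toy case the extra edge defeats the induced (isomorphism) check, while the monomorphism check still succeeds because it ignores edges that are absent from the pattern.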

Extra Reference and Class Replacement samples are only partially detected by λ-VF2 subgraph isomorphism. This is because the impact of the added references is not completely eliminated and the matching condition remains too strict.

λ-VF2 monomorphism has the weakest constraint and succeeds in all four kinds of detection. In practice, even malicious code of the same kind is not totally identical, and the objects created in memory differ. Considering these factors, λ-VF2 monomorphism is the best choice, although its effectiveness still needs to be verified on real malware samples.

5.3.2. Obfuscated Variant Detection of Simulative Code Samples

Code obfuscation is the most common technique used in malware. With code obfuscation, malware can easily hide its malicious characteristics or rapidly generate variants, which allows it to evade static detection.

ProGuard is a well-known open source code obfuscation tool that is integrated into the Android SDK. To enable it, “proguard.config=${sdk.dir}/tools/proguard/proguard-android.txt:proguard-project.txt” needs to be added at the end of the properties file.

In the experiments, we use ProGuard to obfuscate four Origin simulative samples and regenerate their ORGs. Then, we match them against the original ORGB with the λ-VF2 monomorphism algorithm. Experimental results show that all four ORGs are matched successfully.

5.4. Experimental Results on Real Samples
5.4.1. Effect of VF2 Algorithm on Malicious Code Detection

The VF2 algorithm is an exact graph matching algorithm, which requires a complete match of the subgraph. It achieves high accuracy with a low false positive rate. However, noise makes a complete match unlikely, so its practicability needs to be further tested.

We test the ADRD and Bgserv categories with the VF2 algorithms, with the value of λ set to 0.8. The experimental results are shown in Table 10 and Figure 11.

As depicted in Table 10 and Figure 11, the success rates of VF2 subgraph isomorphism and VF2 monomorphism are low. The main reasons are as follows:

(1) The feature of the malicious code is not sufficiently extracted because of the differences between samples of each category.

(2) During extraction, the malicious process dynamically creates and destroys classes, so the key feature may be only partially loaded in memory.

(3) Both algorithms perform exact matching, so the two factors above can cause the matching of ORGB and ORG to fail.

It can be concluded that reducing the matching precision can lessen the effect of noise and achieve higher matching accuracy.

5.4.2. Effect of λ-VF2 Algorithm Varying Precision

The λ-VF2 monomorphism algorithm is effective on real malicious code. The value of λ strongly affects the matching results. As λ decreases toward 0, the matching precision falls and the false positive rate increases. As λ increases toward 1, the matching precision rises and the false negative rate increases. Thus, a proper value of λ needs to be determined experimentally.

To obtain the error rates as λ varies, we use a malware group and a benign app group for each test value, with the same number of apps in both groups. λ starts from 0.5 and increases by 0.05 for each group. We obtain the false negative rate from the malware group and the false positive rate from the benign app group. Experimental results are shown in Table 11.
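The two rates in this procedure can be computed as in the following Python sketch. The function names `error_rates` and `lambda_values` are hypothetical, and the sketch takes as input one boolean per app (True when the detector reports a match); it does not reproduce any measured result.

```python
def error_rates(malware_detected, benign_detected):
    """Compute (false_negative_rate, false_positive_rate).

    malware_detected: one boolean per malware app, True if it was matched.
    benign_detected:  one boolean per benign app, True if it was matched.
    """
    false_negative_rate = sum(not d for d in malware_detected) / len(malware_detected)
    false_positive_rate = sum(benign_detected) / len(benign_detected)
    return false_negative_rate, false_positive_rate


def lambda_values(count, start=0.5, step=0.05):
    """Return the first `count` test values of lambda: start, start + step, ..."""
    # Rounding avoids floating-point drift such as 0.6000000000000001.
    return [round(start + i * step, 2) for i in range(count)]
```

For each value returned by `lambda_values`, the detector would be run on both groups and `error_rates` applied to the outcomes, giving one row of the kind of table described above.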

As Table 11 shows, when λ is 0.9, the miss rate (false negative rate) reaches 0.5, which cannot meet practical needs. When λ is 0.75, the false rate (false positive rate) reaches 0.23, which is unsatisfactory. Thus, we select the value of λ from 0.75 to 0.85. The variation of the miss rate and the false rate is illustrated in Figure 12. The results of the second group of experiments are shown in Table 12.

As Table 12 shows, when λ is 0.85, the miss rate reaches 0.69, which cannot meet practical needs. When λ is 0.7, the false rate reaches 0.31, which is unsatisfactory. Thus, we select the value of λ from 0.7 to 0.8. The variation of the miss rate and the false rate is illustrated in Figure 13.

As observed in the two groups of experiments, as λ rises, the miss rate of malicious code increases while the false rate decreases, so the two rates trade off against each other. In practice, to keep both the miss rate and the false rate acceptable, we set the value of λ according to the needs. From the experiments, it can be concluded that when λ is around 0.85, we can achieve a better performance.

6. Conclusion

In this paper, we present ORG to depict the references between objects allocated in heap memory and extract ORGB from ORG as the feature of Android malware. We propose Demadroid, a dynamic system for Android malware detection. After extracting ORG from memory, Demadroid matches it against the ORGB of each malware category with the λ-VF2 algorithm. Experimental results demonstrate the effectiveness and efficiency of our algorithm. Demadroid can also effectively resist obfuscated attacks and detect the variants of known malware to meet the demand for actual use.

Our main future work is to further optimize the graph matching algorithm and the ORG establishment. We also plan to build a virus library in the cloud and combine the algorithm with cloud computing. In this way, the efficiency and accuracy of our framework can be improved in various scenarios.

Disclosure

Professor Hui He and Weizhe Zhang are the corresponding authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work is supported by the National Key Research and Development Program of China under Grant no. 2016YFB0800801 and the National Science Foundation of China (NSFC) under Grant nos. 61472108 and 61672186.