Abstract

Malware targeting devices connected to the Internet of Things (IoT) is evolving rapidly, while the IoT itself has become a core component of the fourth industrial revolution. Many IoT devices use the MIPS architecture, with a large proportion running embedded Linux operating systems, yet the automatic analysis of IoT malware remains an open problem. We propose a framework to classify malware on IoT devices by using MIPS-based system behavior (system calls, or syscalls) obtained from our F-Sandbox passive process together with machine learning techniques. The F-Sandbox is a new type of IoT sandbox, automatically created from the real firmware of specialized IoT devices and inheriting the specialized environment packaged in that firmware, thereby providing the diverse environments that are an important characteristic of an IoT sandbox. This framework classifies five families of IoT malware with F1-Weight = 97.44%.

1. Introduction

The rapid growth of the fourth industrial revolution, driven by the development of the Internet of Things (IoT), leads to an unprecedented revolution in cyber-physical systems and provides rich utilities to users. The number of interconnected devices was envisaged to exceed 50 billion by 2020, an estimated 8 devices per person [1]. Such an enormous amount will impact our digital lives in many application domains: transportation, healthcare, smart homes, smart cities, medical and health equipment, energy management, etc. However, in parallel with developing IoT technology, there is a security issue of information leakage, since anything can become a spy device anytime, anywhere. Over the last decade, the amount of malware on IoT devices has exploded. In the first half of 2018, more than 120,000 IoT malware instances were detected by the Kaspersky IoT Lab [2], more than triple the amount of IoT malware seen throughout 2017. Kaspersky Lab warns that the snowballing growth of malware families for smart devices continues a dangerous trend: 2017 also saw the number of smart device malware modifications rise to 10 times the amount seen in 2016.

Recorded attacks show that attacks targeting IoT devices have become critical. In September 2016, IoT malware built from Linux/Mirai was responsible for 1.1 Tbps DDoS attacks directed at the Dyn Domain Name System (DNS) provider [3]. In 2017, Linux/Brickerbot, a botnet similar to Mirai, infected more than 10 million IoT devices around the world [4]. There are many vulnerabilities that attackers can use to obtain privileges on IoT devices. OWASP has identified the ten main issues [5], among which insecure firmware, insecure web interfaces, and insufficient authentication are mentioned.

These problems have prompted further studies [6–13] and attracted widespread attention from researchers to topics such as IoT malware. The proposed methods can be divided into two main categories. Static methods analyze and detect malicious files without executing them. In static malware analysis, analysts reverse an executable file into assembly code to deepen their understanding of malware activity. Static analysis [8, 9] relies on extracting various characteristics from the executables, such as Printable String Information (PSI), Function Length Frequency (FLF), operational codes (opcodes), n-grams of byte sequences, header sections, and so on. Dovom et al. [11] used a fuzzy pattern tree over the opcodes of executables for detection and classification on IoT nodes. One of the major advantages of static analysis is the ability to observe the structure of malware: all possible execution paths in a malware sample can be examined without considering the processor architecture. Although static analysis has its advantages, it also has limitations. The key disadvantage of this approach is that it is unable to detect complex and polymorphic malware. Therefore, static analysis alone might not be sufficient to identify malware and should be complemented by dynamic analysis [14]. The dynamic approach consists of monitoring executable files during their runtime to detect abnormal behaviors. It is performed by collecting information such as API calls, network behavior, instruction traces, registry changes, memory writes, and so on during the running process. Based on these captured data, researchers have built machine learning (ML) or deep learning (DL) models to detect whether a target file is malware or not [11, 15, 16].

Almost all malware research has focused on computing devices with the Intel architecture (x86-64) [17, 18] and has recently switched to developing frameworks to detect IoT malware [4, 15, 19, 20], especially on the ARM architecture. There have been many frameworks that collect the system calls of malware on computers and mobile devices and analyze them automatically with machine learning. Deep4MalDroid [15] extracts the Linux kernel system calls from executing apps on Android, generates a weighted directed graph, and then applies a deep learning framework resting on the graph-based features. Martin et al. [20] proposed a framework for dynamic and static analysis with a large dataset extracted from Android applications and detected malware using a fusion of features and voting between classifiers. However, no framework can collect the system calls of IoT samples, automatically analyze their logs, and then evaluate a classification model holistically on a full IoT dataset.

In general, the above frameworks have three main components: collection of IoT executable samples (including malware and benign executables), behavior extraction, and detection/classification. Concerning the first component, IoTPOT, developed by Pa et al. [19], was the first honeypot to mimic IoT devices, allowing the authors to capture more than 4,000 IoT malware samples (according to [9]). Another IoT malware database worth mentioning contains more than 9,000 samples [21]. Besides IoT malware samples, it is crucial to collect benign files in order to implement detection algorithms. Brash [22] created 1,078 benign and 128 malware samples for ARM-based IoT applications. Similarly, Nguyen et al. [16] collected 10,033 ELF files, including 4,002 IoT malware samples and 6,031 benign files from different sources. Samples were then labeled as malware/benign with VirusTotal [23] before being input to classifiers.

The second component comprises analyzing and logging executable files during their runtime to detect abnormal behaviors. The most important part of this approach is to build a sandbox, i.e., an emulator, in which executable files reveal all their behaviors. In such an environment, malware can only affect the virtual environment and not physical devices. Researchers use QEMU [24], a very popular open-source system emulator, to deal with this problem. QEMU supports many types of processors, such as ARM, MIPS, PowerPC, x64, and x86, that are popularly used in embedded devices [25]. Some works focus on developing sandboxes [21, 26–28]. These sandboxes are used as the core of IoT malware analysis and detection frameworks such as [4, 19]. However, these frameworks have drawbacks that limit their malware detection capabilities. IoTBox [19] built a sandbox to capture and analyze the Telnet behaviors of IoT malware used for DDoS attacks. This approach can be useful for detecting abnormal network behaviors, but it cannot detect malware that behaves mostly inside the operating system of the device, such as Linux/TheMoon [29]. Costin et al. [4] proposed a framework to collect dynamic malware features based on the open-source Cuckoo sandbox [26] to determine whether a Linux ELF file is malware or not. However, this framework does not present how the collected data were analyzed, nor the precision and accuracy of the obtained results. Rare [30] focused on how to activate malware on routers by discovering static and dynamic information to build a suitable environment for the malware, but Rare only used OpenWrt to build the emulated router, did not emulate NVRAM, and did not address detecting malware on the router. Following the same approach, Chang et al. [31] proposed an IoT sandbox that supports 9 kinds of CPU architectures, including ARM (EL, HF), MIPS, MIPSEL, PowerPC, SPARC, x86, x64, and sh4.
Finally, in the third component, machine learning- or deep learning-based classifiers are used for classifying malware and benign files.

Current sandboxes have been built on basic environments, comprising a common Linux-based operating system and several additional monitoring tools. This strongly impacts the captured behaviors of a target executable file and the whole detection process. Firmware is built and packaged for specialized devices, with specialized functions in many environments that have little in common. Many programs from such firmware cannot run in traditional malware analysis sandboxes based on basic environments. It is also hard to install the required firmware-like environments on a basic environment because they are packaged inside unpublished firmware. Therefore, we need to generate multivendor, firmware-like environments: an IoT sandbox should be able to set up not only a basic environment but also many environments resembling the firmware of physical devices. To our knowledge, there are currently no sandboxes that can build environments based on actual device firmware, emulate the NVRAM of devices, and collect enough syscalls to classify malware.

The MIPS processor architecture [32] appears in many network devices such as routers, wireless transmitters, and cameras [6, 25]. A program's behaviors differ when running on different processor architectures and operating systems. Therefore, studying malware on devices using the embedded Linux operating system and MIPS processors (MIPS ELF) is necessary. Many studies have focused on detecting malware through syscall behavior with good detection results [33–35], but, to our knowledge, no research has been carried out to detect malware via syscalls in MIPS ELF files. Previous studies focused on detecting malware on the Windows operating system for the i386 processor architecture. Canzanese et al. [33] evaluated many machine learning and deep learning methods with positive results. Malware can also be detected on the Android operating system via syscalls, as in [34]. Asmitha et al. [35] experimented with detecting malware on Linux through syscalls using many machine learning methods, but the dataset is small, with only 668 samples, and the work was also done on the i386 architecture.

In this paper, we propose a novel framework for analyzing MIPS ELF files based on syscall behaviors. We have collected and created a database of 3,773 MIPS ELF malware samples from Detux [21], IoTPOT [19], and VirusShare [36]. We have built a specific QEMU-based sandbox, inherited from Firmadyne [25] and Detux [21], aimed at monitoring the system calls of the executable file. This sandbox, named F-Sandbox, can self-configure the suitable requirements for a target MIPS executable file to reveal all its behaviors. Within this framework, we also implement popular machine learning classifiers such as SVM, Random Forest, and Naive Bayes to evaluate the obtained data. The experiments show that our proposed framework can classify malware on MIPS with high accuracy, with F1-Weight up to 97.44%. Our main contributions in this paper are as follows:
(i)Proposing a novel sandbox which automatically sets up the adaptive environment for activating MIPS ELF files.
(ii)Comparing and selecting a suitable method of extracting features and machine learning classifiers for MIPS-based IoT malware detection purposes.
(iii)Combining feature selection, feature reduction, and machine learning methods with suitable parameters to build a novel framework for detecting MIPS-based IoT malware with high precision.
(iv)Supporting open-source activities: we make our system available to the research community under an open-source license (GPL) to encourage further research into IoT. For more information about the source code, please see https://gitlab.com/Nghiphu/c500-sandbox. The F-Sandbox system and malicious code detection module are provided at http://firmware.vn, where the IoT dataset and instructional materials are updated and provided to the community.

The remaining sections of the paper are organized as follows: Section 2 reviews related work; Section 3 presents the F-Sandbox; Section 4 describes the IDMD framework; Section 5 covers the experiments; finally, Section 6 presents the conclusions and directions for future development.

2. Related Works

To perform dynamic analysis, the sandbox has a very important role. There are two types of sandboxes: physical and virtual. Physical sandboxes are based on real hardware components such as RAM, CPU, and network peripherals, which give a better environment for malware to reveal all its behaviors. However, physical sandboxes are difficult to customize and to store/restore their states. This type is not suitable for IoT-based malware because of the diversity of hardware in IoT devices. Unlike PCs, IoT devices have various processor architectures such as MIPS, ARM, PowerPC, and SPARC. To deal with this problem, a virtual sandbox, based on simulation and virtualization technology, is often used for monitoring executable files. In general, a virtual sandbox is built on QEMU [24], a generic and open-source machine emulator and virtualizer. QEMU supports 26 different CPU architectures, notably IoT processor architectures such as MIPS and ARM, and supports Windows and Linux operating systems. The main challenge for a virtual sandbox is the difficulty of setting up the same functional configuration as the real device, especially network peripherals. Two main virtual sandboxes that we can mention are IoTPOT [19] and Detux [21].

Detux is an open-source sandbox based on QEMU that supports traffic analysis of Linux malware on five different CPU architectures: MIPS, MIPSEL, ARM, x86, and x64. The first problem with Detux is that it does not virtualize network peripherals, so malware can infect other devices through external connections. Second, interaction with the operating system is not considered, so information such as file creation, file deletion, or file reads is missing. Finally, Detux executes files in a Debian Linux operating system, which is not a common environment on IoT devices; therefore, it may be unsuitable for some malware.

IoTBox [19] is a sandbox aimed at analyzing the network behaviors of IoT malware. It supports 8 different processor architectures, such as MIPS, ARM, MIPSEL, and PowerPC. It can detect common types of DoS attacks, such as SYN flood, UDP flood, ACK flood, and scans (Telnet, UDP, and TCP scans on some special ports). DoS traffic is blocked or allowed only at a low packet frequency to avoid harming real systems. However, limited to emulating IoT devices from only a few specific vendors, IoTBox is not suitable for analyzing all kinds of malware that can infect IoT devices. Moreover, when executing and monitoring malware, DoS attack behavior may not be executed immediately; therefore, we may have to wait a long time to observe it.

Sandboxes capture two main data sources during the execution of target files: system calls and network behaviors. Syscalls can be collected with different tools such as Strace or Kernel probes.
(i)Strace [37] is a tool on the Linux operating system that monitors running programs, collecting their system calls, including the names, parameters, and results of the calls. Strace can track processes spawned by the process it is tracing. It has been used in many works to collect the interactions of executable files in the Android environment [34] or on Linux [35].
(ii)Kernel probes, abbreviated as Kprobes, is a facility that allows dynamically breaking into kernel routines and collecting debugging information, one kind of which is syscalls. Our sandbox uses Kprobes built into the kernel to monitor the syscalls of all processes running on the firmware of an IoT device. Strace is sometimes unsuitable for tracking a system. First, many processes may start running before Strace does, so system calls can be missed. Second, Strace can easily be resisted by antidebug techniques, such as checking the ptrace status in source code. During the process of collecting syscalls in our experiments, we have encountered such malware samples.
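Regardless of the collection tool, the raw trace must be reduced to an ordered sequence of syscall names before feature extraction. As a minimal sketch (the regular expression assumes the typical `strace -f` line format, e.g. `1234 read(3, "abc", 4096) = 3`; the function names are illustrative, not part of the framework), such a parser might look like:

```python
import re

# Typical strace line: 'read(3, "abc", 4096) = 3'
# With -f, lines are prefixed by the PID: '1234 read(3, ...) = 3'
SYSCALL_RE = re.compile(
    r'^(?:(?P<pid>\d+)\s+)?(?P<name>\w+)\((?P<args>.*)\)\s*=\s*(?P<ret>-?\w+)')

def parse_strace_line(line):
    """Extract (pid, syscall name, return value) from one strace log line."""
    m = SYSCALL_RE.match(line.strip())
    if not m:
        return None
    pid = int(m.group('pid')) if m.group('pid') else None
    return pid, m.group('name'), m.group('ret')

def syscall_sequence(lines):
    """Reduce a raw strace log to the ordered sequence of syscall names."""
    seq = []
    for line in lines:
        parsed = parse_strace_line(line)
        if parsed:
            seq.append(parsed[1])
    return seq
```

Lines that do not match the syscall pattern (signal notifications, truncated output) are simply skipped, so the resulting sequence contains only syscall names in execution order.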

Concerning network behaviors, a popular tool is INetSim [38], a Linux-based software package containing Perl scripts that simulate many network services such as DNS, HTTP, and FTP. PyNetSim [39] can be considered an upgrade of INetSim for IoT devices. PyNetSim is developed in Python 3 and can detect malware protocols, supporting scripts that interact with IoT malware such as DDoS bots, Mirai, and LizardStresser.

It is important to note that these previous works did not focus on virtualizing network peripherals. This considerably limits monitoring of IoT malware behaviors, especially those of IoT botnet malware. This issue was indirectly addressed by Chen et al. [25]. They proposed a framework named Firmadyne, aimed at emulating router firmware web interfaces using QEMU. Firmadyne can autoconfigure a suitable emulation environment for a wide range of routers, enabling a dynamic analysis of 23,035 firmware images gathered from 42 device vendors. This method does not rely on physical hardware to perform the analysis, unlike Avatar [40]; instead, Firmadyne emulates the firmware's nonvolatile memory to execute the firmware web interface. Once the router web interface is emulated, the popular scanning framework Metasploit is used for exploring vulnerabilities and their corresponding exploits. However, Firmadyne analyzes only the web interfaces and ignores the execution of the firmware operating system. Hence, it cannot trace abnormal behaviors to detect malware, as shown in the analysis of the Linux/TheMoon and Linux/Mirai malware experiments. Based on Firmadyne, we have built a novel sandbox that can set up the network configuration for most IoT executable files, both malware and benign. This sandbox, named F-Sandbox, is presented in the next section and is a part of the IDMD framework.

3. F-Sandbox Architecture

Our proposed sandbox uses an instrumented kernel built with Kprobes, allowing tracing of Linux kernel function calls and any instruction inside the kernel, as well as inspection of the registers. Using INetSim/PyNetSim to simulate the Internet, the F-Sandbox provides a full environment that reveals the behaviors of an ELF file, including both syscalls and network interactions [41], during its runtime. The results collected by F-Sandbox are the static analysis information of the sample, the network data in the form of a pcap file, the INetSim/PyNetSim log of interactions with the simulated Internet, and other system behaviors in the form of a log file. The F-Sandbox structure has 4 main components, shown in Figure 1: Sandbox Controller, Virtual Machine, QEMU Monitor, and INetSim/PyNetSim server.

(i)The Sandbox Controller interacts with the QEMU monitor by calling commands that display the network configuration and restore snapshots. It interacts with the virtual machine by calling SSH/Telnet procedures: transmitting executable files from the real machine inside, granting execution rights, requesting file execution, and getting log files. It supports two methods, SSH and Telnet, to connect to and interact with the virtual machine, while Detux only supports SSH; this is our improvement over Detux. When the sandbox controller operates the virtual machine via the Telnet protocol, files are transferred to the virtual machine by wget, ftp, or tftp instead of scp, depending on the specific environment of the firmware.
(ii)The QEMU-based virtual machine includes 2 components: the QEMU image and the Linux kernel. The QEMU image used in F-Sandbox, containing the file system, is extracted from the firmware by the Firmadyne extraction tool. More tools are added to it and its configuration is modified so that it can be emulated in QEMU while still fully inheriting the packaged environment of the firmware. The Linux kernel used in the F-Sandbox is the F-Kernel, improved from the Firmadyne instrumented kernel. The instrumented kernel of Firmadyne can only listen to 15 syscalls; we extend it to listen to all syscalls of the system, aiming to collect all malicious code behavior. Collecting all syscalls can affect the speed of the F-Sandbox; in many situations it is only necessary to collect the high-frequency syscalls, so the F-Kernel allows flexible configuration of the quantity of syscalls to be collected, with different thresholds, by setting a syscall parameter when starting the F-Sandbox. The virtual machine interacts with the INetSim server by sending requests (HTTP, FTP, DNS, etc.) and receiving fake responses from INetSim. enp0s3 is the Ethernet network interface of the host machine and eth0 is the Ethernet network interface of the virtual machine; a virtual bridge connects the host machine and the virtual machines.
(iii)The QEMU Monitor performs virtual machine interaction through snapshot restore; this is the same function as in Detux.
(iv)The INetSim server simulates the Internet environment; all network traffic is redirected to the INetSim server via iptables and DNS poisoning. This is one of the best tools for simulating network services; it helps us emulate the network environment so that malware reveals behaviors such as connecting to a server or scanning ports without harming the real Internet. The INetSim log records these interactions with the simulated Internet.

We use QEMU, as shown in Figure 2, to emulate firmware with loaded ELF files. A good deal of firmware has an SSH server; other firmware has only a Telnet server. Hence, the F-Sandbox controller provides both Telnet and SSH connectors to interact with both types of firmware. Our methods of transferring files to the virtual machines are also diverse: we created an FTP server, an Apache web server, and a TFTP server used by the virtual machine to download the malware file and execute it.

The sandbox controller calls a procedure to start the virtual machine and the QEMU monitor, then either initializes the virtual machine from an existing state using the image recovery function or configures the network, installs the required packages, and creates a snapshot to restore after running a sample.

4. IoT Dynamic Malware Detection Framework

We propose an IoT dynamic malware detection (IDMD) framework to learn and classify IoT malware; all steps are shown in Figure 3. The IDMD framework includes two phases: classification model generation and classification. In the model generation phase, the feature vectors of labeled ELF samples are extracted by the feature vector generation function, and then these feature vectors with their labels are used to train a model. The feature vector generation function includes three steps: sandbox initialization, sandbox execution, and feature selection. In the classification phase, unlabeled samples are processed by the feature vector generation function with the parameters produced in the model generation phase; the extracted feature vectors are then classified by the generated model.

4.1. Sandbox Initialization

As mentioned above, our sandbox can build environments adapted to samples by using suitable firmware. To select suitable firmware for an ELF file to be executed, this step extracts necessary information such as the architecture (MIPS, ARM, etc.), the endianness (big-endian/little-endian), and the bit length (16/32/64). This static information is extracted using a standard Linux utility. It helps our framework select suitable firmware from our database of 253 real MIPS-based firmware images from D-Link, TP-Link, Asus, Belkin, Linksys, etc. The selected firmware is used to initialize our sandboxes. Each firmware image has many parameters, such as which transfer protocols it supports (FTP, TFTP, SCP) and which control protocols it supports (SSH, Telnet). This information is used to configure the sandbox controller and to start and control the sandbox. It is not trivial to determine which firmware is the most suitable for an ELF file; therefore, for each ELF file, our sandbox loads the selected firmware images one by one and keeps the one yielding the most captured system calls for the next steps.
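The architecture, endianness, and bit length mentioned above can all be read directly from the ELF header. As a minimal sketch (using `e_machine` values from the ELF specification; the function name and the subset of architectures shown are illustrative only):

```python
import struct

# e_machine values from the ELF specification (subset)
MACHINES = {3: 'x86', 8: 'MIPS', 40: 'ARM', 62: 'x86-64'}

def elf_info(header):
    """Return (architecture, endianness, bits) from the first 20 bytes of an ELF file."""
    if header[:4] != b'\x7fELF':
        raise ValueError('not an ELF file')
    bits = {1: 32, 2: 64}[header[4]]              # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = {1: 'little', 2: 'big'}[header[5]]   # EI_DATA: 1 = LSB, 2 = MSB
    fmt = '<H' if endian == 'little' else '>H'
    machine = struct.unpack(fmt, header[18:20])[0]  # e_machine field
    return MACHINES.get(machine, 'unknown'), endian, bits
```

For a big-endian 32-bit MIPS sample, this yields `('MIPS', 'big', 32)`, which is then matched against the firmware database.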

4.2. Sandbox Execution

After initializing the sandbox with the suitable parameters, the sandbox controller transfers the input ELF file into the sandbox. Then, after granting all permissions to the ELF file, F-Sandbox starts the monitoring function for 30 seconds. Kprobes collects the syscalls generated by the running program and its child processes. The resulting log is then analyzed and filtered to capture all entries produced by the input ELF file. If the malware sends a network request outside, the request is automatically redirected to INetSim via iptables and the DNS-poisoning technique built into INetSim, and the INetSim server returns a fake response of the type that the sample expects. A sample of the captured syscall log is presented in Figure 4.
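Since Kprobes traces every process on the firmware, filtering the log down to the submitted sample and its descendants means following the process tree from the sample's initial PID. A minimal sketch of this filtering step (the tuple layout of the log entries and the convention that a fork/clone entry carries the child PID are assumptions for illustration, not the framework's actual log format):

```python
def process_tree_syscalls(log, root_pid):
    """Keep syscalls made by root_pid and all of its descendants.

    `log` is an ordered list of (pid, syscall, arg) tuples; a fork/vfork/clone
    entry carries the new child PID as `arg` (format assumed for illustration).
    """
    tracked = {root_pid}
    kept = []
    for pid, name, arg in log:
        if pid not in tracked:
            continue  # syscall belongs to an unrelated firmware process
        kept.append(name)
        if name in ('fork', 'vfork', 'clone') and arg is not None:
            tracked.add(arg)  # follow the newly created child process
    return kept
```

Because the log is processed in order, a child PID is tracked from the moment its parent creates it, so syscalls from the whole process tree are retained while unrelated firmware daemons are dropped.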

4.3. Feature Selection

To avoid time-consuming training and reduce the risk of overfitting, we try to extract the features most relevant to the label and then apply machine learning algorithms to improve the performance of the model. In this paper, there are two main selection steps:
(i)The selection of the quantity and names of syscalls among the log files
(ii)The selection of feature vectors among the n-gram features generated from the selected syscalls

The Linux operating system for the MIPS architecture has 345 different syscalls. However, not all of them are used by the executed ELF files, especially ELF malware. The average number of syscalls used by the 3,223 ELF files is about 146. The most used syscalls, read and send, represent 86% of all captured log entries. The 30 most frequent syscalls represent over 99% of all captured ones. In our experiments, the more syscalls are monitored, the slower the system performs. When we tried to make the F-Kernel monitor more than 45 syscalls with Kprobes, the F-Sandbox slowed down, leading to programs not working because of timeouts and the computation time. F-Sandbox uses Kprobes, which performs dynamic instrumentation via trap instructions followed by handler lookups, incurring a substantial overhead; this is why monitoring too many syscalls slows our system down. Therefore, instead of taking into account all captured syscalls, we remove every syscall except the 30 most frequent ones and then apply the n-gram technique to construct feature vectors.
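Selecting the 30 most frequent syscalls and dropping the rest from each log is a simple counting exercise. A minimal sketch (function names are illustrative; in the framework this reduction happens before n-gram extraction):

```python
from collections import Counter

def top_syscalls(logs, k=30):
    """Find the k most frequent syscall names across all captured logs."""
    counts = Counter(name for log in logs for name in log)
    return {name for name, _ in counts.most_common(k)}

def filter_log(log, keep):
    """Drop every syscall outside the selected set, preserving order."""
    return [name for name in log if name in keep]
```

The selected set is computed once over the training logs and then reused to filter every sample, so training and classification use the same 30-syscall vocabulary.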

An n-gram counts the occurrences of n elements standing next to each other in a sequence; the numbers of these occurrences are stored in a vector that characterizes the sequence. This method has proven effective in detecting malware based on syscalls [17]. In our research, we use n-grams with n = 1, 2, 3 to get the feature vectors of the syscall sequence. With the syscall set selected as above (30 syscalls), the 2-gram and 3-gram feature spaces contain 30 × 30 = 900 features and 30 × 30 × 30 = 27,000 features, respectively.
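The n-gram counting above can be sketched as follows (a minimal illustration: the fixed alphabet is the selected 30-syscall set, and the feature ordering shown is one arbitrary but deterministic choice):

```python
from collections import Counter
from itertools import product

def ngram_vector(seq, alphabet, n=2):
    """Count n-gram occurrences in a syscall sequence over a fixed alphabet,
    returning a vector in a deterministic feature order (|alphabet|**n entries)."""
    grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    features = list(product(sorted(alphabet), repeat=n))
    return [grams[f] for f in features]
```

With a 30-syscall alphabet, n = 2 yields the 900-dimensional vector and n = 3 the 27,000-dimensional one described above; every sample is embedded in the same fixed-order space so the vectors are directly comparable.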

To select the n-gram features that have the strongest relationship with the label, efficient score-based feature reduction techniques such as Chi-square or Information Gain can be used [42]. Chandra and Gupta [43] reported that, in most of their cases, Chi-square was ahead of or at par with Information Gain. In our experiments, Chi-square is more efficient than Information Gain. Therefore, our framework uses Chi-square as the feature reduction method. The Chi-square score is computed between each feature and the target; afterward, we select the desired number of features with the best Chi-square scores. It determines whether the association between two categorical variables in the sample reflects their real association in the population, and it is the technique we chose for selecting the most relevant features in this paper. We rank each feature t with respect to a class c according to the following formula:

χ²(t, c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} (N_{e_t,e_c} − E_{e_t,e_c})² / E_{e_t,e_c}

where e_t = 1 when the content of a sample D contains the feature t and e_t = 0 otherwise; e_c = 1 when D belongs to class c and e_c = 0 otherwise; N is the observed frequency and E is the expected frequency. We choose the number of feature vectors based on the ratio between the total feature score and the cumulative score of the top-ranked features, ordered by descending score. We determined that the 350 features with the highest scores represent 99.9% of the total score of the 900 features (n = 2), as shown in Figure 5. These selected features are then input to the classification step.
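The Chi-square score and the cumulative-score cutoff can be sketched directly from the formula above (a minimal pure-Python illustration, not the framework's actual implementation; in practice a library routine such as scikit-learn's chi2 would typically be used):

```python
def chi_square(samples, labels, feature, cls):
    """Chi-square score between a binary feature (present in a sample or not)
    and a class label: sum over the 2x2 table of (N - E)^2 / E."""
    n = len(samples)
    N = {(et, ec): 0 for et in (0, 1) for ec in (0, 1)}
    for s, y in zip(samples, labels):
        N[(int(feature in s), int(y == cls))] += 1  # observed frequencies
    score = 0.0
    for et in (0, 1):
        for ec in (0, 1):
            row = N[(et, 0)] + N[(et, 1)]
            col = N[(0, ec)] + N[(1, ec)]
            E = row * col / n  # expected frequency under independence
            if E:
                score += (N[(et, ec)] - E) ** 2 / E
    return score

def select_by_cumulative_score(scores, ratio=0.999):
    """Keep the top-ranked features whose scores sum to `ratio` of the total."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    total = sum(scores.values())
    kept, acc = [], 0.0
    for feat, sc in ranked:
        kept.append(feat)
        acc += sc
        if acc >= ratio * total:
            break
    return kept
```

With ratio = 0.999, `select_by_cumulative_score` reproduces the cutoff described above: the top-ranked features jointly carrying 99.9% of the total score (350 of the 900 2-gram features in our measurements) are retained.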

4.4. Classification

Machine learning methods are well studied for detecting malware; each method gives different results on different datasets. Souri and Hosseini [44] reviewed many recent studies and evaluated three popular methods for malware detection, the best results coming from Support Vector Machine (SVM), Random Forest (RF), and Naive Bayes (NB). The SVM classification method has performed well on traditional classification tasks; it has been applied in text classification systems and intrusion detection systems and achieves high accuracy when combined with the n-gram feature extraction method. The advantage of this method is that its generalization capability can be improved by using the principle of structural risk minimization. SVM uses a kernel function to map training data into a higher-dimensional space so that the problem becomes separable. A Decision Tree is a structured hierarchical tree used to classify objects based on a series of rules. Random Forest uses many trees and forecasts by averaging the predictions of the component trees. Naive Bayes is a probability-based classification method, which gives good results in detecting malicious code [45].
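To make the probability-based approach concrete, a minimal multinomial Naive Bayes over syscall n-gram count vectors can be sketched as follows (a pure-Python illustration with Laplace smoothing, assuming count vectors like those from the n-gram step; it is not the tuned classifier used in our experiments, where library implementations of SVM, RF, and NB are applied):

```python
import math

class MultinomialNB:
    """Minimal multinomial Naive Bayes for syscall n-gram count vectors
    (Laplace smoothing; a sketch, not a tuned implementation)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_feat = len(X[0])
        self.log_prior = {}
        self.log_lik = {}
        for c in self.classes:
            rows = [x for x, yy in zip(X, y) if yy == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            totals = [sum(col) for col in zip(*rows)]   # per-feature counts in class c
            denom = sum(totals) + n_feat                # Laplace (add-one) smoothing
            self.log_lik[c] = [math.log((t + 1) / denom) for t in totals]
        return self

    def predict(self, x):
        def score(c):  # log P(c) + sum of count-weighted log likelihoods
            return self.log_prior[c] + sum(cnt * ll for cnt, ll in zip(x, self.log_lik[c]))
        return max(self.classes, key=score)
```

The same fit/predict interface applies to the SVM and Random Forest classifiers compared in the experiments; only the underlying decision rule changes.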

5. Experiments

5.1. MIPS IoT Samples

The IoT sample dataset used for testing is a set of 3,403 MIPS ELF samples, including 3,223 malware samples and 228 benign files. The malware datasets were collected from 2 different sources: Detux [21] and the IoTPOT team [19]. Detux has collected more than 9,000 samples, including more than 3,200 MIPS ELF samples. IoTPOT has collected 4,000 samples, of which 938 are MIPS ELF samples, and only 38 samples coincide with the MIPS ELF samples of Detux. To label the experimental samples, there are 2 main options: evaluation based on a reputable antivirus such as Kaspersky [46], or a combination of antivirus engines [18]. We labeled the samples based on VirusTotal [23]: first, we identified as malicious those samples detected by over 10 engines on VirusTotal, including the reputable engines Kaspersky, Avast, AVG, and Symantec, yielding 3,223 malicious samples. We then labeled these 3,223 malware samples based on the antivirus combination, using Symantec's label as the main one, since this engine has good detection capability and a clear naming scheme. The malicious dataset contains 37 different families, including many popular families such as LightAidra/Aidra/Zendran, Dofloo/Spike/MrBlack/Wrkatk/Sotdas/AES.DDoS/DnsAmp, Gafgyt/BASHLITE/Lizkebab/Torlus, Tsunami/Kaiten, SecurityRisk, Moose, Hajime, and Trojan.Gen, but to ensure classification accuracy and avoid imbalance in the number of samples per class during training, this article selects only the 4 typical malware families with more than 100 samples each. The distribution of the malware in the IoT dataset is shown in Figure 6.

We collected the benign dataset from the basic programs available on embedded Linux, BusyBox built-in programs, the embedded Linux kernel, and some MIPS-based basic applications. The number of clean samples collected is 280. Devices using MIPS chips are specialized devices with very limited application stores, so the number of clean programs collected is correspondingly limited. Thus, the input dataset has five labels: four malware labels, LightAidra, Mirai, Gafgyt, and Dofloo, and one Benign label.

5.2. Collecting Syscall Log

According to [17], the syscall log must be collected within a fixed time window, and a minimum log length of 1,500 syscalls yields the best detection results for malware on the Windows i386 architecture. In our experiments, most syscalls are generated during the first 30 s of sandbox execution; when this time is increased to 60 s, 90 s, or 120 s, the number of syscalls obtained does not change much, so we collect the log for each sample for 30 seconds. The average syscall log length of our sample set is lower than the 1,500-syscall threshold, and each malware family produces a different average log length, as shown in Table 1.

By analyzing the logs collected when executing the samples in F-Sandbox, we found that log files shorter than 50 syscalls are generated by samples that fail to run properly, for example, because of insufficient parameters, missing libraries, or initialization errors. A sample that runs normally produces a syscall log longer than 50 calls. The number of programs issuing more than 1,500 syscalls, the threshold used in [17], is small; most programs issue between 50 and 500 syscalls, as shown in Table 2. The shorter syscall logs of MIPS ELF files can be explained by the fact that programs on IoT devices are often simpler than those on general-purpose computers because of constrained resources and functions.
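The filtering step above can be sketched as follows. This is a minimal sketch under assumptions: the trace format (strace-style lines with the syscall name before the opening parenthesis) and the helper names are illustrative, not the paper's actual code.

```python
# Logs shorter than 50 syscalls are treated as failed executions (per the text).
MIN_LEN = 50

def parse_trace(lines):
    """Extract syscall names from strace-style lines, e.g. 'read(3, buf, 512) = 512'."""
    return [ln.split("(")[0].strip() for ln in lines if ln.strip()]

def is_valid_trace(syscalls, min_len=MIN_LEN):
    """Keep only traces from samples that executed normally."""
    return len(syscalls) >= min_len
```
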

Starting from the initial process created when a sample runs, malware often spawns many child processes: one sample created as many as 1,000 child processes, while most samples create about 3. The number of child processes generated is shown in Figure 7. Therefore, the collected syscall log of a sample is the combined syscall log of all processes generated when executing that sample.

Analyzing the collected log files shows that the malware set uses only 136 distinct syscalls and the benign set uses 127; the two sets largely overlap, with 160 distinct syscalls appearing across the two sets combined. The distribution is highly skewed: the two syscalls read and send account for 86% of all syscall occurrences. Detailed syscall statistics are shown in Figure 8. This characteristic of the syscall logs is consistent with the published behavior of malware sending information to remote destinations.
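The frequency analysis above can be reproduced with a simple counting pass over the traces. This is a generic sketch, not the authors' implementation; the input format (a list of syscall-name lists) is an assumption.

```python
from collections import Counter

def syscall_shares(traces):
    """Return the fraction of total syscall occurrences per syscall name.

    traces: iterable of syscall-name lists, one list per sample.
    """
    counts = Counter()
    for trace in traces:
        counts.update(trace)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

With such a function, the dominance of read and send (86% of occurrences in the paper's dataset) follows directly from summing their two shares.
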

Our analysis also shows that malware on IoT devices uses evasion features such as detecting analysis programs. Several malware samples can detect Strace tracking and do not activate when traced. This supports earlier observations that malware on IoT devices is simple but still uses common evasion techniques.

5.3. Performance Measures

We measure the effectiveness of the classifiers in terms of precision, recall, and F1 score, computed on a per-class basis. The precision for a class c is the fraction of processes classified as c that actually belong to c:

precision_c = TP_c / (TP_c + FP_c),

where TP_c and FP_c are the numbers of true positives and false positives predicted by the classifier, i.e., the numbers of processes correctly and incorrectly classified as belonging to c. The precision measures the relevance of the positive classifications. A precision of 1 indicates that the classifier is always correct when it classifies a process as belonging to class c, whereas a precision of 0 indicates it is never correct when it does so. The recall of class c is the fraction of the processes belonging to c in the ground truth that are correctly classified:

recall_c = TP_c / (TP_c + FN_c),

where FN_c is the number of false negatives, i.e., the number of instances of c misclassified as belonging to another class. The recall measures the sensitivity of the classifier. A recall of 1 indicates that the classifier correctly identifies every instance of class c, whereas a recall of 0 indicates that it never identifies instances of c. The F1 score of a class is the harmonic mean of the precision and recall of that class:

F1_c = 2 · precision_c · recall_c / (precision_c + recall_c).

An F1 score of 0 indicates zero recall or zero precision, whereas an F1 score of 1 indicates perfect recall and precision. In this study, per-class F1 scores are averaged over all classes to provide an overall characterization of a classifier. We consider three averaging techniques that account for the unbalanced representation of the malware classes used in this study:
(i) Microaveraged F1 score: the F1 score computed from the aggregated counts over all classes, characterizing classifier performance on large classes.
(ii) Macroaveraged F1 score: the unweighted average of the per-class F1 scores, characterizing classifier performance on small classes.
(iii) Weighted F1 score: the weighted average of the per-class F1 scores, with weights proportional to class support in the ground truth.
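The three averaging schemes above are available directly in scikit-learn, the library used in our experiments (Section 5.4). The labels and predictions below are illustrative toy data, not results from the paper.

```python
from sklearn.metrics import f1_score

# Toy ground truth and predictions over the paper's five classes.
y_true = ["Mirai", "Mirai", "Gafgyt", "Dofloo", "Benign"]
y_pred = ["Mirai", "Gafgyt", "Gafgyt", "Dofloo", "Benign"]

f1_micro = f1_score(y_true, y_pred, average="micro")      # aggregate counts
f1_macro = f1_score(y_true, y_pred, average="macro")      # unweighted class mean
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # support-weighted mean
```

Note that for single-label multiclass data the microaveraged F1 equals the overall accuracy.
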

5.4. Training and Evaluation

We implemented the experiments with the Scikit-learn Python library 0.19.2 on a MacBook Pro laptop (Core i5, 16 GB RAM). We built 10 datasets from the IoT dataset according to the minimum syscall log length n, with n = 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, to determine which threshold is most suitable for MIPS ELF malware classification. If the minimum log length is too small, there is not enough information to characterize a program. Conversely, if it is too large, many short-running programs cannot be evaluated and malware detection is delayed. The goal is to determine the smallest workable threshold and the most suitable machine learning method: the smaller the threshold, the sooner malware can be detected.
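The construction of the ten threshold datasets can be sketched as below. The data representation (a list of (syscall_list, label) pairs) is an assumption for illustration.

```python
def threshold_datasets(samples, thresholds=range(50, 501, 50)):
    """Build one dataset per minimum-length threshold n = 50, 100, ..., 500.

    samples: list of (syscall_list, label) pairs.
    Returns a dict mapping n -> the subset of samples with log length >= n.
    """
    return {n: [(s, lab) for s, lab in samples if len(s) >= n] for n in thresholds}
```
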

With n-gram feature extraction, if n is too small (1-gram), the extracted information reflects only the frequency of individual syscalls. If n is too large, the feature set becomes very large and the features become sensitive to malware transformation techniques. Recent experiments [33, 46] obtained the best results with 2-grams and 3-grams, so 2-gram and 3-gram features are selected in our model tests.
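An n-gram over a syscall trace is simply a tuple of n consecutive syscall names; a minimal count-based extractor looks like this (a generic sketch, not the paper's code):

```python
from collections import Counter

def ngrams(syscalls, n=2):
    """Count every run of n consecutive syscall names in one trace."""
    return Counter(tuple(syscalls[i:i + n]) for i in range(len(syscalls) - n + 1))
```

Each distinct n-gram becomes one dimension of the feature vector, with its count as the value.
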

After extracting the feature vector with the n-gram method, Chi-square selection is used to reduce the dimensionality of the feature vector to K dimensions. Based on the experiments in [46, 47], we set K = 350.
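In scikit-learn this step is `SelectKBest` with the `chi2` score function; the toy matrix below uses k = 2 in place of the paper's K = 350.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy n-gram count matrix: 4 samples x 3 features, two classes.
X = np.array([[5, 0, 1],
              [4, 0, 2],
              [0, 6, 1],
              [0, 5, 2]])
y = [0, 0, 1, 1]

selector = SelectKBest(chi2, k=2)       # keep the 2 most discriminative features
X_reduced = selector.fit_transform(X, y)
```

The third feature, whose counts are identical across classes, is the one discarded.
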

We experimented with the dataset to select the best parameters for SVM, NB, and RF using grid search. For SVM, we classify with the OVO method, use the Sigmoid kernel, and search for the optimal parameter C in the range (1–100,000) and the gamma value in the range (0.0000001–1). For RF, we tune two hyperparameters: the feature function, which is one of three options (sqrt, log2, sqr), and the number of estimators n_estimators in the range (2–700).
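A condensed version of the RF tuning can be written with scikit-learn's `GridSearchCV`; the grids below are truncated stand-ins for the ranges quoted above, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data just to make the search runnable.
X = np.random.RandomState(0).rand(40, 5)
y = [0, 1] * 20

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_features": ["sqrt", "log2"],   # the 'feature function' hyperparameter
     "n_estimators": [2, 10, 50]},       # truncated version of the (2-700) range
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_
```

The SVM search is analogous, with `C` and `gamma` grids passed to an `SVC(kernel="sigmoid")` estimator.
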

Cross-validation is a popular method in machine learning to assess whether the outcomes of a detection/classification experiment generalize to an independent sample. It comprises techniques such as repeated random subsampling, K-fold, and leave-one-out. K-fold cross-validation randomly divides the set of observations into K groups, or folds, of equal size; each fold in turn is treated as a validation set while the method is fit on the remaining K−1 folds [48]. K-fold validation is suitable for datasets of limited size [49]. In our experiments, we use 5-fold cross-validation with the selected parameter set in the training and evaluation step. The data are divided into five sections; in each run, four of them are used as training data and the remaining one is used as testing data. The reported measures are the averages over the five runs.
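The 5-fold procedure is a one-liner with scikit-learn's `cross_val_score`; the synthetic data below is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(50, 8)
y = rng.randint(0, 2, 50)

# cv=5: five runs, each holding out one fifth of the data for testing.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
mean_score = scores.mean()  # the value reported as the experiment's measure
```
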

5.5. Experimental Results

The table in Figure 9 presents the overall F1-Macro, F1-Micro, and F1-Weighted results for the different machine learning methods and syscall length thresholds. For each model, the three measures are approximately equal, with little deviation among them, which indicates a well-balanced model.

The charts in Figures 10–12 compare the results of the RF, SVM, and NB classifiers with 2-gram and 3-gram features. The experimental results show that 2-grams give the best values for all three algorithms.

The charts in Figures 13 and 14 compare the F1-Weighted values of the three classifiers at different minimum syscall log thresholds. The experimental results show that RF gives the best classification performance, followed by SVM and NB. From both charts, a suitable minimum syscall log threshold for classifying malware in the MIPS ELF set is 400.

From the experimental results, we can see:
(i) With MIPS ELF samples, our framework achieves the highest results using 2-gram feature extraction with the Random Forest classifier and a minimum syscall length threshold of 400. This threshold is smaller than the minimum syscall log length of 1,500 found by Canzanese et al. [33] for Windows malware. A likely explanation is that embedded programs are simpler, with fewer functions than Windows programs, which is characteristic of embedded software; the same reason may explain why 2-grams classify better than 3-grams.
(ii) The good classification results on the IoT dataset, which gathers samples from independent sources, indicate that F-Sandbox works correctly and efficiently.

6. Conclusions and Future Work

In this paper, we proposed a framework for detecting and classifying MIPS ELF malware based on syscalls and machine learning methods. This is the first study specializing in MIPS ELF malware. The main contributions are F-Sandbox, a novel type of IoT sandbox built by improving and integrating the Detux sandbox and Firmadyne emulation, and the MIPS IoT dataset, which we collected and standardized. The study has also revealed many characteristics of MIPS ELF malware and identified the most suitable methods and parameters for detecting it with machine learning. To the best of our knowledge, this paper is the first to provide a comprehensive framework, including not only an overall methodology but also a sandbox environment and related tools, for collecting MIPS-based malware syscalls, classifying samples into their respective families, and evaluating the classification performance. We provide the generated IoT dataset used in our experiments and the source code of our sandbox tool to researchers at https://gitlab.com/Nghiphu/c500-sandbox and http://firmware.vn for academic purposes.

In future work, we will continue developing F-Sandbox so that malware can execute at higher rates and expose more behavior. In this framework, we exploited only the syscall information obtained from F-Sandbox to detect malware; other information, such as network behaviors and process states, will be considered in further research. Static information can be used not only to initialize F-Sandbox but also in combination with dynamic behaviors to detect and classify malware. Therefore, our upcoming work will integrate static information and features with the system and network behaviors extracted from F-Sandbox to detect and classify malware. F-Sandbox currently supports and is evaluated on the MIPS architecture; we will next deploy it for other architectures such as ARM and PowerPC.

Data Availability

Supporting open-source activities, we make our system available to the research community under an open-source license to encourage further research into IoT. For the source code, please see https://gitlab.com/Nghiphu/c500-sandbox. The implementation of the F-Sandbox system and the malicious code detection module are provided at http://firmware.vn, where the syscall logs of the C500-IoT dataset are also updated regularly and provided to the community.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was partially funded by Ministry of Science and Technology of Vietnam, grant number KC.01.19/16-20.