Abstract

Source code transformation is the process of modifying the source code of a program through some operation to produce another, often nearly identical, program. It is commonly performed in cases of piracy, where pirates seek to claim ownership of a software program. Various approaches are practiced for source code transformation and code obfuscation, and researchers have tried to prevent unauthorized modification of source code. Among the existing approaches, the software birthmark was developed with the aim of detecting software piracy. The features extracted from a piece of software are collectively termed its “software birthmark,” and piracy can be detected on the basis of these extracted features. Birthmarks are considered to reside in the source code and executables of certain programming languages. Software birthmarks can help protect software against modification and ultimately preserve its ownership. The proposed study used machine learning algorithms to classify the usability of existing software birthmarks in terms of source code transformation. The K-nearest neighbors (K-NN) algorithm was used to classify the software birthmarks. For cross-validation, the decision rules, decomposition tree, and LTF-C algorithms were used. The experimental results show the effectiveness of the proposed research.

1. Introduction

Source code transformation is the process of modifying the source code of a program through some operation to create an alternative or nearly identical program. It is commonly performed in cases of piracy, where pirates seek to claim ownership of the software program. The transformed source code is usually semantically equivalent to the original program. Transforming source code into another program usually requires a complete programming language front end, data structures for the internal program representation, parsing of the source code, program understanding, meaningful static analysis, and generation of usable source code from the program representation. The software industry faces serious piracy problems. Software piracy can badly damage the software business and eventually cause large losses to the owning organizations, so stopping it is extremely important for the growth of the software industry. Different methods are used to prevent software piracy, including fingerprinting [1, 2], watermarking [3–5], and software birthmarks [6–11]. Watermarks have the weakness that they can be removed by code obfuscation and semantics-preserving transformations, and similar concerns apply to software fingerprints. To overcome these limitations, the idea of the birthmark was introduced, and it is now a broadly acknowledged approach for countering source code transformation and software piracy.

A software birthmark is a set of essential features that can be used to establish the identity and uniqueness of a piece of software. Common uses of birthmarks include detecting software theft, identifying transformations in source code, and analyzing Windows API applications. A birthmark built from more features tends to be more robust and effective, which in turn allows more precise detection of transformations or theft. A software birthmark rests on two essential properties, resilience and credibility [6]. Credibility means that the birthmarks of two independently written programs should differ, whereas resilience means that the birthmark should be preserved, and not damaged, when the program is transformed. Various studies have examined the effectiveness and usability of software birthmarks [6, 12–14]. They discuss applications including source code transformation, code obfuscation, software theft, and piracy, among others. The use of software birthmarks can protect software against modification and ultimately preserve its ownership.

The proposed study used machine learning algorithms to classify the usability of existing software birthmarks in terms of source code transformation. The K-nearest neighbors (K-NN) algorithm was used to classify the software birthmarks. For cross-validation, the decision rules, decomposition tree, and LTF-C algorithms were used. The experimental results show the effectiveness of the proposed research.

The rest of this study is organized as follows. Section 2 presents related work on existing approaches to source code transformation, software theft, and similar problems. The research methodology of the proposed study is presented in Section 3. Results and discussion are given in Section 4. The study is concluded in Section 5.

2. Related Work

Researchers have devised diverse approaches, methods, and solutions to analyze source code transformation and software piracy efficiently and effectively. Numerous practices have been adopted in the software industry to detect and prevent software theft. The idea of the software birthmark was introduced to overcome the drawbacks of software fingerprints, watermarks, and digital signatures, which can be modified or removed by code obfuscation and semantics-preserving transformations. Software birthmarks were established to recognize software theft reliably. The first birthmark was developed by Tamada et al. [15], who extracted four types of birthmarks: inheritance structure, sequence of method calls, constant values in field variables, and used classes. As the field advanced, the birthmark came to be regarded as a significant measure for identifying software piracy, and many researchers tried to develop stronger and more trustworthy birthmarks for detecting it. The notion of a birthmark was initially considered as a unique identification of an object by Neufeld [16] in 1992. Derrick [17] explored the idea and emphasized the use of birthmark details for “protecting” software. This was later termed software theft protection.

The first software birthmark was associated with software theft and was presented as a birthmark for detecting theft of Java programs [15]. Similarly, in 2004, Tamada et al. considered a software birthmark for detecting theft in Windows applications. Myles and Collberg proposed “whole program path birthmarks” for detecting software theft [18]; this birthmark is based on the complete control flow of the program. Diverse categories of birthmarks have been proposed for software theft detection. The proposed study identified a number of important birthmarks that different researchers have proposed for different purposes, mostly theft detection. Spafford and Weeber [19] explored dissecting executable code to analyze data structures, library calls, and system calls. The idea of software forensics was offered for understanding the source of virus and malware infections. Birthmarks were also aimed at facilitating the detection of source code transformation and software theft [20–23]. These studies designed their own birthmarks according to defined software features and then evaluated their effectiveness in terms of credibility and resilience for identifying theft in copies of software.

The authors of [24] presented a dynamic program slicing tool built on a dynamic birthmark: for given inputs, a union of k-gram instruction-sequence sets serves as the birthmark used to identify a program. A formal description of software birthmarks was offered by Tamada et al. [25], who proposed an approach for extracting birthmarks from Java class files. They explored a comparable notion and proposed a framework for evaluating the two significant properties of birthmarks, namely, resilience and credibility. Zeng et al. [26] devised a semantics-based abstract interpretation framework for evaluating software birthmarks. This model defines the two important properties, resilience and credibility. The success of the framework is confirmed by the static API birthmark and the static-gram birthmark. The authors of [27] presented a dynamic birthmark for Java that observes how a program uses objects provided by the Java standard API.
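The k-gram idea mentioned above can be illustrated with a minimal sketch. It is not the tool of [24]: it skips program slicing entirely and simply assumes that an instruction (opcode) sequence is already available. The function names, the opcode traces, and the use of Jaccard similarity for comparison are all assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple


def kgram_birthmark(opcodes: Sequence[str], k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of all length-k windows over an instruction sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)}


def jaccard_similarity(a: Set, b: Set) -> float:
    """Intersection size divided by union size (1.0 if both sets are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical opcode traces: the second is the first with one instruction changed.
original = ["load", "load", "add", "store", "load", "mul", "store"]
suspect = ["load", "load", "add", "store", "load", "sub", "store"]

bm_original = kgram_birthmark(original, k=3)
bm_suspect = kgram_birthmark(suspect, k=3)
similarity = jaccard_similarity(bm_original, bm_suspect)
print(f"similarity = {similarity:.2f}")
```

A single changed instruction alters every k-gram window that overlaps it, so the choice of k trades sensitivity to small edits against the ability to tell unrelated programs apart.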

Transforming source code into an alternative program commonly requires data structures for the internal program representation, a complete programming language front end, meaningful static analysis, parsing of the source code, program understanding, and generation of usable source code from the program representation. Various approaches are used to change source code, and researchers have devised different solutions to counter this problem. The proposed study used machine learning algorithms to classify the usability of software birthmarks in terms of source code transformation. The K-nearest neighbors algorithm was used to classify the software birthmarks.

3. Research Methodology

Efforts have been made to overcome the issues raised by source code transformation and software theft. Researchers have mostly treated software ownership and safety as top priorities. Many research studies have shaped the idea of software birthmarks. Most birthmark approaches relate to Java source code and are used for detecting Java theft. Other significant birthmark approaches and techniques work on the Windows API [28] for detecting software theft. One of the significant notions used in describing a software birthmark is the use of software features. Software can be divided into various parts, mostly features [23]. Together, these features can deliver a fast and reliable identification of the software, which can then be used for detecting theft. To detect source code transformation or software theft, the birthmarks of software applications are compared, and similar birthmarks indicate software piracy. A number of birthmarks were identified in the literature; the details are given in Table 1.

Figure 1 presents the flowchart of the approach used in the proposed study to assess software birthmark usability for source code transformation. It shows the information table (dataset), which contains objects, attributes, and their decisions. The machine learning algorithms were then applied to the information table. The K-NN algorithm was applied for classification. After that, the decision rules, decomposition tree, and LTF-C algorithms were applied for cross-validation.

A dataset developed during a higher studies programme was used to validate the proposed research. It contained a total of 150 entries with three features. Figure 2 shows a visualization of the dataset.

Once the information table had been imported into the proposed system, the reduct was computed first. After that, a rule set was generated. Figure 3 depicts the publications in the dataset by year.
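The reduct step can be sketched as follows, assuming the term is used in its rough-set sense: a minimal subset of condition attributes that still separates objects with different decisions. The brute-force search, the function names, and the five-row table are hypothetical and are not taken from the study's dataset or tooling; the sketch also assumes the full table is consistent.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Tuple[str, ...], str]  # (condition attribute values, decision)


def is_consistent(rows: Sequence[Row], attrs: Sequence[int]) -> bool:
    """True if objects that agree on `attrs` never disagree on the decision."""
    seen: Dict[Tuple[str, ...], str] = {}
    for values, decision in rows:
        key = tuple(values[i] for i in attrs)
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def find_reducts(rows: Sequence[Row], n_attrs: int) -> List[Tuple[int, ...]]:
    """Brute-force search for minimal attribute subsets that keep the table consistent."""
    reducts: List[Tuple[int, ...]] = []
    for size in range(1, n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            # A superset of an already found reduct cannot be minimal.
            if any(set(r) <= set(subset) for r in reducts):
                continue
            if is_consistent(rows, subset):
                reducts.append(subset)
    return reducts


# Hypothetical information table with three condition attributes and one decision.
table: List[Row] = [
    (("static", "java", "api"), "usable"),
    (("static", "java", "kgram"), "usable"),
    (("dynamic", "java", "flow"), "not_usable"),
    (("dynamic", "windows", "api"), "not_usable"),
    (("static", "windows", "kgram"), "usable"),
]

reducts = find_reducts(table, n_attrs=3)
print(reducts)
```

The exhaustive search is exponential in the number of attributes, which is acceptable for a table with three features but would need a heuristic for wider tables.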

Figure 4 depicts the rule set generated in the proposed study.

Finally, the K-NN algorithm was applied. Some cross-validation algorithms were also used; these are discussed in the Results and Discussion section.
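For readers unfamiliar with K-NN, the classification step can be sketched as below. This is a generic textbook version, not the study's implementation: the Euclidean distance, the value k = 3, the numeric feature encoding, and the class labels are all assumptions, since the source does not specify them.

```python
import math
from collections import Counter
from typing import Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def knn_predict(train: Sequence[Sample], query: Sequence[float], k: int = 3) -> str:
    """Classify `query` by majority vote among its k nearest training samples."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda s: math.dist(s[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: three numeric features per object, two decision classes.
train = [
    ((1.0, 1.0, 0.0), "usable"),
    ((1.2, 0.8, 0.1), "usable"),
    ((0.9, 1.1, 0.0), "usable"),
    ((5.0, 5.0, 1.0), "not_usable"),
    ((5.2, 4.9, 1.0), "not_usable"),
    ((4.8, 5.1, 0.9), "not_usable"),
]

print(knn_predict(train, (1.1, 1.0, 0.0), k=3))
print(knn_predict(train, (5.1, 5.0, 1.0), k=3))
```

An odd k avoids tied votes when there are two classes; with more classes or an even k, a tie-breaking rule would have to be chosen explicitly.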

4. Results and Discussion

Several research studies have been conducted to refine software birthmarks for the detection of piracy. The existing approaches include intellectual software asset management [8], detection of software theft [18], plagiarism detection [38], detecting Java theft [39], detecting binary theft [40], semantics-based repackaging detection for mobile apps [41], malware detection [42], detecting code theft [43], detecting the theft of natural language [44], credible, resilient, and scalable detection of software plagiarism using authority histograms [45], detecting plagiarized mobile apps [46], an efficient similarity measurement technique for Windows software [47], detecting common modules in Java packages [48], measuring the similarity of Android applications [49], identifying similar classes and major functionalities [50], and detection at the source code level [48], among others.

The proposed study applied machine learning to assess software birthmark usability for source code transformation. Initially, the K-nearest neighbors algorithm was used to classify the software birthmarks. The experimental results of K-NN were effective, with an accuracy of 98%. Figure 5 presents the frequencies in the dataset in terms of conferences, journals, books, and theses.

Figure 6 compares the algorithms used in the proposed research. The decision rules algorithm has an accuracy of 0.91, the decomposition tree algorithm 0.96, and the LTF-C algorithm 0.64.

Figure 7 graphically represents the coverage of the algorithms used.
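The two quantities reported in Figures 6 and 7 can be computed as in the sketch below. It assumes a convention common in rule-based classifiers, which the source does not state explicitly: coverage is the fraction of objects for which the classifier gives any decision, and accuracy is the fraction of correct decisions among those covered objects. The function name and the sample labels are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def accuracy_and_coverage(
    actual: Sequence[str], predicted: Sequence[Optional[str]]
) -> Tuple[float, float]:
    """Return (accuracy, coverage); `None` in `predicted` means 'no decision'."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one object")
    # Keep only the objects for which the classifier produced a decision.
    decided = [(a, p) for a, p in zip(actual, predicted) if p is not None]
    coverage = len(decided) / len(actual)
    correct = sum(1 for a, p in decided if a == p)
    accuracy = correct / len(decided) if decided else 0.0
    return accuracy, coverage


# Hypothetical test results: one object is left unclassified, one is misclassified.
actual = ["usable", "usable", "not_usable", "not_usable", "usable"]
predicted = ["usable", "not_usable", None, "not_usable", "usable"]

accuracy, coverage = accuracy_and_coverage(actual, predicted)
print(f"accuracy = {accuracy:.2f}, coverage = {coverage:.2f}")
```

Under this convention a classifier can report high accuracy while leaving many objects undecided, so the two figures are best read together.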

5. Conclusion

The software industry is growing with the passage of time, and new innovations are offered to address diverse real-life problems. The role of software applications is evidence of the industry's success. Pirates engage in code obfuscation, source code transformation, and software piracy for profit, typically seeking to claim ownership of the software program. Various approaches are practiced for source code transformation and code obfuscation. Among the existing approaches, the software birthmark was developed with the aim of detecting software piracy. Birthmarks are considered to reside in the source code and executables of certain programming languages. The proposed study used machine learning algorithms to classify the usability of existing software birthmarks in terms of source code transformation. The K-nearest neighbors algorithm was used to classify the software birthmarks. For cross-validation, the decision rules, decomposition tree, and LTF-C algorithms were used. The experimental results show the effectiveness of the proposed research. The decision rules algorithm has an accuracy of 0.91, the decomposition tree algorithm 0.96, and the LTF-C algorithm 0.64.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was sponsored in part by the Research Fund for the Doctoral Program of Liaoning University of International Business and Economics (2019XJLXBSJJ002).