Abstract

Source code management systems (such as Concurrent Versions System (CVS), Subversion, and git) record changes to the code repositories of open source software projects. This study explores a fuzzy data mining algorithm for time series data to generate association rules for evaluating the existing trend and regularity in the evolution of an open source software project. A fuzzy data mining algorithm for time series data was chosen because of the stochastic nature of the open source software development process. The commit activity of an open source project indicates the activeness of its development community, and an active development community is a strong contributor to the success of an open source project. Therefore, commit activity analysis, together with the trend and regularity analysis of that activity, acts as an important indicator to project managers and analysts regarding the evolutionary prospects of the project in the future.

1. Introduction

Understanding software evolution in general and open source software (OSS) evolution in particular has been of wide interest in the recent past. A wide range of research studies have analysed OSS project evolution from different points of view, such as growth [1], quality [2], and group dynamics [3]. However, there are very few studies on commit activity in OSS projects. A commit is a change to a source code entity submitted by a developer through a source code management (SCM) system. SCM systems, such as Subversion (SVN) and git [4], manage the source code files of OSS systems and maintain a log of each change (a.k.a. commit) made to the files. Committing is an important activity of the OSS development approach. Since most OSS developers are volunteers, the success of these OSS projects is mainly determined by the committing activities of developers [5]. Commit activity indicates project activity, which is in turn related to project success [6, 7]. Stakeholders of an OSS project, such as project managers, developers, and users, are interested in its future change behavior. Analysing the commit activity of an OSS project to find trends and regularities in its evolution helps in indicating the future change behavior of the project and supports decision making as far as project usage and management are concerned.

OSS development is a stochastic process. Unlike traditional development, in which the environment is controlled, OSS development is based on contributions from volunteers who cannot be forced to work even if something is of high priority for the project [1]. Along with this unplanned activity, there is a lack of planned documentation related to requirements and detailed design [8]. Classical time series techniques are inappropriate for the analysis and forecasting of data that involve such randomness [8, 9]. Fuzzy time series techniques, in contrast, can handle domains that involve uncertainty.

The commit activity of an OSS project is measured with the number of commits per month metric [5]. Kemerer and Slaughter [9] and Mockus and Votta [10] emphasize that the commits available in an SCM system (such as git [4]) can be used as a metric to study the evolution of OSS systems; these studies also motivate us to choose the number of commits per month as a metric to analyse and predict software evolution.

In this research work, the hybrid approach proposed by Chen et al. [11] (fuzzy set theory combined with a data mining algorithm) is used to generate linguistic rules from time series data. In [11], Chen et al. identified finding applications for their algorithm and validating it as future work. The present study addresses both of these issues. In the proposed work we divide the considered time series dataset into two subsets, a training set and a remaining set. We use the algorithm first to generate association rules from the training set and then validate the accuracy and prediction capability of these rules on the remaining set. High prediction accuracy indicates that the commits in the remaining set exhibit the same regularity and trend in the number of commits performed for Eclipse CDT. Low prediction accuracy indicates that the commits in the remaining set are not consistent with those of the training set and that there is irregularity and a departure from the trend in the commits performed.

The main objective of the present study is to explore the fuzzy data mining approach for time series data to generate association rules for evaluating the existing trend and regularity in the evolution of OSS projects. As commit activity is a good indicator of continuous development activity in an OSS project, another objective of the present work is to develop a commit prediction model for OSS systems. Project managers, developers, and users can use the commit prediction model to understand the future commit activity of an OSS project and then plan, schedule, and allocate resources according to their roles.

The rest of the paper is organized as follows. Section 2 presents the related work on studying software evolution and the fuzzy data mining technique used. Section 3 explains the research methodology. Section 4 presents details of the experimental setup used for performing the study. Section 5 gives the results and analysis, and Section 6 discusses them. Threats to the validity of the results are explained in Section 7. The last section concludes the paper.

2. Related Work

The idea of analysing and predicting software evolution was seen initially in the late 1980s, when Yuen published his papers on the subject in a series of conferences on software maintenance [12–14]. He used time series analysis as a technique for software evolution prediction. Later, several studies used time series analysis for predicting software evolution. The software evolution metrics undertaken for prediction include the monthly number of changes [8, 9], change requests [15, 16], size and complexity [17, 18], defects [19, 20], clones [21], and maintenance effort [22].

Kemerer and Slaughter [9] looked at the evolution of two proprietary systems using two approaches: one based on time series analysis (ARIMA) and the other based on a technique called sequence analysis. They found the ARIMA models inappropriate for analysis as the dataset was largely random in nature. Antoniol et al. [21] presented an approach for monitoring and predicting the evolution of software clones across subsequent versions of a software system (mSQL) using time series analysis (ARIMA).

Caprio et al. [17] used time series analysis to estimate the size and complexity of the Linux kernel. They used an ARIMA model to predict the evolution of the Linux kernel using a dataset covering 68 stable releases of the system.

Herraiz et al. [8] applied a stationary model based on time series analysis to the monthly number of changes in the CVS repository of Eclipse. Their model predicted the number of changes per month for the next three months. They employed kernel smoothing to reduce noise, a lesson learned from the study of Kemerer and Slaughter [9], who could not obtain good results from ARIMA modelling for predicting the number of changes because they ignored the noise present in the data.

Kenmei et al. [16] applied ARIMA to model and forecast change requests per unit of size for large open source projects. Data from three large open source projects, Mozilla, JBoss, and Eclipse, confirm the capability of the approach to effectively perform prediction and identify trends. They report evidence that ARIMA models almost always outperform the predictive accuracy of simple models such as linear regression or random walk [16]. However, the benchmark models selected by Kenmei et al. [16] for evaluating the prediction accuracy of the ARIMA model are not rigorous.

Raja et al. [20] performed a time series analysis of the defect reports of eight open source software projects over a period of five years and found an ARIMA(0, 1, 1) model to be useful for defect prediction. Kläs et al. [19] combined time series analysis with expert opinion to create defect prediction models. They suggest that such a hybrid model is more powerful than pure data models in the early phases of a project's life cycle.

Goulão et al. [15] used a time series technique for long-term prediction of the overall number of change requests. They investigated the suitability of an ARIMA model to predict the long-term fluctuation of all change requests for a project with seasonal patterns, such as Eclipse. They found their seasonal ARIMA model to be statistically significant and to outperform the nonseasonal models.

Amin et al. [23] used ARIMA models in place of software reliability growth models (SRGMs) to predict software reliability. SRGMs make restrictive assumptions about the environment of the software under analysis. They argue that ARIMA modelling is preferable to the SRGM approach because ARIMA is data oriented and overcomes the limitations of the previous approaches.

A perusal of the existing research in this area shows ARIMA modelling to be the most frequently used prediction procedure. However, OSS development is a stochastic process. Unlike traditional development, in which the environment is controlled, OSS development is based on contributions from volunteers who cannot be forced to work even if something is of high priority for the project [1]. Along with this unplanned activity, there is a lack of planned documentation related to requirements and detailed design [8].

Open source projects, without any tight organizational support, face many uncertainties. Uncertainty arises from the uncontrolled development environment, such as the availability of contributors at any point in time. Due to this uncertainty, there are large fluctuations between consecutive values (as observed in the monthly commit data of Eclipse CDT in Figure 1). Most classical time series techniques are inappropriate for the analysis and forecasting of data that involve such uncertainty [8, 9]. The research literature also indicates that ARIMA modelling can be useful when there is uncertainty in the data, but only after applying smoothing to reduce noise [24].

Hong et al. [25] introduced a fuzzy data mining algorithm for quantitative values. The algorithm extracts useful knowledge from transactional databases containing quantitative values by combining the fuzzy set concept with the Apriori algorithm.

Chen et al. [11] extended the work of Hong et al. and applied the fuzzy data mining algorithm to time series data. They proposed an approach in which the concepts of fuzzy sets are used along with the Apriori data mining algorithm to generate linguistic association rules. They use a fuzzy membership function to convert the time series data into fuzzy sets and then apply the Apriori algorithm to generate association rules. They specified the validation of their algorithm and the finding of applications for it as future work. In continuation of their work, we use the algorithm to analyse the commit activity of an open source software project and to find the existing regularity and trend in that activity.

Suresh and Raimond [26] extended the work of Chen et al. [11]. They proposed a new algorithm, called the extended fuzzy frequent pattern algorithm, for analysing time series data, in which the association rules are generated without generating candidate sets.

This paper uses a fuzzy data mining approach on time series data [11] to analyse the commit activity in open source software projects. The study aims to (1) analyse the commit activity of an OSS project to find its trend and regularity and (2) validate the fuzzy data mining algorithm on OSS project data.

3. Methodology

The objective of this empirical study is to investigate the suitability of the fuzzy data mining method for analysing the number of commits per month as a software system evolves, in order to find regularity and trend. This section describes the data collection process, basic concepts of fuzzy time series, and the fuzzy data mining method.

3.1. Data Collection

The development repository of the open source software project (Eclipse CDT) is obtained from GitHub [27]. The repository is downloaded by cloning the original repository onto the local machine using Git Bash [4]. A script written in Java fetches the number of commits per month over the observation period. The descriptive statistics of the development repository are shown in Table 1.
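As an illustrative sketch (not the authors' original script), the monthly commit counts could be obtained by piping the output of "git log --date=short --pretty=format:%ad" into a small Java program that groups the commit dates by year and month; the class and method names below are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

// Reads commit dates (one per line, e.g., "2013-09-23") from standard input,
// as produced by: git log --date=short --pretty=format:%ad
public class CommitsPerMonth {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> commitsPerMonth = new TreeMap<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() < 7) continue;           // skip malformed lines
                String yearMonth = line.substring(0, 7);   // "YYYY-MM"
                commitsPerMonth.merge(yearMonth, 1, Integer::sum);
            }
        }
        // Print the monthly time series in chronological order (TreeMap keeps keys sorted).
        for (Map.Entry<String, Integer> e : commitsPerMonth.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}

The resulting per-month counts can then be arranged into the time series used in the rest of the study.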

Eclipse [28] is an integrated development environment. Eclipse has a base workspace and an extensible plug-in system for customizing the development environment. It is used to develop applications in different languages, such as Java, C/C++, COBOL, and PHP, through the available plug-ins. Two variants, Eclipse SDK and Eclipse CDT, are well known for developing applications. Eclipse SDK targets Java and is used by Java developers for building projects on the Eclipse platform. Eclipse CDT provides C/C++ development tooling [29] and allows developing applications in C/C++ using Eclipse. Eclipse CDT provides various features [30] such as a full-featured editor, debugging, refactoring, a parser, and indexing. In this study we consider Eclipse CDT only. The number of commits for each month from 6/27/2002 to 9/23/2013 is arranged month-wise to form a time series of 136 months, shown in Figure 1.

3.2. Basics of Fuzzy Set Theory and Fuzzy Time Series

The concept of the fuzzy set was introduced by Zadeh [31] in 1965 as an extension of classical set theory. A fuzzy set is characterized by a degree of membership function [31]. The membership function can take various forms, such as triangular and trapezoidal, chosen depending upon the application and requirements. The present study uses two such membership functions (shown in Figure 2) to define the values of the fuzzy variable.
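As an illustration of how such membership functions can be defined, the following minimal sketch fuzzifies a monthly commit count into low, middle, and high linguistic values using trapezoidal functions; the breakpoints (100, 250, 300, and 450 commits) are assumptions based on the activity ranges described in Section 5.1, and all names are hypothetical.

// A minimal sketch of trapezoidal low/middle/high membership functions for
// monthly commit counts; the breakpoints are illustrative assumptions only.
public class CommitFuzzifier {

    // Membership degree of the "low" activity level.
    static double low(double commits) {
        if (commits <= 100) return 1.0;
        if (commits >= 250) return 0.0;
        return (250 - commits) / 150.0;   // linear descent between 100 and 250
    }

    // Membership degree of the "middle" (average) activity level.
    static double middle(double commits) {
        if (commits <= 100 || commits >= 450) return 0.0;
        if (commits < 250) return (commits - 100) / 150.0;   // rising edge
        if (commits <= 300) return 1.0;                       // plateau
        return (450 - commits) / 150.0;                       // falling edge
    }

    // Membership degree of the "high" activity level.
    static double high(double commits) {
        if (commits <= 300) return 0.0;
        if (commits >= 450) return 1.0;
        return (commits - 300) / 150.0;
    }

    public static void main(String[] args) {
        double commits = 280;
        System.out.printf("low=%.2f middle=%.2f high=%.2f%n",
                low(commits), middle(commits), high(commits));
    }
}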

3.3. Apriori Algorithm

The Apriori algorithm [32, 33] is one of the classical algorithms, proposed by R. Agrawal and R. Srikant in 1994, for finding frequent patterns and generating association rules. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

The Apriori algorithm is executed in two steps. First, it retrieves all the frequent itemsets whose support is not smaller than the minimum support (min_sup). This step consists of join and prune actions. In the join step, the candidate set Ck is produced by joining Lk-1 with itself. In the prune step, the candidate sets are pruned by applying the Apriori property: every nonempty subset of a frequent itemset must also be frequent.

The pseudocode for the generation of frequent itemsets is as follows (Ck denotes the candidate itemset of size k and Lk the frequent itemset of size k):

L1 = {frequent 1-itemsets};
for (k = 1; Lk is not empty; k++) {
    Ck+1 = candidates generated by joining Lk with itself;
    Lk+1 = candidates in Ck+1 with support greater than or equal to min_sup;
}
return the union of all Lk;

Next, it uses the frequent itemsets to generate the strong association rules satisfying the minimum confidence (min_conf) threshold. The pseudocode for the generation of strong association rules is as follows:

Input: frequent itemsets L, minimum confidence threshold min_conf
Output: strong association rules
for each frequent itemset l in L
    for each nonempty proper subset s of l
        if confidence(s => (l - s)) >= min_conf
            generate the strong association rule r: s => (l - s);
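As a small illustration of this step (with hypothetical names, not tied to the authors' implementation), the confidence of a candidate rule can be computed from the support values of its itemsets, since confidence(A => B) = support(A and B) / support(A).

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RuleConfidence {

    // Looks up the support value of an itemset (0 if it was never counted).
    static double support(Map<Set<String>, Double> supports, Set<String> itemset) {
        return supports.getOrDefault(itemset, 0.0);
    }

    // confidence(antecedent => consequent) = support(antecedent union consequent) / support(antecedent)
    static double confidence(Map<Set<String>, Double> supports,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> union = new HashSet<>(antecedent);
        union.addAll(consequent);
        double supAntecedent = support(supports, antecedent);
        return supAntecedent == 0.0 ? 0.0 : support(supports, union) / supAntecedent;
    }

    public static void main(String[] args) {
        // Toy support values for two itemsets (illustrative numbers only).
        Map<Set<String>, Double> supports = new HashMap<>();
        supports.put(new HashSet<>(Arrays.asList("p1.middle")), 0.80);
        supports.put(new HashSet<>(Arrays.asList("p1.middle", "p2.middle")), 0.68);
        double conf = confidence(supports,
                new HashSet<>(Arrays.asList("p1.middle")),
                new HashSet<>(Arrays.asList("p2.middle")));
        System.out.printf("confidence = %.2f%n", conf);   // 0.68 / 0.80 = 0.85
    }
}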

3.4. Fuzzy Data Mining Algorithm for Time Series Data

Chen et al. [11] extended the work of Hong et al. [25] and proposed a fuzzy data mining algorithm for time series data. A time series of n data points is entered as input along with a predefined minimum support (min_sup), a minimum confidence (min_conf), and a sliding window of size w.

The input data is first converted into subsequences; each subsequence has at most w elements. The fuzzy membership function is used to convert each data item into its equivalent fuzzy set. The Apriori algorithm is then used to mine frequent fuzzy itemsets. Moreover, a data reduction method is used to remove redundant itemsets.

The association rules are generated in the same way as in the Apriori algorithm. The stepwise process of the fuzzy Apriori algorithm is given below.

Input:

Time series with n data points

Membership function values

Minimum support, min_sup

Minimum confidence, min_conf

Sliding window size, w

Fuzzy sets (linguistic levels such as low, middle, and high)

Output: Set of fuzzy association rules

Step  1. Convert the time series data into subsequences, where each subsequence has a maximum of w elements. Suppose we assume w = 5; then each subsequence has a maximum of 5 elements. The elements of a subsequence are referred to as data variables and are denoted here as p1, p2, p3, p4, and p5, respectively.

Step  2. Apply the fuzzy membership function to the elements of the time series to generate the corresponding fuzzy sets.

Step  3. Based on the membership function and its user defined levels (suppose low, middle, and high), each data variable, after conversion to a fuzzy item, lies in different user defined levels (such as p1.low, p1.middle, and p1.high, referred to as fuzzy items). Calculate the scalar cardinality count of each fuzzy item over the subsequences.

Step  4. Compare the total scalar cardinality count of each fuzzy item (summed over the subsequences) with the minimum support value. The fuzzy items whose support value is greater than or equal to min_sup are kept in L1.

The support value of a fuzzy item is computed as its scalar cardinality count divided by the number of subsequences.

Step  5. If L1 is empty, then exit; else perform the following steps for r = 2 to w.

Step  6. Join Lr-1 with Lr-1 to generate the candidate r-itemsets Cr (similar to the Apriori algorithm, except that items generated from the same data point are not joined, and a join is possible only if (r-2) data items in both sets are the same). (i) Calculate the fuzzy value of each candidate fuzzy itemset using fuzzy set theory, that is, the minimum of the membership values of its items. (ii) Count the scalar cardinality of each candidate fuzzy itemset. (iii) If the count is greater than or equal to the minimum support threshold, then put the itemset in Lr and calculate its support value as in Step 4.

Step  7. If Lr is empty, then exit; otherwise go to Step 6 again.

Step  8. Remove redundant large itemsets (i.e., itemsets whose fuzzy regions, when each data variable is shifted to an earlier position, coincide with those of another large itemset).

Step  9. Generate the association rules using the Apriori rule generation method and calculate the confidence of each rule. The only variation is that, in place of ordinary itemsets, we have fuzzy itemsets, so the concept of fuzzy intersection (minimum) is used in the rules rather than union. If the confidence value is not less than the minimum confidence min_conf, then keep the rule; otherwise reject it.
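A minimal sketch of Steps 1 to 4 is given below, reusing the CommitFuzzifier sketch from Section 3.2; the item naming (p1.low, p1.middle, ...) and the treatment of the minimum support as a scalar cardinality count are assumptions made for illustration only.

import java.util.HashMap;
import java.util.Map;

public class FuzzyCommitMiner {

    // Steps 1-4 of the algorithm in Section 3.4 (illustrative only): slide a
    // window of size w over the series, fuzzify each data variable, accumulate
    // scalar cardinalities, and keep the items that reach the support threshold.
    static Map<String, Double> frequentFuzzyItems(int[] commitsPerMonth, int w, double minSupCount) {
        int numSubsequences = commitsPerMonth.length - w + 1;   // Step 1: sliding windows
        Map<String, Double> cardinality = new HashMap<>();

        for (int start = 0; start < numSubsequences; start++) {
            for (int j = 0; j < w; j++) {                        // data variables p1..pw
                double x = commitsPerMonth[start + j];
                // Steps 2-3: fuzzify each data variable into linguistic fuzzy items
                accumulate(cardinality, "p" + (j + 1) + ".low", CommitFuzzifier.low(x));
                accumulate(cardinality, "p" + (j + 1) + ".middle", CommitFuzzifier.middle(x));
                accumulate(cardinality, "p" + (j + 1) + ".high", CommitFuzzifier.high(x));
            }
        }

        // Step 4: keep the fuzzy items whose scalar cardinality reaches the threshold
        Map<String, Double> l1 = new HashMap<>();
        for (Map.Entry<String, Double> e : cardinality.entrySet()) {
            if (e.getValue() >= minSupCount) {
                l1.put(e.getKey(), e.getValue() / numSubsequences);   // support value
            }
        }
        return l1;
    }

    static void accumulate(Map<String, Double> counts, String item, double degree) {
        counts.merge(item, degree, Double::sum);
    }

    public static void main(String[] args) {
        int[] toySeries = {80, 120, 260, 290, 310, 480, 90, 270};   // hypothetical monthly commits
        System.out.println(frequentFuzzyItems(toySeries, 5, 2.0));
    }
}

The later candidate generation (C2, C3, ...) then proceeds by joining the surviving items as in the Apriori join step, with the minimum operator providing the fuzzy value of each candidate itemset.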

4. Experimental Setup

The experiment is performed on a machine with an x86 Family 6 Model 15 Stepping 6 GenuineIntel processor (~2131 MHz), 1 GB RAM, a 240 GB hard disk, and the Microsoft Windows XP Professional operating system (version 5.1.2600, Service Pack 3, Build 2600). The fuzzy data mining algorithm for time series data is implemented in Java.

5. Results and Analysis

The dataset is divided into two subsets: a training dataset (120 months) and a remaining dataset (16 months). The complete process of generating the association rules using the fuzzy data mining algorithm is performed in two steps: (A) the association rules are generated from the training dataset using the fuzzy data mining for time series data algorithm; (B) the generated association rules are validated using the training and remaining datasets.

5.1. Generation of Association Rules from Training Dataset

The fuzzy data mining algorithm for time series data [11] is applied to the training dataset to generate association rules. We assumed a window size w = 5, along with predefined minimum support (expressed as a fraction of the number of subsequences) and minimum confidence thresholds; the membership function and its values are shown in Figure 2. These assumptions were made by referring to and understanding the concepts of the fuzzy data mining algorithm given in [11]. The commits are divided according to the level of activity: commits in the ranges 0–100, 250–300, and 450 onwards indicate low, middle (average), and high levels of activity, respectively. The level of commit activity indicates the amount of work done.

(i) Generation of Subsequences. We have n = 120 (the number of months in the training set) and w = 5 (the window size). The subsequences are generated by sliding the window over the training series; all generated subsequences are shown in Table 8 of the Appendix.

(ii) Transformation of Data to Fuzzy Sets. In this step, we transform the commit activity data into fuzzy sets (using the membership function) and count the scalar cardinality of each fuzzy item (shown in Table 9 of the Appendix). All these fuzzy items are considered as candidate 1-itemsets (C1). For the generation of L1, only the fuzzy items whose count reaches the minimum support threshold are considered. L1 is found to contain ten fuzzy items.

Further, the support value of each candidate itemset is calculated as its scalar cardinality count divided by the number of subsequences.

(iii) Generation of C2 and L2. For generating C2 (shown in Table 2), join L1 with L1, without joining items generated from the same data point; that is, joining two fuzzy items of the same data variable (for example, p1.low with p1.middle) is not allowed, and similarly for the other data variables. Apart from this, all the joining properties of the Apriori algorithm are used.

Next, use the minimum function to find the fuzzy value of each of the candidate fuzzy itemsets, and count the scalar cardinality of each of the candidate itemsets in C2.

The candidate fuzzy itemsets whose count is not less than the threshold value are kept in L2, and their support values are calculated. L2 now contains the fuzzy itemsets shown in Table 3.

(iv) Generation of C3 and L3. For generating C3, join L2 with L2, again without joining items generated from the same data point. In the case of C3, a join is possible only between those fuzzy itemsets that have at least one data item in common. After the generation of C3, only those fuzzy itemsets whose count is not less than the threshold value are put in L3 (shown in Table 4).

(v) Generation of C4. Join L3 with L3, without joining items generated from the same data point. A join is possible only between those fuzzy itemsets that have at least two data items in common.

After the generation of C4, it is found that no element has a count reaching the threshold value; therefore L4 is empty.

(vi) Removal of Redundant Large Itemsets. Remove the redundant large itemsets using Step 8 of the algorithm described in Section 3.4.

After applying Step 8 of the algorithm, we are left with the large itemsets shown in Table 5.

(vii) Generation of Association Rules. Generate the association rules from these large itemsets using Step 9 of the algorithm described in Section 3.4. All the generated rules are shown in Table 6, where the strong rules whose confidence is not less than the threshold confidence are marked in bold.

In this experiment we use a minimum confidence threshold of 65%; it means that only those rules whose confidence value is not less than 65% are valid (and are called strong association rules). We found 18 such rules in this case; each rule acts as part of the knowledge base for the project manager and developers of the project. For example, one generated rule specifies that if the values of the first and second data points (p1 and p2) lie in the middle range, then there is a high probability that the third data point (p3) also has a middle value. All these rules act as precise and compact knowledge for the project manager and analyst.

5.2. Validation of Association Rules Using Training and Remaining Dataset

(i) Validation on Training Dataset. It is found that 65% of the transactions in the Eclipse CDT training set follow these rules. These rules thus act as compact and concrete knowledge of the training data.

(ii) Validation on Remaining Dataset. All generated association rules are tested on the remaining dataset to check the regularity and trend in the number of commits, that is, whether the commits are performed at the same rate or not. If the rate is the same, then Eclipse CDT has consistent growth; otherwise there is a variation in the growth of the considered software. It is found that, for the remaining set, the applicability of these rules decreases. This is verified again by generating the association rules from the remaining set. The rules generated from the remaining dataset indicate that the data in the remaining dataset lies more towards the lower range of commits performed.
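As a sketch of how this applicability check could be implemented (the names, the 0.5 membership cut-off, and the reuse of the CommitFuzzifier sketch are assumptions), a rule such as "if p1 and p2 are middle then p3 is middle" can be evaluated on the remaining dataset by counting the windows that satisfy both its antecedent and its consequent.

public class RuleValidator {

    // Fraction of windows in the series that satisfy both the antecedent and the
    // consequent of the rule "p1.middle and p2.middle => p3.middle".
    static double ruleApplicability(int[] commitsPerMonth, int w) {
        int windows = commitsPerMonth.length - w + 1;
        int satisfied = 0;
        for (int start = 0; start < windows; start++) {
            boolean antecedent = isMiddle(commitsPerMonth[start])
                    && isMiddle(commitsPerMonth[start + 1]);
            boolean consequent = isMiddle(commitsPerMonth[start + 2]);
            if (antecedent && consequent) {
                satisfied++;
            }
        }
        return windows <= 0 ? 0.0 : (double) satisfied / windows;
    }

    // A data variable is treated as "middle" when its membership degree exceeds 0.5.
    static boolean isMiddle(int commits) {
        return CommitFuzzifier.middle(commits) > 0.5;
    }

    public static void main(String[] args) {
        int[] remaining = {60, 90, 110, 70, 120, 80};   // hypothetical monthly commits
        System.out.printf("rule applicability = %.2f%n", ruleApplicability(remaining, 5));
    }
}

A low applicability fraction on the remaining months, compared with the training months, is the kind of evidence used here to conclude that the commit rate has shifted towards the lower range.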

The following factors explain this variation in the commit rate: (a) Most of the values in the remaining dataset lie towards the low range. This is verified by finding the largest frequent fuzzy itemsets and then generating the association rules from the remaining dataset using the fuzzy data mining for time series data algorithm. The largest frequent itemset found is shown in Table 7. The generated frequent itemsets consist only of data points in the low range; hence most of the commit entries in the remaining dataset are probably low. This indicates that the number of commits from 6/1/2012 to 9/23/2013 is smaller than the number of commits in the training dataset and that there was less development activity on the considered software in this period. (b) Factors such as the number of active users and the number of files changed (through addition, deletion, and modification) are lower in this period.

6. Discussion

The fuzzy data mining algorithm for time series data [11] allows efficient mining of association rules from a large dataset. The generated rules help in finding the regularity and existing trend for OSS projects. We have used the commit activity data of Eclipse CDT. The commits in the repository are directly related to activities such as files or code being added, deleted, or modified. By analysing the trend in the commit data, we can interpret the development or evolution activity of the considered software. In the above experiment, the original dataset is divided into two subsets (the training and remaining datasets).

The association rules generated from the training set allow analysing the regularity and trends in the commits of the OSS project. These rules also help in predicting and analysing the future evolution or development activity of Eclipse CDT. The generated rules are validated on the remaining dataset to test their applicability, and it is found that their applicability on the remaining set decreases. The results of the algorithm indicate that the commit activity of the remaining dataset lies in the low activity range. There may be other factors that cause this decrease in the number of commits, such as a lower number of active users and a lower number of files changed (through addition, deletion, and modification).

7. Threats to Validity

This section discusses the threats to validity of the study.

Construct validity threats concern the relationship between theory and observation. These threats arise mainly from the fact that we assumed all commits are posted through the revision control tool git [4]. Any changes performed in the source code, but not logged through the tool, may not have become part of the study.

Internal validity concerns the selection of subject systems and the analysis methods. This study uses a month as the unit of measure for tracking the types of change activities. In the future, we would like to use a more natural and insightful partition, based on major/minor versions of the OSS project, for analysing the change activity of OSS projects. Subject systems were selected from public repositories, but the selection is biased towards projects with valid git repositories.

External validity concerns the generalization of the findings. In the future, we would like to provide more generalized results by considering a larger number of OSS projects.

Reliability concerns the possibility of replicating the study. The subject systems are available in the public domain, and we have attempted to include all the necessary details of the experimental process in the paper.

8. Conclusion and Future Work

The commit activity data available in the development repository of open source software can be used to analyse the evolution of OSS projects, as each commit is directly related to a development activity such as code addition, deletion, modification, commenting, or file addition. In this study, the Eclipse CDT commit data is analysed to find the regularity and trend in the commit activity. The fuzzy data mining algorithm for time series data is used to generate association rules from the dataset. The dataset is divided into two subsets (training and remaining datasets) to evaluate the pattern of evolution of Eclipse CDT.

After applying and validating the association rules generated from the training dataset, it is found that the rates at which commits are performed in the training dataset and the remaining dataset differ. This is verified again by generating the association rules from the remaining set. The rules generated from the remaining dataset indicate that its data lies more towards the lower range of commits performed, which explains the reduced applicability of the rules generated from the training dataset.

These association rules indicate that the overall commit activity of Eclipse CDT lies towards the middle range, except for the variation found near the end of the observation period, where there is a high probability that commits lie in the lower range. The continuous presence of commits in the Eclipse CDT repository illustrates that the development or evolution of Eclipse CDT is active, with the number of commits per month mostly in the middle range and, towards the end, near the lower range. In the future, we want to combine a prediction algorithm with the fuzzy data mining algorithm for time series data to predict the number of commits to be performed in a particular month, although various other factors on which the number of commits depends would also need to be considered.

Appendix

See Tables 8 and 9.

Competing Interests

The authors declare that they have no competing interests.