Journal of Electrical and Computer Engineering

Volume 2016, Article ID 2168478, 5 pages

http://dx.doi.org/10.1155/2016/2168478

## A Searching Method of Candidate Segmentation Point in SPRINT Classification

^{1}Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory, Shijiazhuang, China

^{2}State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China

Received 5 April 2016; Revised 5 August 2016; Accepted 1 September 2016

Academic Editor: Bin-Da Liu

Copyright © 2016 Zhihao Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The SPRINT algorithm is a classical algorithm for building decision trees, a widely used method of data classification. However, SPRINT incurs a high computational cost when it calculates attribute segmentations. In this paper, an improved SPRINT algorithm is proposed that searches for better candidate segmentation points for both discrete and continuous attributes. The experimental results demonstrate that the proposed algorithm reduces the computational cost and improves efficiency by improving the segmentation of continuous and discrete attributes.

#### 1. Introduction

In recent years, with the rapid development of the economy and the continuous improvement of computer technology, a large number of databases have come into use in business management, scientific research, and engineering development. Faced with massive amounts of stored data, finding valuable information is a very difficult task. Data mining helps people extract valuable information from large, incomplete, random, and fuzzy data. Classification is a very important part of data mining. Its purpose is to construct a function or a model by which data can be assigned to one of a set of given categories. The classification model can then be used to forecast data [1, 2]. The prediction model is derived from historical data records and represents the trend of the given data, so that it can be used to forecast future data.

The ID3 algorithm is a significant algorithm for building decision trees [3, 4]. It uses the information gain to select the attribute tested at each node of the tree. However, ID3 is biased toward attributes that take a large number of values. The improved method C4.5 builds on the ID3 algorithm [5, 6] and uses the information gain ratio instead of the information gain to select attributes, which improves the efficiency of decision trees. Many further algorithms based on ID3 have since been proposed, including SLIQ and SPRINT. The SLIQ algorithm [7] can handle the classification of large datasets. The SPRINT algorithm [8–10], which is based on SLIQ, is not restricted by memory, and its processing speed is considerable.

The SPRINT algorithm has many advantages. It is not restricted by memory, and it is a scalable and parallel method of building decision trees. It also has shortcomings. For example, finding the best segmentation point of a discrete attribute requires a large amount of calculation, and the partition of continuous attributes is unreasonable.

To address these issues, this paper proposes a new method of searching for the best segmentation point. For discrete attributes, the new method reduces time complexity by avoiding unnecessary computation. For continuous attributes, discretizing the attribute reduces the depth of the decision tree and improves its classification efficiency.

#### 2. Related Works

The decision tree is one of the most widely used classification models in machine learning applications. Its goal is to extract knowledge from large-scale datasets and represent it in a graphically intuitive way.

The paper [1] presents the Importance Aided Decision Tree (IADT), which takes feature importance as additional domain knowledge for enhancing the performance of learners. A decision tree algorithm finds the most important attribute at each node, so the feature importance mechanism in that paper is relevant domain knowledge for the decision tree algorithm. For the automatic design of decision trees, Barros et al. [2] propose a hyperheuristic evolutionary algorithm that tailors a decision tree algorithm to a specific type of classification dataset. The algorithm evolves the design components of top-down decision tree induction algorithms.

The key idea of the ID3 algorithm is to use the information gain as the reference value for testing attributes, which leads to lower classification accuracy [3]. The authors of [4] therefore proposed a new scheme to overcome this shortcoming of ID3. When selecting the best segmentation attribute, their method uses as a heuristic an improved information gain based on the dependency degree of the condition attributes.

Ersoy et al. [5] proposed an improved C4.5 classification algorithm with a hypothesis generation process. The algorithm adopts -best Multi-Hypothesis Tracker (MHT) to reduce the number of generated hypotheses, especially in high clutter scenarios.

In order to solve the security problems of intrusion detection system (IDS), attack scenarios and patterns should be analyzed and categorized. The enhanced C4.5 [6] is a combination of tree classifiers for solving security risks in the intrusion detection system. The mechanism uses a multiple level hybrid classifier which relies on labeled training data and mixed data. Thus, the IDS system based on C4.5 mechanism can be trained with unlabeled data and is capable of detecting previous attacks.

The SLIQ decision tree produces sharp decision boundaries, which are rarely found in real classification problems. The paper [7] therefore proposes a fuzzy Supervised Learning In Quest (SLIQ) decision tree, in which the authors construct a fuzzy decision boundary instead of a crisp one. To avoid the incomprehensible induction rules of a large and deep decision tree, fuzzy SLIQ constructs a fuzzy binary decision tree, which significantly reduces the tree size.

The SPRINT decision tree algorithm can predict the quality level of system modules, which is useful for software testing [8]. That paper presents an improved SPRINT algorithm to calibrate classification trees. It provides a unique tree-pruning technique based on the minimum description length (MDL) principle. On this basis, SPRINT tree-based software quality classification mechanisms are used to predict whether a software module is fault-prone or not fault-prone.

#### 3. SPRINT Algorithm

##### 3.1. Description of SPRINT Algorithm

The SPRINT algorithm places no limit on the number of input records, and its processing speed is considerable. In the initialization phase, the algorithm creates an attribute list and a corresponding statistics table (histogram) for each attribute of the sample data. The elements of an attribute list are known as attribute records, each of which consists of a record label, an attribute value, and a class. A statistics table describes the class distribution for an attribute: its two rows, C_below and C_above, describe the class distribution of the samples that have already been processed and of those that have not yet been processed, respectively.
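The initialization phase can be illustrated with a minimal Python sketch. It assumes records are held in memory as dictionaries; the function names `build_attribute_lists` and `initial_histogram`, the `class` key, and the toy records are hypothetical and are not taken from the paper. The two histogram rows are named `processed` and `unprocessed` after the samples they count.

```python
from collections import Counter

def build_attribute_lists(records, attributes, class_key="class"):
    """Build one attribute list per attribute.

    Each attribute record is (attribute value, class label, record id).
    Lists of numeric attributes are sorted by value once, up front.
    """
    lists = {}
    for attr in attributes:
        entries = [(rec[attr], rec[class_key], rid)
                   for rid, rec in enumerate(records)]
        if all(isinstance(value, (int, float)) for value, _, _ in entries):
            entries.sort(key=lambda entry: entry[0])
        lists[attr] = entries
    return lists

def initial_histogram(entries):
    """Before the scan starts, no record is processed yet."""
    processed = Counter()
    unprocessed = Counter(label for _, label, _ in entries)
    return processed, unprocessed

# Hypothetical toy data, for illustration only.
records = [
    {"age": 30, "car": "sports", "class": "high"},
    {"age": 23, "car": "family", "class": "high"},
    {"age": 40, "car": "truck", "class": "low"},
]
lists = build_attribute_lists(records, ["age", "car"])
processed, unprocessed = initial_histogram(lists["age"])
```

Keeping the record id in every attribute record is what lets the lists of different attributes be split consistently once a segmentation point has been chosen.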

The steps of the original SPRINT algorithm are as follows:

Maketree (node $N$)

(1) If node $N$ meets the termination conditions, put node $N$ into the queue, labeled as a leaf node, and return.

(2) For each attribute $A$: update the histogram in real time; calculate and evaluate the segmentation index for each candidate segmentation point; and find the best segmentation point.

(3) Find the best segmentation for node $N$ from the best segmentations of all attributes, and use it to divide $N$ into two parts, $N_1$ and $N_2$.

(4) Maketree ($N_1$); Maketree ($N_2$).

The algorithm terminates in three cases. (1) No attribute can be used as a testing attribute. (2) All the training samples at the node belong to the same class, in which case the node is used as a leaf node and labeled with this class. (3) The number of training samples is less than the user-defined threshold.
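The recursion and its three termination cases can be sketched in Python as follows. This is a simplified in-memory illustration, not the attribute-list implementation of SPRINT: it handles numeric attributes only and recounts classes for every candidate. The names `make_tree`, `best_split`, and `min_samples`, the dictionary tree layout, and the toy records are assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, attrs):
    """Return (gini_split, attribute, threshold) of the best split, or None."""
    best = None
    for attr in attrs:
        values = sorted({row[attr] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2  # midpoint between adjacent values
            left = [r["class"] for r in rows if r[attr] <= threshold]
            right = [r["class"] for r in rows if r[attr] > threshold]
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or g < best[0]:
                best = (g, attr, threshold)
    return best

def make_tree(rows, attrs, min_samples=2):
    labels = [row["class"] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Termination: no attribute left, pure node, or too few samples.
    if not attrs or len(set(labels)) == 1 or len(rows) < min_samples:
        return {"leaf": majority}
    split = best_split(rows, attrs)
    if split is None:  # every attribute is constant on this node
        return {"leaf": majority}
    _, attr, threshold = split
    left = [row for row in rows if row[attr] <= threshold]
    right = [row for row in rows if row[attr] > threshold]
    return {"attr": attr, "threshold": threshold,
            "left": make_tree(left, attrs, min_samples),
            "right": make_tree(right, attrs, min_samples)}

# Hypothetical toy data, for illustration only.
rows = [
    {"age": 23, "class": "high"},
    {"age": 30, "class": "high"},
    {"age": 40, "class": "low"},
    {"age": 55, "class": "low"},
]
tree = make_tree(rows, ["age"])
```

On this toy data the only split needed is at age 35.0, after which both children are pure and become leaves.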

##### 3.2. Segmentation of Attributes

The traditional SPRINT algorithm uses the *Gini* index [5] to search for the best segmentation attribute: the segmentation with the minimum *Gini* index is taken as the one that provides the largest information gain.

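For a dataset $S$ containing $n$ classes, *Gini* is defined as

$$Gini(S) = 1 - \sum_{j=1}^{n} p_j^{2},$$

where $p_j$ is the frequency of class $j$ in $S$. If a partition divides the dataset $S$ into two subsets $S_1$ and $S_2$, let $n_1$ and $n_2$ represent the number of records in subsets $S_1$ and $S_2$, respectively. After the segmentation, the *Gini* value is

$$Gini_{split}(S) = \frac{n_1}{n_1 + n_2}\,Gini(S_1) + \frac{n_2}{n_1 + n_2}\,Gini(S_2).$$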

The segmentation of attribute values that provides the least *Gini* value is chosen as the best segmentation [9].
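As a small worked illustration of these two quantities, the following Python sketch computes the *Gini* index of a set of class labels (one minus the sum of squared class frequencies) and the size-weighted *Gini* value of a two-way segmentation. The function names `gini` and `gini_split` and the labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """One minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Gini value after a split: subset Gini values weighted by subset size."""
    n1, n2 = len(left_labels), len(right_labels)
    return (n1 * gini(left_labels) + n2 * gini(right_labels)) / (n1 + n2)

# Hypothetical labels, for illustration only.
mixed = ["high", "high", "low", "low"]
print(gini(mixed))                                       # two balanced classes
print(gini_split(["high", "high"], ["low", "low"]))      # pure subsets
print(gini_split(["high", "low"], ["high", "low"]))      # uninformative split
```

A segmentation that separates the classes perfectly drives the value to 0, while one that leaves both subsets as mixed as the parent keeps it at the parent's value of 0.5.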

For discrete attributes and continuous attributes, the SPRINT algorithm uses different processing methods.

To find the segmentation point of a discrete attribute [7], we assume that the attribute takes $n$ distinct values, which should be divided into two parts. Every subset of the attribute values is considered as a possible partition, and the corresponding *Gini* value is obtained for it. There are on the order of $2^{n}$ possible partitions in total. We need to calculate the *Gini* value of each partition by exhaustive search in order to obtain the best segmentation.
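The exhaustive search over subsets of a discrete attribute can be sketched in Python as below. This is an illustration of the traditional method under stated assumptions, not the improved method of this paper: the function names and toy data are hypothetical, and the sketch fixes one value on the left side so that each two-way partition is evaluated once rather than together with its mirror image, which still leaves an exponential number of candidates.

```python
from collections import Counter
from itertools import combinations

def gini_from_counts(counts):
    """Gini index computed from a class -> count mapping."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_discrete_split(pairs):
    """pairs: list of (attribute value, class label).

    Returns ((gini_split, left_value_set) or None, number of partitions tried).
    """
    per_value = {}
    for value, label in pairs:
        per_value.setdefault(value, Counter())[label] += 1
    values = sorted(per_value)
    total = sum(per_value.values(), Counter())
    n = sum(total.values())
    first, rest = values[0], values[1:]
    best, evaluated = None, 0
    for k in range(len(rest)):  # k < len(rest) keeps the right side nonempty
        for combo in combinations(rest, k):
            left_values = (first,) + combo
            left = sum((per_value[v] for v in left_values), Counter())
            right = total - left
            n1 = sum(left.values())
            g = (n1 * gini_from_counts(left)
                 + (n - n1) * gini_from_counts(right)) / n
            evaluated += 1
            if best is None or g < best[0]:
                best = (g, set(left_values))
    return best, evaluated

# Hypothetical toy data, for illustration only.
pairs = [("sports", "high"), ("sports", "high"),
         ("family", "low"), ("truck", "low"), ("family", "low")]
best, evaluated = best_discrete_split(pairs)
```

With three distinct values the sketch evaluates three partitions; each additional value roughly doubles that count, which is the cost that motivates searching for better candidate segmentation points.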

For a continuous attribute, a split can only occur between two values. The values of the continuous attribute are first sorted, and the candidate segmentation points are the midpoints between each pair of adjacent values.

The sorted values are then scanned once, and the statistics table is updated each time a record is read. The statistics table contains all the information needed to calculate the *Gini* index. The *Gini* index is calculated at each candidate point, and the segmentation point with the minimum *Gini* value is selected.
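The single scan over a sorted continuous attribute can be sketched in Python as follows. The sketch assumes the attribute fits in memory as a list of (value, class) pairs; `best_continuous_split`, the counters `processed` and `unprocessed` (standing in for the two rows of the statistics table), and the toy data are hypothetical names chosen for this example.

```python
from collections import Counter

def gini_from_counts(counts):
    """Gini index computed from a class -> count mapping."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(pairs):
    """pairs: list of (numeric value, class label).

    Returns (gini_split, split_point) for the best midpoint, or None
    if all values are equal.
    """
    ordered = sorted(pairs, key=lambda pair: pair[0])
    n = len(ordered)
    processed = Counter()                                   # records already read
    unprocessed = Counter(label for _, label in ordered)    # records still ahead
    best = None
    for i in range(n - 1):
        value, label = ordered[i]
        # Update the statistics table as the record is read.
        processed[label] += 1
        unprocessed[label] -= 1
        next_value = ordered[i + 1][0]
        if next_value == value:
            continue  # a split cannot fall between two equal values
        n1 = i + 1
        g = (n1 * gini_from_counts(processed)
             + (n - n1) * gini_from_counts(unprocessed)) / n
        if best is None or g < best[0]:
            best = (g, (value + next_value) / 2)
    return best

# Hypothetical toy data, for illustration only.
ages = [(23, "high"), (30, "high"), (40, "low"), (55, "low"), (17, "high")]
best = best_continuous_split(ages)
```

Because the two counters are updated incrementally, each candidate point costs only a pass over the class counts, not a pass over the records.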

Although the traditional method can find the best segmentation point, it has to traverse all possible segmentations of a discrete attribute [8], which gives the algorithm a high time complexity. For continuous attributes, dividing the values into two consecutive parts cannot, in most cases, reflect the distribution of the attribute values.

#### 4. Improved SPRINT

##### 4.1. Segmentation of Discrete Attribute

Taking the credit risk of a bank as an example, the data records are shown in Table 1.