#### Abstract

We use a least squares support vector machine (LS-SVM) with a binary decision tree to classify cardiotocograms and determine the fetal state. The parameters of the LS-SVM are optimized by particle swarm optimization. The robustness of the method is examined by running 10-fold cross-validation. The performance of the method is evaluated in terms of overall classification accuracy. Additionally, receiver operating characteristic analysis and cobweb representation are presented in order to analyze and visualize the performance of the method. Experimental results demonstrate that the proposed method achieves an overall classification accuracy of 91.62%.

#### 1. Introduction

There is a growing tendency to use clinical decision support systems in medical diagnosis. These systems help to optimize medical decisions, improve medical treatments, and reduce financial costs [1, 2]. A large number of medical diagnosis procedures can be converted into intelligent data classification tasks. These classification tasks can be categorized as two-class and multiclass tasks. The first type separates the data into only two classes, while the second type involves the classification of the data into more than two classes [3].

Cardiotocography was introduced into obstetrics practice in the early 1970s, and since then it has been used worldwide as a method for antepartum (before delivery) and intrapartum (during delivery) fetal monitoring. A cardiotocogram (CTG) is a recording of two distinct signals, fetal heart rate (FHR) and uterine activity (UA) [4]. It is used for determining the fetal state during both pregnancy and delivery. The aim of CTG monitoring is to identify babies who may be short of oxygen (hypoxic), so that further assessments of fetal condition may be performed or the baby may be delivered by caesarean section or natural birth [5]. The visual evaluation of the CTG not only requires time but also depends on the knowledge and clinical experience of obstetricians.

A clinical decision support system eliminates the inconsistency of visual evaluation. Several classification tools have been proposed for developing such systems [4, 6–10].

One of these tools is the support vector machine (SVM), which is used in [4, 8, 10]. In [4, 8], SVM is used for FHR signal classification with two classes, normal or at risk. The risk of metabolic acidosis for newborns is predicted from the FHR signal in [4], while the classification of the antepartum FHR signal is made in [8]. In [10], a medical decision support system based on SVM and a genetic algorithm (GA) is presented for the evaluation of fetal well-being from CTG recordings as normal or pathologic.

In [6], an approach based on hidden Markov models (HMM) is presented for automatic classification of FHR signal belonging to hypoxic and normal newborns. In [7], an ANBLIR (Artificial Neural Network Based on Logical Interpretation of fuzzy if-then Rules) system is used to evaluate the risk of low-fetal birth weight as normal or abnormal using CTG signals recorded during the pregnancy.

In [9], an adaptive neurofuzzy inference system (ANFIS) is proposed for the prediction of fetal state from the CTG recordings as normal or pathologic.

The SVM was developed for two-class tasks, but classification problems generally require multiclass solutions. Several methods based on the binary decision tree (BDT) have been proposed in the literature to extend binary SVMs to multiclass problems, for example, [11, 12].

LS-SVM is a modified version of SVM in a least squares sense [13]. LS-SVM overcomes the higher computational load of SVM because it solves a set of linear equations, whereas SVM solves a quadratic programming problem.

The choice of an appropriate kernel function and of the model parameters (including kernel parameters) is crucial for SVM-based methods because it directly influences the classification performance. The most common kernel functions used in the literature are polynomial, Gaussian radial basis, exponential radial basis, and sigmoid.

Performance evaluation of classifiers is a fundamental step for determining the best classifier or the best set of parameters for a classifier [14]. In general, the overall classification accuracy is a natural way to measure the performance of the classifiers. The classifier predicts the class for each data point in the data set; if the prediction is correct it is counted as a success and if it is wrong it is counted as an error. The overall classification accuracy is computed as the ratio of the number of successes over the number of the whole data points to be classified.

For many classification problems, especially in medical diagnosis, the overall classification accuracy alone is not adequate because, in general, not all errors have the same consequences. Wrong diagnoses can incur different costs and dangers depending on the kind of mistake made [15]. Therefore, in such situations, receiver operating characteristic (ROC) analysis is usually performed in addition to overall classification accuracy [16].

In this paper, we use LS-SVM utilizing a BDT for classification of the CTG data to determine the fetal state as normal, suspect, or pathologic. Gaussian radial basis function is chosen as the kernel of LS-SVM, and the model parameters, which are the penalty factor and the width of Gaussian kernel, are optimized by using particle swarm optimization (PSO). The robustness of the proposed method LS-SVM-PSO-BDT is examined with 10-fold cross-validation (10-fold CV) on the CTG data set taken from UCI machine learning repository. The performance of the method is evaluated in terms of overall classification accuracy. Additionally, ROC analysis and cobweb representation are presented in order to analyze and visualize the performance of the method.

#### 2. Support Vector Machine (SVM)

SVM is a powerful supervised learning algorithm based on statistical learning theory. It has been used for solving a wide range of data classification problems since it was first introduced by Boser et al. [17]. SVM builds a hyperplane separating the data points into two different classes with a maximum margin.

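Given a training set of $N$ data points $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^n$, and $y_i \in \{-1, +1\}$, where $x_i$ is a data point and $y_i$ is the corresponding class label, SVM requires the solution of the following primal optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i \left( w^T \varphi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N, \tag{1}$$

where $w$ is the normal vector to the hyperplane, $b$ is the bias or offset scalar, $\xi_i$ are the slack parameters which are used to allow soft margins, $C$ is the penalty parameter which controls the trade-off between minimizing the error and maximizing the margin, and $\varphi$ is a nonlinear mapping from the input space to the higher dimensional feature space [4, 8, 13, 17, 18].

The corresponding dual problem of (1) is given by

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, N, \tag{2}$$

where $\alpha_i$ are Lagrange multipliers and the term $K(x_i, x_j)$ is a kernel function representing the inner product of two vectors in the feature space, that is, $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$. The kernel function must satisfy the well-known Mercer's condition. The data points for which $\alpha_i > 0$ are called support vectors, which construct the following decision function [4, 8, 13, 17, 18]:

$$f(x) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right), \tag{3}$$

where $b = -\frac{1}{2} \sum_{i=1}^{N} \alpha_i y_i \left[ K(x_i, x_r) + K(x_i, x_s) \right]$, and $x_r$ and $x_s$ are two arbitrary support vectors from different classes [17].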

#### 3. Least Squares SVM (LS-SVM)

LS-SVM was originally proposed by Suykens and Vandewalle as a modification to the SVM regression formulation [13]. The idea behind the modification is to transform the problem from a quadratic programming problem into a set of linear equations.

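The optimization problem is modified as follows:

$$\min_{w, b, e} \; J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \quad \text{subject to} \quad y_i \left[ w^T \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N, \tag{4}$$

where $\gamma$ and $e_i$ are similar to the penalty parameter and the slack variable of SVM, respectively. Two modifications are visible in (4): the inequality constraints are replaced by equality constraints, and a squared loss function is taken for $e_i$. These modifications significantly simplify the problem [19].

To solve the optimization problem in (4), the Lagrangian function is defined as

$$L(w, b, e; \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[ w^T \varphi(x_i) + b \right] - 1 + e_i \right\}, \tag{5}$$

where $\alpha_i$ are Lagrange multipliers, which can be positive or negative due to the equality constraints. According to the optimality conditions, we get

$$\begin{aligned}
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i), \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0, \\
\frac{\partial L}{\partial e_i} = 0 \;&\Rightarrow\; \alpha_i = \gamma e_i, \\
\frac{\partial L}{\partial \alpha_i} = 0 \;&\Rightarrow\; y_i \left[ w^T \varphi(x_i) + b \right] - 1 + e_i = 0, \quad i = 1, \ldots, N.
\end{aligned} \tag{6}$$

Defining $Z = [\varphi(x_1)^T y_1; \ldots; \varphi(x_N)^T y_N]$, $Y = [y_1; \ldots; y_N]$, $\vec{1} = [1; \ldots; 1]$, $e = [e_1; \ldots; e_N]$, and $\alpha = [\alpha_1; \ldots; \alpha_N]$, and after elimination of $w$ and $e$, a linear Karush-Kuhn-Tucker system is obtained as in (7) [13]:

$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}, \tag{7}$$

where $\Omega = Z Z^T$ and Mercer's condition can be applied to the matrix $\Omega$:

$$\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j). \tag{8}$$

The LS-SVM classifier takes the form in (9), which is similar to the SVM case in (3), and is found by solving the linear set of equations in (7):

$$f(x) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right). \tag{9}$$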
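As a concrete illustration of training an LS-SVM through a linear system rather than a quadratic program, the following minimal Python sketch builds the Karush-Kuhn-Tucker matrix with a Gaussian kernel and solves it for the bias and the Lagrange multipliers. It is not the authors' MATLAB implementation. The function names, the `2 * sigma**2` kernel scaling, and the small Gaussian elimination solver are assumptions made only for this example.

```python
import math

def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel; the 2*sigma**2 scaling is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def lssvm_train(X, y, gamma, sigma):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(X)
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        A[0][i + 1] = y[i]
        A[i + 1][0] = y[i]
        for j in range(n):
            # Omega_ij = y_i * y_j * K(x_i, x_j)
            A[i + 1][j + 1] = y[i] * y[j] * rbf_kernel(X[i], X[j], sigma)
        A[i + 1][i + 1] += 1.0 / gamma
    solution = solve_linear_system(A, [0.0] + [1.0] * n)
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvm_predict(x, X, y, alpha, b, sigma):
    """Sign of the kernel expansion plus bias; labels are +1 or -1."""
    s = sum(a * yi * rbf_kernel(xi, x, sigma) for a, yi, xi in zip(alpha, y, X))
    return 1 if s + b >= 0 else -1
```

For example, `b, alpha = lssvm_train(X, y, gamma=10.0, sigma=1.0)` followed by `lssvm_predict(x_new, X, y, alpha, b, 1.0)` classifies a new point. Because every training point receives a multiplier, the expansion is not sparse, which is a known property of LS-SVM.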

#### 4. Particle Swarm Optimization (PSO)

PSO is a swarm intelligence based optimization method proposed by Kennedy and Eberhart and inspired by the social behavior of bird flocking and fish schooling [20]. In PSO, the procedure begins with an initialization step in which a population (swarm) of possible solutions (particles) is chosen in the search space. The algorithm then searches for the optimum solution by updating the particles over generations.

The particles are updated iteratively by using the following equations:

$$v_i^{k+1} = \omega v_i^{k} + c_1 r_1 \left( p_i - x_i^{k} \right) + c_2 r_2 \left( p_g - x_i^{k} \right), \qquad x_i^{k+1} = x_i^{k} + v_i^{k+1}, \tag{10}$$

where $x_i$ and $v_i$ are the current position and the velocity of the $i$th particle in $d$-dimensional space, $p_g$ and $p_i$ are the best position of the swarm and the best position of the $i$th particle, respectively, and $\omega$ is the inertia weight.

The value of the inertia weight is a trade-off between global search and local search. A bigger value of the inertia weight allows the particles to search new areas in the search space (global search), while a smaller value lets the particles move in the current search area for fine tuning (local search). The cognitive and the social learning factors $c_1$ and $c_2$ are positive constants, and $r_1$ and $r_2$ are random numbers in the range $[0, 1]$ [20, 21].
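The update rule described above can be sketched in a few lines of Python. This is a generic PSO minimizer under stated assumptions, not the authors' code: the function name `pso_minimize`, the default inertia weight `w=0.7`, the learning factors `c1=c2=1.5`, the zero initial velocities, and the clamping of positions to the search bounds are all illustrative choices that do not come from the paper.

```python
import random

def pso_minimize(objective, bounds, n_particles=25, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over a box; w, c1, c2 defaults are assumptions."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialization: random positions inside the bounds, zero velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognitive term + social term.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

For parameter tuning, `objective` would typically map a candidate pair of classifier parameters to a validation error, so that the swarm converges toward parameter values with low error.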

#### 5. Binary Decision Tree (BDT)

A BDT architecture for classification of data sets with $N$ classes requires $N - 1$ classifiers. The architecture is shown in Figure 1. There is a classifier at each node in the tree to make a binary decision.

#### 6. Cross-Validation (CV)

CV is a commonly used statistical method for evaluating and comparing learning algorithms by separating the data set into a training set and a testing set. In CV, the training and testing sets must cross over in successive rounds, so that each data point has a chance of being validated against [22].

The general form of CV is $k$-fold CV, in which the data set is divided into $k$ groups of (almost) equal size and $k$ iterations are made. In each iteration, one of the $k$ groups is used for testing and the remaining $k - 1$ groups are used for training.
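The fold construction can be made concrete with a short Python sketch. It assumes the data points are already in random order and are addressed by index. The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k folds of nearly equal size."""
    # The first (n_samples % k) folds get one extra data point.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0)
                  for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

Each index appears in exactly one test set, so every data point is used once for testing and in the remaining iterations for training.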

#### 7. ROC Analysis

ROC analysis has been used as a standard tool for the design, optimization, and evaluation of two-class classifiers [23]. In ROC analysis with two classes, the notation given in Table 1 is used for the confusion matrix [24].

ROC analysis investigates and employs the relationship between sensitivity and specificity of two-class classifiers while decision threshold varies [25]. Sensitivity is the true positive rate while specificity is the true negative rate, and they are defined as TP/(TP+FN) and TN/(TN+FP), respectively [24].

An ROC curve represents the performance of a classifier in a two-dimensional graph in which, conventionally, the true positive rate is plotted against the false positive rate [25]. Detailed information about ROC analysis can be found in [23–28].
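A minimal Python sketch of these definitions follows. It computes sensitivity and specificity from the four confusion matrix counts and traces ROC points by sweeping the decision threshold over the observed scores. The function names, the 0/1 label encoding, and the rule "score at or above the threshold means positive" are assumptions made for illustration.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def confusion_counts(scores, labels, threshold):
    """Count TP, FN, FP, TN; labels are 1 (positive) or 0 (negative)."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    tn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 0)
    return tp, fn, fp, tn

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) per threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp, fn, fp, tn = confusion_counts(scores, labels, threshold)
        sens, spec = sensitivity_specificity(tp, fn, fp, tn)
        points.append((1.0 - spec, sens))
    return points
```

Lowering the threshold moves the operating point toward the upper right of the graph, trading specificity for sensitivity.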

The extension of ROC analysis to more than two classes has been studied extensively in the literature [15, 23, 27, 29, 30]. For $N$ classes, the confusion matrix is an $N \times N$ matrix whose diagonal entries contain the correct classifications and whose $N^2 - N$ off-diagonal entries contain the possible errors. Therefore, generating ROC curves for visualizing the performance of a classifier becomes difficult as the number of classes increases; for example, a six-dimensional space is required for three classes. Recently, cobweb representation has been used to visualize the performance of classifiers as a multiclass version of ROC analysis [30].

#### 8. Cobweb Representation

The cobweb representation is generated by using the misclassification ratios of the confusion ratio matrix, which is the column-normalized version of the confusion matrix. Let us consider a chance classification with $N$ classes. The confusion ratio matrix has $N(N-1)$ misclassification rates, which are all equal to $1/N$. The misclassification rates of $1/N$ show that, when confronted with a data point from one of the classes, the classifier assigns it to any of the $N$ classes with equal probability. A polygon with $N(N-1)$ equal sides can be formed to map the misclassification rates of the confusion ratio matrix. This polygon (the chance polygon) is used to compare the performance of any classifier with the chance classifier in terms of misclassification rates. Any polygon within the chance polygon shows a better performance than chance performance. For a chance classification with three classes, the misclassification rates are (0.33, 0.33, 0.33, 0.33, 0.33, 0.33), and the chance polygon becomes a hexagon, as given in Figure 2 [30, 31].
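The quantities behind the cobweb plot can be computed with a short Python sketch. It assumes a confusion matrix whose columns are the true classes and whose rows are the predicted classes, so that column normalization gives per-class rates. The function names are hypothetical, and the check "every misclassification rate is below $1/N$" is one simple way to express that a classifier's polygon lies within the chance polygon.

```python
def confusion_ratio_matrix(confusion):
    """Column-normalize a square confusion matrix (each column sums to 1)."""
    n = len(confusion)
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    return [[confusion[r][c] / col_sums[c] for c in range(n)] for r in range(n)]

def misclassification_rates(ratio):
    """Return the N*(N-1) off-diagonal entries, column by column."""
    n = len(ratio)
    return [ratio[r][c] for c in range(n) for r in range(n) if r != c]

def inside_chance_polygon(ratio):
    """True if every misclassification rate is below the chance rate 1/N."""
    n = len(ratio)
    return all(rate < 1.0 / n for rate in misclassification_rates(ratio))
```

The list returned by `misclassification_rates` gives the radial coordinates of the polygon vertices, one vertex per off-diagonal entry.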

#### 9. CTG Data Set

The CTG data set used in this study is taken from UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets/Cardiotocography], (last accessed: June, 2013) and the details can be found in [32]. This data set has 2126 data points from three classes representing the fetal state as normal, suspect, or pathologic. All data points have 21 features, and these features are listed in Table 2.

#### 10. Proposed LS-SVM-PSO-BDT Method

The proposed LS-SVM-PSO-BDT method for fetal state determination is described in this section. Its architecture is given in Figure 3.

The BDT has two nodes because the CTG data has three classes. A Gaussian radial basis function, given in (11), is chosen as the kernel function of the LS-SVMs:

$$K(x, x_i) = \exp \left( -\frac{\| x - x_i \|^2}{2 \sigma^2} \right), \tag{11}$$

where $\sigma$ is the width of the kernel.

The LS-SVM parameters, namely the penalty factor $\gamma$ and the kernel width $\sigma$, are optimized by using PSO.

Training procedure of the method is summarized as the following sequential steps.

*Step 1.* Training data points are put into the root node and divided into two groups, PS (pathologic and suspect) and Nr (normal).

*Step 2.* LS-SVM_1 is trained on the data points in the root node to classify the data points as PS or Nr. Meanwhile, the LS-SVM_1 parameters are optimized by using PSO.

*Step 3.* LS-SVM_2 is trained on the data points in the subnode PS to classify the data points as P (pathologic) or S (suspect). Meanwhile, the LS-SVM_2 parameters are optimized by using PSO.

In the first step, the reason why we combine pathologic and suspect data points in one group instead of combining normal and suspect data points is to minimize the risk of making decisions that cause abnormalities in babies.
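The data flow of the three steps above can be sketched in Python as follows. The helpers prepare the binary targets for the two nodes and route a data point through the tree. The sign conventions (+1 for PS at the root, +1 for pathologic at the subnode), the class name strings, and the function names are assumptions for illustration. The two classifiers are passed in as plain callables, so any binary classifier can be plugged in.

```python
def root_targets(labels):
    """Root node targets: -1 for normal (Nr), +1 for pathologic or suspect (PS)."""
    return [-1 if lab == "normal" else 1 for lab in labels]

def ps_subset(X, labels):
    """Subnode data: only PS points, with +1 for pathologic and -1 for suspect."""
    pairs = [(x, 1 if lab == "pathologic" else -1)
             for x, lab in zip(X, labels) if lab != "normal"]
    return [x for x, _ in pairs], [t for _, t in pairs]

def bdt_predict(x, classify_root, classify_ps):
    """Route x through the two-node tree and return a class name."""
    if classify_root(x) < 0:
        return "normal"
    return "pathologic" if classify_ps(x) > 0 else "suspect"
```

With this layout, a suspect or pathologic case is labeled normal only when the root classifier errs, which matches the motivation for grouping pathologic and suspect points at the first split.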

#### 11. Experimental Results and Discussions

The proposed method LS-SVM-PSO-BDT is used for the classification of the CTG data set which is taken from the UCI Machine Learning Repository.

In order to validate the robustness of the method, a 10-fold CV procedure is performed. The entire data set is randomly divided into ten subsets of approximately equal size while keeping the proportion of data points from different classes in each subset roughly the same as that in the whole data set. In each fold, one subset is left out for testing, and the union of the remaining nine subsets is used for training. Thus, after ten folds, each subset has been used once for testing. The final result is the average of the results of these ten folds.
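One way to obtain such class-balanced subsets is sketched below in Python. The indices of each class are shuffled and then dealt out to the folds in round-robin order, so that class proportions and fold sizes stay nearly equal. The function names, the fixed seed, and the round-robin strategy are assumptions for illustration. The paper does not state how its own split was implemented.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0  # continue the round robin across classes to balance fold sizes
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[(pos + offset) % k].append(idx)
        offset += len(idxs)
    return folds

def train_test_for_fold(folds, test_fold):
    """Use one fold for testing and the union of the others for training."""
    test = folds[test_fold]
    train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
    return train, test
```

The overall accuracy is then the mean of the per-fold test accuracies obtained by calling `train_test_for_fold` for each of the ten folds.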

In the experiment, the parameters for LS-SVM-PSO-BDT are set as follows. Twenty-five particles are used in PSOs. The initial values of 25 particles for the penalty factor and the kernel width are chosen on the intervals , .

The inertia weight, cognitive, and social learning factors of PSOs are chosen as , , and . The codes for the proposed method have been developed in MATLAB [33], without using any toolbox. The classification accuracies for ten folds are reported in Table 3.

The overall classification accuracy of LS-SVM-PSO-BDT, which is the average accuracy over the ten folds, is 91.62%.

Similar works focusing on the classification of CTG data exist in the literature [4, 6–10]. A direct comparison of the methods in these works with the proposed method is not possible because they all address two-class tasks and, additionally, the properties of the CTG data sets used in [4, 6–8] are different. Nevertheless, a comparison based on overall classification accuracy between the proposed method and the methods used in the above-mentioned works is provided in Table 4.

Although the number of classes and the number of data points in the CTG data set used in our work are larger than those in above mentioned works, LS-SVM-PSO-BDT achieves a remarkable classification accuracy rate of 91.62%.

In addition to overall classification accuracy, ROC methodology is used to analyze the performance of the method in more detail. To this end, a confusion matrix of the classification results is created, which is given in Table 5. This table shows the number of correctly and incorrectly classified data points from the CTG data.

In order to visualize the performance of the proposed method, a cobweb representation is presented. The cobweb representation is generated by using the misclassification ratios from the confusion ratio matrix, which is the column-normalized version of the confusion matrix. The confusion ratio matrix of the proposed method is given in Table 6.

Diagonal entries of the confusion ratio matrix show the correct classification ratios while its off-diagonal entries show the misclassification ratios. From Table 6, 96.90% of normal data points, 70.50% of suspect data points, and 76.70% of pathologic data points are correctly classified as normal, suspect, and pathologic, respectively.

Cobweb representation of the proposed method is given in Figure 4. It can be seen from Figure 4 that the misclassification ratios of LS-SVM-PSO-BDT are smaller than those of the chance classifier.

#### 12. Conclusions

In this work, we use LS-SVM utilizing a BDT for classification of the CTG data to determine the fetal state as normal, suspect, or pathologic. Gaussian radial basis function is chosen as the kernel of LS-SVM, and the model parameters, which are the penalty factor and the width of Gaussian kernel, are optimized by using PSO. The robustness of LS-SVM-PSO-BDT is examined by running 10-fold CV. The performance of the proposed method is evaluated in terms of overall classification accuracy. According to empirical results, the proposed LS-SVM-PSO-BDT method achieves a remarkable overall classification accuracy rate of 91.62%.

Additionally, ROC methodology is used to analyze the performance of the method in more detail. The correct classification and misclassification ratios of the method with respect to each individual class are presented: 96.90% of normal data points, 70.50% of suspect data points, and 76.70% of pathologic data points are correctly classified as normal, suspect, and pathologic, respectively. In order to visualize the performance of the method, a cobweb representation is presented. This representation indicates that the misclassification ratios of the proposed method are smaller than those of the chance classifier. Empirical results show that the proposed method can help obstetricians to make more accurate decisions in determining the fetal state.

#### Acknowledgment

The authors would like to thank the UCI Repository of Machine Learning Databases for being a valuable resource: Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.