Scientifica

Volume 2016, Article ID 4273813, 6 pages

http://dx.doi.org/10.1155/2016/4273813

## Evaluation of Modified Categorical Data Fuzzy Clustering Algorithm on the Wisconsin Breast Cancer Dataset

Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, P.O. Box 344, Rabigh 21911, Saudi Arabia

Received 9 December 2015; Revised 31 January 2016; Accepted 1 February 2016

Academic Editor: Dick de Ridder

Copyright © 2016 Amir Ahmad. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The early diagnosis of breast cancer is an important step in the fight against the disease. Machine learning techniques have shown promise in improving our understanding of the disease. As medical datasets often contain data points which cannot be precisely assigned to a class, fuzzy methods have been useful for studying these datasets. Sometimes breast cancer datasets are described by categorical features. Many fuzzy clustering algorithms have been developed for categorical datasets. However, most of these methods use Hamming distance to define the distance between two categorical feature values. In this paper, we use a probabilistic distance measure to compute the distance between a pair of categorical feature values. Experiments demonstrate that this distance measure performs better than Hamming distance on the Wisconsin breast cancer data.

#### 1. Introduction

Breast cancer is the most common form of cancer amongst women [1]. Early and accurate detection of breast cancer is the key to the long survival of patients [1]. Machine learning techniques are being used to improve diagnostic capability for breast cancer [2–4]. The Wisconsin breast cancer dataset has been a popular dataset in the machine learning community [5]. Various classification techniques, such as decision trees [6], support vector machines [7], and fuzzy-genetic algorithms [8], have been used to study this dataset. In medical datasets, it is sometimes difficult to assign a data point to exactly one of the groups. Fuzzy methods are better equipped to handle these kinds of datasets [9–11].

Clustering divides the data points into different groups (clusters) depending upon a similarity measure [12]. The data points in a group (cluster) are similar whereas data points in different groups (clusters) are dissimilar. Clustering algorithms can be divided into two groups [12, 13]: hard clustering algorithms and fuzzy clustering algorithms. In hard clustering, a data point belongs to exactly one cluster. In fuzzy clustering, however, a data point has a degree of membership in every cluster.

The k-means algorithm [14] is a very popular hard clustering algorithm because of its linear complexity. It is an iterative algorithm which computes the mean of each feature of the data points present in a cluster. This makes the algorithm inappropriate for datasets that have categorical features. Huang [15] extends the k-means algorithm to datasets having categorical features. Instead of the mean, the mode is used to represent a cluster, and Hamming distance is used to calculate the membership of a data point. In Hamming distance, if the feature values are the same for two data points the distance is taken as 0; otherwise the distance is taken as 1.
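The mode-based representation and Hamming distance described above can be illustrated with a minimal sketch. The function names `hamming_distance` and `cluster_mode` are hypothetical and chosen only for this illustration, and ties for the mode are assumed to be broken by first occurrence.

```python
from collections import Counter


def hamming_distance(a, b):
    """Count the features on which two categorical data points differ."""
    if len(a) != len(b):
        raise ValueError("data points must have the same number of features")
    # Each feature contributes 0 if the values match and 1 otherwise.
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(points):
    """Return the per-feature mode of a non-empty cluster.

    Assumption: ties are broken by first occurrence in the cluster.
    """
    n_features = len(points[0])
    return tuple(
        Counter(p[f] for p in points).most_common(1)[0][0]
        for f in range(n_features)
    )


cluster = [("red", "small"), ("red", "large"), ("blue", "large")]
centre = cluster_mode(cluster)            # ("red", "large")
print(centre, [hamming_distance(p, centre) for p in cluster])
```

Every mismatch costs exactly 1 here, whichever two values are involved.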

Hierarchical clustering algorithms [12] can be applied to categorical datasets; however, they have high computational complexity. This makes them less useful for large datasets.

Fuzzy clustering has shown great promise in understanding medical datasets [10, 11]. It has been shown that fuzzy clustering can be used to improve the classification performance of various classifiers for the diagnosis of breast cancer [16]. Fuzzy c-means (FCM) [17, 18] is one of the most popular clustering techniques. The original FCM clustering technique can only handle numeric features. Using the methodology of FCM, the fuzzy k-modes algorithm [19] was proposed for categorical datasets. This method uses Hamming distance and hard cluster centres. Kim et al. [20] propose a fuzzy clustering algorithm that uses fuzzy cluster centres. This algorithm performs better than the fuzzy k-modes algorithm [20].

Most fuzzy clustering algorithms for categorical datasets use Hamming distance. However, Lee and Pedrycz [21] show that a simple matching similarity such as Hamming distance cannot capture the correct similarities among categorical feature values; hence an appropriate distance measure should be used to improve the performance of a fuzzy clustering algorithm with fuzzy cluster centres.

Various dissimilarity measures have been proposed for categorical feature values [23]. Ahmad and Dey [22] present a dissimilarity measure for categorical features and show that the k-modes clustering algorithm can be improved with this dissimilarity measure. Ahmad and Dey [24] use this distance measure to propose a clustering algorithm for datasets having numerical and categorical features. Ahmad and Dey [25] also suggest a subspace clustering algorithm with this dissimilarity measure. Motivated by the success of the dissimilarity measure for clustering categorical data, Ji et al. [26] use the distance measure for fuzzy clustering of mixed datasets. Ahmad and Dey [27] present a fuzzy clustering method whose distance measure is recalculated in each iteration.

The Wisconsin breast cancer dataset has been studied extensively in the machine learning field [25, 30–32]. Each feature of the Wisconsin breast cancer dataset has ten categories (1 to 10). It has been a popular dataset for analysing clustering algorithms for categorical datasets [25, 30–32]. In this paper, we apply the clustering algorithm proposed by Ji et al. [26] to the Wisconsin breast cancer dataset. In this way we show the applicability of the distance measure proposed by Ahmad and Dey [22] to the analysis of a categorical breast cancer dataset.

This paper is organized as follows. Section 2 discusses the fuzzy c-means clustering algorithm. Section 3 reviews the method that computes the distance between two categorical feature values. Section 4 discusses the method to compute the fuzzy centroid for categorical datasets and the distance between a data point and a cluster centre [26]. Experimental results are presented in Section 5. Section 6 presents the conclusion and future work.

#### 2. Fuzzy c-Means Clustering Algorithm

Fuzzy c-means (FCM) [17, 18] is a popular clustering algorithm. In this section, we will discuss FCM.

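The following information is given: (i) a dataset $X$ with $n$ data points; (ii) each data point is defined by $m$ features; (iii) the desired number of clusters is $c$; (iv) a fuzzy membership matrix $U = [u_{ij}]$, where $u_{ij}$ is the membership of the $i$th data point in the $j$th cluster.

FCM clusters a set of data points, $X = \{x_1, x_2, \ldots, x_n\}$, into $c$ clusters, where $x_i$ is the $i$th data point for $i = 1, 2, \ldots, n$.

FCM computes the cluster centres $V = \{v_1, v_2, \ldots, v_c\}$, where $v_j$ is the centre of the $j$th cluster, and the fuzzy membership matrix $U$. This is done by iteratively minimizing the objective function $J$ given below ($\alpha > 1$ is a user-defined real number which controls the fuzziness):

$$J = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{\alpha} \, d_{ij}^{2}, \tag{1}$$

where $d_{ij}$ is the distance between data point $x_i$ and cluster centre $v_j$.

For numeric data, $v_j$ and $u_{ij}$ are computed as follows:

$$v_j = \frac{\sum_{i=1}^{n} u_{ij}^{\alpha} \, x_i}{\sum_{i=1}^{n} u_{ij}^{\alpha}}, \tag{2}$$

$$u_{ij} = \frac{1}{\sum_{l=1}^{c} \left( d_{ij} / d_{il} \right)^{2/(\alpha - 1)}}. \tag{3}$$

The steps of the FCM-based algorithm are as follows.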

*Step 1.* Select a stopping value $\varepsilon$. Initialize the fuzzy membership matrix $U$ with random numbers in the interval $[0, 1]$.

*do*

*Step 2.* Compute cluster centres.

*Step 3.* Compute distances from centres and use these distances for updating the fuzzy membership matrix $U$.

*Step 4.* Calculate the objective function $J$.

*while* (the difference between two subsequent computed values of $J$ is more than the given stopping value $\varepsilon$).
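The four steps above can be sketched for numeric data in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes squared Euclidean distance, the standard FCM update rules, an initial membership matrix whose rows are normalised to sum to 1, and a hypothetical `max_iter` safety cap. The names `fuzzy_c_means`, `alpha`, `eps`, and `seed` are chosen for this sketch.

```python
import random


def fuzzy_c_means(points, c, alpha=2.0, eps=1e-5, max_iter=100, seed=0):
    """Sketch of FCM for numeric data; returns (memberships, centres, objective)."""
    rng = random.Random(seed)
    n, m = len(points), len(points[0])

    # Step 1: random memberships in [0, 1], each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-12 for _ in range(c)]
        total = sum(row)
        u.append([r / total for r in row])

    prev_j = float("inf")
    for _ in range(max_iter):
        # Step 2: each centre is the mean of the points weighted by u^alpha.
        centres = []
        for j in range(c):
            w = [u[i][j] ** alpha for i in range(n)]
            total = sum(w)
            centres.append(
                [sum(w[i] * points[i][f] for i in range(n)) / total
                 for f in range(m)]
            )

        # Step 3: squared Euclidean distances, then membership update.
        d2 = [[sum((p[f] - v[f]) ** 2 for f in range(m)) for v in centres]
              for p in points]
        for i in range(n):
            zeros = [j for j in range(c) if d2[i][j] == 0.0]
            if zeros:
                # A point sitting exactly on a centre belongs fully to it.
                u[i] = [1.0 / len(zeros) if j in zeros else 0.0
                        for j in range(c)]
            else:
                inv = [d ** (-1.0 / (alpha - 1.0)) for d in d2[i]]
                total = sum(inv)
                u[i] = [x / total for x in inv]

        # Step 4: objective function; stop when it changes by at most eps.
        j_val = sum(u[i][j] ** alpha * d2[i][j]
                    for i in range(n) for j in range(c))
        if abs(prev_j - j_val) <= eps:
            break
        prev_j = j_val

    return u, centres, j_val
```

A hard assignment can be read off afterwards by taking, for each data point, the cluster with the largest membership.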

#### 3. The Distance between Two Categorical Feature Values

Ahmad and Dey [22] propose an algorithm to calculate the distance between two categorical feature values in an unsupervised framework. Unlike Hamming distance, this distance measure is not restricted to the binary values 0 and 1. The distance is calculated from the co-occurrence of the two feature values (for which the distance is calculated) with the feature values of other features.

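The distance between categorical feature values $x$ and $y$ of feature $A_i$ against the feature $A_j$, for a subset $w$ of the feature values of $A_j$, is defined as follows:

$$\delta_{w}^{ij}(x, y) = P(w \mid x) + P(\sim w \mid y), \tag{4}$$

where $P(w \mid x)$ is the conditional probability that a data point having value $x$ for $A_i$ takes a value of $A_j$ that belongs to $w$, and $\sim w$ is the complement of $w$.

The distance between the feature values $x$ and $y$ of $A_i$ against feature $A_j$ is denoted by $\delta^{ij}(x, y)$ and is defined by $\delta^{ij}(x, y) = P(W \mid x) + P(\sim W \mid y) - 1$, where $W$ is the subset of feature values of $A_j$ that maximizes the quantity $P(w \mid x) + P(\sim w \mid y)$. To compute the distance between $x$ and $y$, we compute the distances between $x$ and $y$ against every other feature. The average distance is taken as the distance, $\delta(x, y)$, between $x$ and $y$ in the dataset. Distances between every pair of feature values are employed to calculate the distance between a data point and a cluster centre.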
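A minimal sketch of this co-occurrence-based distance is given below. It assumes, following the description of [22], that the distance against another feature is the maximum over subsets $w$ of $P(w \mid x) + P(\sim w \mid y)$, shifted by $-1$ so that it lies in $[0, 1]$. The maximizing subset collects exactly those values that are at least as likely given $x$ as given $y$, so the maximum equals the sum of the larger of the two conditional probabilities over all values. The function names are hypothetical.

```python
from collections import Counter


def conditional_probs(data, i, j, value):
    """P(A_j = u | A_i = value) for every value u of feature j seen with it."""
    rows = [r for r in data if r[i] == value]
    if not rows:
        raise ValueError(f"value {value!r} does not occur in feature {i}")
    counts = Counter(r[j] for r in rows)
    return {u: cnt / len(rows) for u, cnt in counts.items()}


def distance_against(data, i, j, x, y):
    """Distance between values x and y of feature i against feature j."""
    px = conditional_probs(data, i, j, x)
    py = conditional_probs(data, i, j, y)
    # Put u in W when P(u|x) >= P(u|y); the maximised quantity is then
    # the sum over u of max(P(u|x), P(u|y)). Subtract 1 to map into [0, 1].
    best = sum(max(px.get(u, 0.0), py.get(u, 0.0)) for u in set(px) | set(py))
    return best - 1.0


def value_distance(data, i, x, y):
    """Average of the distances between x and y against every other feature."""
    others = [j for j in range(len(data[0])) if j != i]
    return sum(distance_against(data, i, j, x, y) for j in others) / len(others)


data = [("a", "p"), ("a", "p"), ("b", "q"), ("b", "q"), ("c", "p"), ("c", "q")]
print(value_distance(data, 0, "a", "b"))  # values never co-occur alike
print(value_distance(data, 0, "a", "c"))  # values partly overlap
```

In this toy dataset, "a" and "b" never share a co-occurring value and are maximally distant, whereas "a" and "c" overlap half of the time and are closer. Hamming distance would treat both pairs identically.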

#### 4. Modified Centre and the Distance from the Modified Centre

For categorical datasets, the mode is used to calculate the centre of clusters [19]. However, a single feature value does not represent a cluster centre well; hence loss of information takes place. Ji et al. [26] use the fuzzy centroid [20] concept with the distance measure suggested by Ahmad and Dey [22] for fuzzy clustering of categorical datasets.

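The fuzzy centroid for a cluster, $C_j$, for a categorical dataset with $m$ features is defined as

$$v_j = [v_{j1}, v_{j2}, \ldots, v_{jm}]. \tag{5}$$

Assume that the $l$th feature has $t$ different values, $a_l^1, a_l^2, \ldots, a_l^t$.

Thus,

$$v_{jl} = \{(a_l^1, w_{jl}^1), (a_l^2, w_{jl}^2), \ldots, (a_l^t, w_{jl}^t)\}, \tag{6}$$

where $w_{jl}^k$ is the association of value $a_l^k$ (the $k$th feature value for the $l$th feature) with cluster $C_j$:

$$w_{jl}^k = \frac{\sum_{i=1}^{n} u_{ij}^{\alpha} \, \gamma(x_{il})}{\sum_{i=1}^{n} u_{ij}^{\alpha}}, \tag{7}$$

where $u_{ij}$ is the fuzzy membership of data point $x_i$ in cluster $C_j$, $\alpha$ is the fuzziness exponent, $\gamma(x_{il}) = 1$ for a data point whose $l$th feature value is $a_l^k$, and $\gamma(x_{il}) = 0$ otherwise.

The distance between a data point having categorical feature value $x$ in the $l$th dimension and the centre of cluster $C_j$ is defined as

$$\vartheta(x, v_{jl}) = \sum_{k=1}^{t} w_{jl}^k \, \delta(x, a_l^k), \tag{8}$$

where $a_l^k$ is the $k$th feature value of the $l$th categorical feature.

$\delta(x, a_l^k)$ is calculated by the method discussed in Section 3. For a dataset having $m$ features, the distance is calculated for each feature value of the data point, and the summation of these distances is the distance between the data point and the centre. In FCM, the distances between data points and cluster centres are used to calculate the fuzzy membership matrix. Hence, this distance measure will be employed to compute the fuzzy membership matrix.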
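The fuzzy centroid and the distance to it can be sketched as follows. This is an illustration under assumptions rather than the exact formulation of [26]: the association of a feature value with a cluster is taken to be the membership-weighted frequency of that value (memberships raised to a fuzziness exponent `alpha`), normalised so that the weights of each feature sum to 1. The pairwise value distance is passed in as a function `delta(l, x, a)`, so either Hamming distance or a co-occurrence-based distance can be plugged in. All names are hypothetical.

```python
def fuzzy_centroid(data, memberships, alpha=2.0):
    """Fuzzy centroid of ONE cluster.

    memberships[i] is the membership of data[i] in that cluster. The result
    is one dict per feature, mapping each feature value to its association
    weight (assumption: weights are normalised to sum to 1 per feature).
    """
    weights = [mu ** alpha for mu in memberships]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("cluster has zero total membership")
    centroid = []
    for l in range(len(data[0])):
        assoc = {}
        for row, w in zip(data, weights):
            assoc[row[l]] = assoc.get(row[l], 0.0) + w
        centroid.append({a: w / total for a, w in assoc.items()})
    return centroid


def distance_to_centroid(point, centroid, delta):
    """Sum over features of the weighted distances to the centroid's values."""
    return sum(
        w * delta(l, x, a)
        for l, (x, feature_weights) in enumerate(zip(point, centroid))
        for a, w in feature_weights.items()
    )


def hamming_delta(l, x, a):
    """Simple matching distance between two values of feature l."""
    return 0.0 if x == a else 1.0


data = [("a", "p"), ("a", "q"), ("b", "q")]
centre = fuzzy_centroid(data, memberships=[1.0, 1.0, 0.0])
print(centre)
print(distance_to_centroid(("a", "p"), centre, hamming_delta))
```

Unlike a mode, this centroid keeps both "p" and "q" for the second feature, each with weight 0.5, so the distance reflects how the cluster is actually distributed over the feature values.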

The cluster centre definition and the distances between cluster centres and data points discussed in this section can be used with the FCM algorithm discussed in Section 2 to create a fuzzy clustering algorithm for categorical datasets [26]. The steps of the fuzzy clustering algorithm for categorical data are as follows.

*Step 1.* Select a stopping value $\varepsilon$. Initialize the fuzzy membership matrix $U$ with random numbers in the interval $[0, 1]$.

*do*

*Step 2.* Compute cluster centres by using (5).

*Step 3.* Compute distances from centres by using (8). Either Hamming distance or the distance measure discussed in Section 3 is used in this step. Use these distances for updating the fuzzy membership matrix $U$.

*Step 4.* Calculate the objective function $J$.

*while* (the difference between two subsequent computed values of $J$ is more than the given stopping value $\varepsilon$).

#### 5. Results and Discussion

The experiments were carried out on the Wisconsin breast cancer data. This dataset has 699 data points. Each data point is represented by 9 features. Sixteen data points have missing values. Missing feature values were replaced by the mode of that feature. The information about these features is given in Table 1. There are two groups in this dataset: benign and malignant. The benign group has 458 data points whereas the malignant group has 241 data points. Each feature has ten categories (1–10). We ran fuzzy clustering with fuzzy centroids using both Hamming distance and the distance measure proposed by Ahmad and Dey [22] to see how the incorporation of the distance measure affects the quality of the clustering. *Clustering error* = the number of data points not in desired clusters/the number of data points.
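The clustering error defined above can be computed as in the following sketch. It assumes that fuzzy memberships have already been hardened (for example, by assigning each data point to the cluster with its largest membership) and that the "desired" cluster of a class is identified by majority vote within each cluster. Both choices are assumptions of this illustration, and the function name is hypothetical.

```python
from collections import Counter


def clustering_error(cluster_labels, true_labels):
    """Fraction of data points that are not in their desired cluster.

    Assumption: each cluster is matched to the class that is most frequent
    among its members; every other member counts as an error.
    """
    if len(cluster_labels) != len(true_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    correct = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return (len(true_labels) - correct) / len(true_labels)


clusters = [0, 0, 0, 1, 1, 1]
classes = ["benign", "benign", "malignant", "malignant", "malignant", "malignant"]
print(clustering_error(clusters, classes))  # one of six points is misplaced
```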