Hybrid Mammogram Classification Using Rough Set and Fuzzy Classifier
We propose a computer aided detection (CAD) system for the detection and classification of suspicious regions in mammographic images. This system combines a dimensionality reduction module (using principal component analysis), a feature extraction module (using independent component analysis), and a feature subset selection module (using rough set model). Rough set model is used to reduce the effect of data inconsistency while a fuzzy classifier is integrated into the system to label subimages into normal or abnormal regions. The experimental results show that this system has an accuracy of 84.03% and a recall percentage of 87.28%.
Breast cancer is the most common cancer among women worldwide. The National Cancer Institute estimates that 192,370 female and 1,910 male new cases of breast cancer will appear in the United States in 2009, and that 40,170 females and 440 males will die of this cancer. Early detection of this disease remains the best known method for reducing its mortality, and mammography remains one of the best modalities used by radiologists for early detection of cancerous tumors before clinical symptoms appear. Unfortunately, the growing demand for mammograms is constrained by an insufficient number of radiologists. A CAD system can assist radiologists in differentiating between normal and suspicious regions, thus reducing the number of unnecessary biopsies and the radiologists' false-positive (FP) rate, where an FP is an erroneous positive diagnosis when the breast is normal.
Several rough set-based and fuzzy-based methods have been proposed in the literature for breast cancer detection. Hassanien and Ali proposed a rough set technique for feature reduction and classification-rule generation from mammographic images. Hu et al. proposed a rough set model (RSM) based on relational algebra that replaces the traditional rough set models. Their proposed algorithm is very efficient on large data sets and may be adaptable for real-time applications. Şahan et al. proposed a hybrid machine learning algorithm that combines the k-nearest neighbor algorithm with a fuzzy-artificial immune method, where a 10-fold cross-validation criterion was used to compute the algorithm's accuracy. Hassanien proposed a hybrid method that first uses fuzzy logic to enhance image contrast, extracts the region of interest, and enhances its edges. Then, the gray-level cooccurrence matrix is used as a feature extraction method. RSM is used for further subset selection, rule generation, and classification. RSM can also be used as a feature selection algorithm [7–10], while fuzzy logic can serve as a classifier [11–13].
In , an algorithm was proposed that combined PCA, ICA, and fuzzy classifier for breast cancer detection. Membership functions of fuzzy sets were generated from the product space of the selected features. Also, the selected features from PCA-ICA phase suffered from data inconsistency which degraded the fuzzy classifier performance. In this work, an integration of PCA, ICA, Rough Set, and fuzzy classifier to identify and label suspicious regions from digitized mammograms is developed. Results of this system showed a higher efficiency in detecting suspicious regions and reducing false-negative (FN) rates in comparison with the results of  where FN is an erroneous negative diagnosis but the breast tissue has cancer. This work presents a new approach since the mapping range is integrated into the rough set model as opposed to being part of a fuzzy classifier as was the case with . The RSM is integrated into the proposed system as a feature subset selection method in order to reduce the impact of data inconsistency. Finally, the membership functions of the fuzzy sets are based on the mean and standard deviation of the testing data.
A previously proposed algorithm combined ICA with RSM for breast cancer detection, using ICA for both feature extraction and feature reduction. In this work, PCA is used for feature reduction instead, for two reasons. First, PCA is superior to ICA in dimensionality reduction, which enhances the ICA performance. Second, it is recommended to preprocess the data through whitening prior to ICA in order to reduce the complexity of the problem, and PCA is a natural choice because whitening is an intrinsic step in PCA.
The novelty of this work is the integration of RSM for feature selection with a fuzzy classifier as well as generating the framework for the integration of the PCA, ICA, RSM, and fuzzy classifier for breast cancer detection. The rest of this paper is organized as follows. Section 2 presents a brief introduction to PCA, ICA, and RSM. Section 3 presents fuzzy logic adaptation while the proposed approach is presented in Section 4. Experimental results are presented in Section 5 followed by Conclusions in Section 6.
PCA is an orthogonal transform and a decorrelation technique that captures maximum variance. The correlation between components of a vector is used to measure data redundancy. This means that most of the information contained in the original vector can be represented by a much smaller vector after the PCA stage. In this paper, PCA is used as a dimensionality and noise reduction module. This step ensures that the source components of a vector are uncorrelated.
ICA is a statistical technique that can be used to extract hidden features within a set of data.
A mammographic image $X$ can be expressed as a linear mixture of a set of features or basis functions $s_i$ as shown in (1):

$$X = \sum_{i=1}^{N} a_i s_i \qquad (1)$$

where $a_i$ are stochastic coefficients that are data dependent. Other transforms such as Wavelets and Gabor assume basis vectors that are independent of the data, while ICA assumes basis vectors that are customized to the data under consideration. Using matrix notation, (1) can be expressed as shown in (2):

$$X = AS \qquad (2)$$

where $S$ is a matrix that contains the source components and $A$ is the mixing matrix. This means that a mammographic image consists of a mixture of source components $s_i$. Their combination can be described using the coefficients of the mixing matrix $A$, which can be used as extracted features that efficiently describe any normal or suspicious region.

The ICA algorithm estimates the separating matrix $W$ (the inverse of $A$) that makes the source components as statistically independent as possible with a non-Gaussian (super- or sub-Gaussian) distribution, which results in obtaining the independent components $U$ as shown in (3). This means that $A$ should be a square matrix, which can be achieved by PCA preprocessing:

$$U = WX = WAS \qquad (3)$$

The ICA algorithm can be presented as an optimization process in which an objective function is modeled to minimize the statistical dependency between the source components. The statistical estimation of the separating matrix and the source components is a result of this optimization process. The dependency between the source components can be minimized using several suggested methods, such as minimizing the mutual information of the component representation, maximizing their likelihood, or maximizing their non-Gaussianity [19, 20].
Rough set theory can be used as a feature subset selection algorithm. RSM determines and removes the dispensable attributes representing the redundant information within the data while it aims to keep the core attributes representing the minimum essential information.
By relaxing the core algorithm, more attributes can be selected which are called Reduct. In this paper, Reduct attributes are considered as the minimum selected features. The selected Reduct should have the same discernibility and representation power as the original data.
Cardinality is used to replace traditional rough set theory operations. Therefore, algorithm efficiency will be improved with reduced complexity. The cardinality of a set is defined as the number of elements in the set. For example, Table 1 shows three selected features for 8 images (symbols are used instead of pixel values for simplicity). The decision is either normal () or suspicious () image. The cardinality of Table 1 is
where I = Feature 1, Feature 2, Feature 3, Decision. Core attributes should be in every Reduct to ensure correct classification. Therefore, removing any core attribute affects the classifier accuracy. Hu et al.  defined the core attributes by (5) as
where is the decision matrix , is the condition attributes (selected features), is the decision attribute (normal or suspicious image), and is the current attribute to be classified as a core or not. The merit value of an attribute or the significance of the attribute is calculated using (6) which is a measure of the degree of dependency for an attribute on the condition and decision attributes:
Two objects are considered consistent if they have the same condition and decision values. For example, in Table 1, the 2nd and the 8th objects are said to be consistent. On the other hand, the 6th and the 7th objects are inconsistent. Inconsistent objects are conflicting objects since they have same selected features but belong to different classes. Rough set model is used in this work to reduce number of inconsistent objects.
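The cardinality operation and the notion of inconsistent objects can be illustrated with a minimal sketch. The table below is hypothetical (it is not Table 1), and the names `card` and `inconsistent_conditions` are introduced here only for illustration. The sketch assumes that the cardinality of a projection is the number of distinct rows left after keeping only the chosen columns.

```python
from collections import defaultdict

# Hypothetical decision table: three symbolic features and a decision
# ("N" = normal, "S" = suspicious). Rows 2 and 8 are consistent;
# rows 6 and 7 share their features but disagree on the decision.
ROWS = [
    ("a", "b", "c", "N"),
    ("a", "b", "d", "S"),
    ("b", "b", "c", "N"),
    ("b", "a", "c", "S"),
    ("a", "a", "d", "S"),
    ("c", "a", "d", "N"),
    ("c", "a", "d", "S"),
    ("a", "b", "d", "S"),
]
COND = [0, 1, 2]  # column indices of the condition attributes
DEC = 3           # column index of the decision attribute


def card(rows, cols):
    """Number of distinct rows after projecting onto the given columns."""
    return len({tuple(r[c] for c in cols) for r in rows})


def inconsistent_conditions(rows, cond, dec):
    """Condition-value combinations that occur with more than one decision."""
    decisions = defaultdict(set)
    for r in rows:
        decisions[tuple(r[c] for c in cond)].add(r[dec])
    return {key for key, seen in decisions.items() if len(seen) > 1}
```

When the projection onto the condition attributes has fewer distinct rows than the projection onto the condition and decision attributes together, at least one combination of feature values maps to both classes.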
3. Fuzzy Logic
Human reasoning can be emulated using fuzzy logic. Fuzzy logic has proved to be a powerful tool for handling and processing noisy and vague data. Fuzzy rules are more flexible than crisp rules for several reasons. They allow partial set membership and overlap between fuzzy set definitions, which should simplify the classification phase, whereas crisp rules are restricted to either membership or nonmembership in a set. Also, they can be expressed in terms of linguistic statements based on expert knowledge. Finally, the interpretability of the results can be improved by fitting fuzzy rules to the labeled observed data.
Fuzzy membership functions are easy to implement and they improve speed of inference engines. The difference between normal and suspicious mammographic images may not be well defined. Figure 1 shows, for example, that the object has a membership degree of 0.7 to the fuzzy set “normal” and 0.3 to the fuzzy set “suspicious”.
Several approaches have been developed for automatic derivation of fuzzy rules from the labeled observed data such as genetic algorithm , Neuro-fuzzy , and fuzzy clustering . In all, the derived fuzzy rules should be accurate, compact, and linguistically interpretable.
Fuzzy if-then rules are used to implement membership function of fuzzy sets as shown in (7):
The weight is a number in the interval [, ] that can be evaluated based on the antecedent numbers. For example, a tested subimage has a membership degree of 0.7 to the fuzzy set “normal” and 0.3 to the fuzzy set “suspicious”. In this case, a single fuzzy if-then rule can be used which produces a classifier output of normal for the tested subimage as shown by the following.
where and are the membership degrees for the membership functions. Applying (9) to the antecedent of (8) will result in selecting the normal fuzzy set from the antecedent with membership degree of 0.7 as follows:
The antecedent results are applied then to the consequent, which is known as the inference step. In this case, the classifier will label the tested subimage as normal.
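The inference step in this example can be sketched as follows, assuming that the antecedent is resolved by taking the maximum membership degree. The function name `classify_by_max_membership` is hypothetical.

```python
def classify_by_max_membership(memberships):
    """Return the fuzzy set with the largest membership degree and that degree.

    `memberships` maps a fuzzy-set name to a membership degree.
    """
    label = max(memberships, key=memberships.get)
    return label, memberships[label]
```

For the tested subimage above, the input would be `{"normal": 0.7, "suspicious": 0.3}`.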
4. Proposed CAD Algorithm
This paper integrates four techniques, namely, PCA, ICA, Rough Set, and Fuzzy classifier to build a CAD system. PCA algorithm is used as a dimensionality and noise reduction tool (prewhitening), and ICA algorithm is used as a feature extraction module while RSM is used as a feature subset selection module followed by a fuzzy classifier.
4.1. Data Preprocessing
119 regions of suspicion (ROS) are manually extracted from MIAS database  based on center of each abnormality of which 51 are malignant and 68 are benign regions. Two sets are formed where the first set is with subimages of size while the second set of size pixels.
Four other sets of normal subimages are randomly and automatically extracted such that the first set is of size and the other sets are of size pixels from the normal MIAS mammograms. Each set has 119 subimages. Each set of ROS is mixed with one set of normal subimages and then divided into two groups: one for the training phase and the other is for the testing phase as shown in Table 2. Figure 2 shows a sample of the extracted subimages.
4.2. Training Phase Using PCA-ICA
A training matrix is constructed by placing training subimages as rows in the matrix where represents number of training subimages (119) and M represents size of each square subimages. PCA algorithm is used to reduce its dimensionality according to the following equation where represents number of selected principal components and represents a matrix with the principal components in its columns sorted by descending order according to their variances
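A minimal NumPy sketch of PCA-based reduction with whitening is given below. It assumes that the training subimages are stored as rows and that each pixel column is centered before the decomposition; the function name `pca_whiten` and the choice of an SVD are illustrative and not taken from the paper.

```python
import numpy as np


def pca_whiten(X, k):
    """Project the rows of X onto the top-k principal components and whiten.

    X: (n_subimages, n_pixels) array. Returns the whitened scores Z, the
    principal directions V_k (as columns), and their variances.
    """
    Xc = X - X.mean(axis=0)                    # center each pixel column (assumption)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                             # columns sorted by descending variance
    var = (s[:k] ** 2) / (X.shape[0] - 1)      # variance along each component
    Z = (Xc @ V_k) / np.sqrt(var)              # unit-variance, uncorrelated scores
    return Z, V_k, var
```

The whitened scores have an identity sample covariance, which is the property that makes PCA a convenient preprocessing step for ICA.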
In this paper, ICA scheme is based on minimizing the mutual information of the source components which can be achieved using cumulants. This is proposed (a modified version of ) in order to estimate the separating matrix and the independent source region matrix in an unsupervised mode as follows.
(i) is initialized to the identity matrix. Then, is calculated using the following equation. This means that ICA is performed on a set of linear combinations of the original subimages instead of performing it on all subimages. This should reduce its computational complexity and hence increase its speed:
(ii) The change in is calculated using the natural gradient , that is,
where is the learning rate (step size), is the identity matrix, and must be a nonlinear and nonfast growing function. This function is used to measure the statistical dependence between the source components. In this paper,  is used as follows:
where and are the 3rd and 4th cumulants and () indicates Hadamard product of two matrices and
as were defined in .
(iii) The momentum method is used to boost the convergence speed of (13) using
where is in the range [, ]. In this paper, alpha is chosen to be 0.5.
(iv) The separating matrix is updated and then normalized:
(v) Stop the algorithm when converges.
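Steps (i) to (v) can be sketched as a natural-gradient update with momentum. This is a sketch under stated assumptions, not the paper's implementation: `tanh` stands in for the cumulant-based nonlinearity, the rows of the separating matrix are scaled to unit norm after each update, and convergence is tested on the largest change in the matrix. The function name and the `n_iter` and `tol` parameters are hypothetical; the default learning rate (0.001) and momentum constant (0.5) are the values reported in the text.

```python
import numpy as np


def natural_gradient_ica(Z, lr=0.001, alpha=0.5, n_iter=500, tol=1e-6):
    """Estimate a separating matrix W for data Z of shape (n_components, n_samples)."""
    n, T = Z.shape
    W = np.eye(n)                      # (i) initialize W to the identity
    dW_prev = np.zeros_like(W)
    for _ in range(n_iter):
        U = W @ Z                      # current source estimates
        # (ii) natural-gradient direction; tanh is an assumed stand-in
        grad = (np.eye(n) - (np.tanh(U) @ U.T) / T) @ W
        # (iii) momentum: add a fraction alpha of the previous update
        dW = lr * grad + alpha * dW_prev
        # (iv) update W, then normalize each row to unit length
        W_new = W + dW
        W_new /= np.linalg.norm(W_new, axis=1, keepdims=True)
        converged = np.max(np.abs(W_new - W)) < tol
        W, dW_prev = W_new, dW
        if converged:                  # (v) stop when W stops changing
            break
    return W
```

A `tanh` nonlinearity is typically suited to super-Gaussian sources; a different function would be needed to reproduce the cumulant-based rule described above.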
Finally, the reduced dimensionality selected features can be estimated as follow.
4.3. Testing Phase Using PCA-ICA
First, a testing matrix is constructed, where each testing subimage forms a row in the matrix. Second, its rows are normalized by their mean. Third, the regions in the testing matrix are projected onto the reduced data from the training procedure using (22):
4.4. Mapping into a Limited Range
The estimated matrices and contain rows where each row contains selected features from the corresponding subimage. A linear stretching method is used to map them into a limited range of [0, r] using (24):
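A linear stretch into [0, r] can be sketched as follows. Rounding to integers is an assumption made here so that the mapped features form discrete attribute values for the rough set stage; the function name `map_to_range` and the optional `lo` and `hi` arguments (for reusing training bounds on testing data) are also hypothetical.

```python
import numpy as np


def map_to_range(F, r, lo=None, hi=None):
    """Linearly stretch feature values into [0, r] and round to integers."""
    F = np.asarray(F, dtype=float)
    lo = F.min() if lo is None else lo
    hi = F.max() if hi is None else hi
    if hi == lo:                       # constant input: map everything to 0
        return np.zeros(F.shape, dtype=int)
    scaled = (F - lo) / (hi - lo) * r
    # clip so values outside [lo, hi] still land inside the range
    return np.rint(np.clip(scaled, 0, r)).astype(int)
```

A smaller `r` merges more feature values into the same level, which is one way to see why a limited range trades accuracy for simpler processing.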
4.5. Rough Set Model
There are some inconsistent elements (subimages) in the estimated matrices and . These elements have same selected features but belong to different classes. Rough Set Reduction is used as a subset selection in order to remove features that cause inconsistency and thus improve classification results.
4.5.1. Training Phase
The proposed training framework can be summarized as follows.
(1) The consistent elements from the training matrix are removed. The resulting matrix is where .
(2) Construct the decision matrix, , where contains the condition attributes (selected features from PCA-ICA phase) and is the decision attribute (1: abnormal, 0: normal).
(3) Find the Core attributes using the following procedure.
(i) Initialize Core vector into .
(ii) Check the cardinality for each attribute ; if it satisfies then update core vector as .
(4) Find Reduct attributes using the following procedure which is a modified version of .
(i) Initialize Reduct vector: Reduct = Core.
(ii) Set and compute the significance of its attributes using:
(iii) Let be the attribute with the largest significance value, update Reduct as:
(iv) Update .
(v) If or the significance values of the remaining attributes are zeros, stop the procedure. Equation (26) means that Reduct has inconsistent elements (with ratio of ) greater than or equal to that of the decision matrix:
(vi) Else, go to step (ii).
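The Core and Reduct search can be sketched with cardinality operations alone. The sketch below rests on assumptions: the core test and the merit value follow the projection-cardinality definitions commonly attributed to Hu et al., and the stopping ratio compares the inconsistency of the current Reduct with that of the full decision matrix. All function names and the toy table are hypothetical, and the paper's modified procedure may differ in its details.

```python
def card(rows, cols):
    """Number of distinct rows after projecting onto the given columns."""
    return len({tuple(r[c] for c in cols) for r in rows})


def find_core(rows, cond, dec):
    """An attribute is core if removing it creates new inconsistency."""
    core = []
    for a in cond:
        rest = [c for c in cond if c != a]
        if card(rows, rest + [dec]) != card(rows, rest):
            core.append(a)
    return core


def inconsistency_ratio(rows, attrs, dec):
    """Card(attrs + decision) / Card(attrs); equals 1 for a consistent table."""
    return card(rows, attrs + [dec]) / card(rows, attrs)


def find_reduct(rows, cond, dec, threshold=1.0):
    """Greedy Reduct search: start from Core, add the highest-merit attribute."""
    reduct = find_core(rows, cond, dec)
    remaining = [c for c in cond if c not in reduct]
    target = inconsistency_ratio(rows, cond, dec)
    full = card(rows, cond + [dec])
    while remaining:
        # assumed stopping rule: Reduct is as consistent as the full table
        if target / inconsistency_ratio(rows, reduct, dec) >= threshold:
            break
        merits = {
            a: 1 - card(rows, [c for c in cond if c != a] + [dec]) / full
            for a in remaining
        }
        best = max(remaining, key=merits.get)
        if merits[best] <= 0:          # no remaining attribute is significant
            break
        reduct.append(best)
        remaining.remove(best)
    return reduct


# Hypothetical table: columns 0-2 are condition attributes, column 3 the decision.
TOY = [
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
    (0, 0, 2, 0),
]
```

On the toy table, attribute 0 is the only core attribute, and adding attribute 2 makes the Reduct as consistent as the full table, so the search stops with two of the three attributes.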
4.5.2. Testing Phase
In this step, features are selected from the matrix in the same order they were selected from during the training phase.
Finally, and are reconstructed with selected Reduct features while dispensable features are thrown away.
4.6. Fuzzy Classifier
Two single fuzzy if-then rules are used to represent the normal and abnormal fuzzy sets. The membership functions of each antecedent fuzzy set are aggregated using the information about the selected feature values of the training subimages.
The proposed fuzzy-based classification algorithm can be summarized as follows:
(1) Two activation functions and are initialized to 0 where each element of them represents the aggregated membership functions of the selected feature values for the corresponding testing subimage. These parameters are defined as follows.
(i) represents the membership degree of the kth testing subimage to the fuzzy set abnormal.
(ii) represents the membership degree of the kth testing subimage to the fuzzy set normal where .
(2) Using (27), membership functions of fuzzy sets of the testing subimages are obtained from the mean and standard deviation of their selected features based on the information from the selected feature values of the training subimages: where represents mean of all samples of the current selected feature , represents their standard deviation, and is an index for the selected features from the training phase.
(3) The membership functions are normalized using
(4) The membership functions are aggregated using (29) in order to find the degree of activation of each fuzzy set where is an index for the selected features from the testing phase:
(5) By assigning the corresponding testing subimage into the fuzzy set with the maximum degree of activation, a crisp decision is made, that is, normal or abnormal. Equation (30) is used for this purpose where is used as an index of a testing subimage being identified as normal or abnormal:
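The classification steps can be sketched as follows. Several choices here are assumptions rather than the paper's definitions: Gaussian membership functions built from the per-class mean and standard deviation of each training feature, normalization so that the two memberships of a feature sum to one, and aggregation by summing over features. The function names are hypothetical.

```python
import numpy as np


def fit_class_stats(F_train, y_train):
    """Per-class mean and standard deviation of each selected feature."""
    stats = {}
    for label in (0, 1):               # 0: normal, 1: abnormal
        X = F_train[y_train == label]
        stats[label] = (X.mean(axis=0), X.std(axis=0) + 1e-12)
    return stats


def fuzzy_classify(F_test, stats):
    """Label each testing row by the fuzzy set with the larger activation."""
    # assumed Gaussian membership of every feature value to each fuzzy set
    member = {
        label: np.exp(-0.5 * ((F_test - mu) / sd) ** 2)
        for label, (mu, sd) in stats.items()
    }
    total = member[0] + member[1] + 1e-12          # normalization term
    act_normal = (member[0] / total).sum(axis=1)   # aggregate over features
    act_abnormal = (member[1] / total).sum(axis=1)
    return (act_abnormal > act_normal).astype(int)
```

Each testing row receives the label of the fuzzy set with the larger aggregated activation, which corresponds to the crisp decision in step (5).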
5. Experimental Results
Table 3 presents results of using PCA-ICA-Rough-Fuzzy (PIRF), PCA-ICA-Fuzzy (PIF), PCA-Fuzzy (PF), PCA-Rough-Fuzzy (PRF), ICA-Fuzzy (IF), and ICA-Rough-Fuzzy (IRF) in terms of accuracy, recall, precision, FN rates, and FP rates as computer-aided detection systems. Algorithm accuracy is defined as the ratio between the total number of correctly classified subimages to the total number of testing subimages.
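The reported measures can be computed from confusion-matrix counts as sketched below. Recall and precision follow their standard definitions. Expressing FN and FP rates as fractions of all testing subimages is an assumption, chosen because the reported average accuracy, FP rate, and FN rate of each system sum to approximately 100%. The function name `detection_metrics` is hypothetical.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and FN/FP rates from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # correctly classified / all tested
        "recall": tp / (tp + fn),        # detected abnormal / all abnormal
        "precision": tp / (tp + fp),     # true abnormal / labeled abnormal
        "fn_rate": fn / total,           # assumed: fraction of all subimages
        "fp_rate": fp / total,           # assumed: fraction of all subimages
    }
```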
Table 4 compares the performance of these CAD systems. Our proposed PIRF CAD system shows a robust performance in comparison with the other algorithms. For example, PIRF achieved an average accuracy of 77.73%, PIF of 75.21%, IRF of 74.16%, PRF of 71.85%, PF of 71.64%, and IF of 49.58%. As Table 3 shows, PIRF has the highest recall percentage among all the other algorithms while it has an average precision of 73.33%. PIF and IRF have average precision of 75.83% each.
As the results show, fuzzy classifier cannot be implemented with ICA model alone without a dimensionality reduction since, without it, a large number of membership functions will be generated. Also, without a feature subset selection module, the classifier task complexity is increased and performance is degraded. Furthermore, results indicate that integrating ICA model with PF generated better results than integrating RSM with PF. The average accuracy was improved by 4.68% and false negative rates were improved by 4.76% if a PCA model was used with the ICA model while following it with RSM improved its average accuracy by 0.29% and its FN rates by 6.33%. Integrating RSM improved total PF algorithm performance by 0.29% but degraded its FN rates by 6.34%. Results also indicate that RSM and PIF integration improves accuracy with an average of 3.35%.
Comparing the results using FN rates, we find that PIRF has an FN of 8.82%, PIF of 12.61%, IRF of 13.66%, PF of 13.24%, PRF of 14.08%, and IF of 40.34%. Results indicate that using PCA as a dimensionality reduction module reduces FN rates in PIRF and PF at the expense of a little increase in the FP rates. Also, average FN rates are very close to average FP rates in PIF and PRF algorithms. On the other hand, average FN rates are increased in IRF and IF algorithms when no dimensionality reduction was integrated. Finally, integrating RSM into PIF and PF algorithms reduces the number of principal components required to obtain Reduct. The previous discussion shows that each one of the integrated techniques (PCA, ICA, RSM, and Fuzzy Classifier) is necessary and should be implemented in the proposed sequence in order to achieve the highest accuracy rates.
As Table 3 shows, an implementation of the previously proposed PIF algorithm achieves a lower accuracy than our proposed PIRF system in two testing sets, while the two have the same accuracy in the other two testing sets. The average accuracy over all test sets is 75.21% for PIF and 77.73% for PIRF. FN rates improved in three testing sets for PIRF in comparison with PIF. The average FP and FN rates are 12.19% and 12.61%, respectively, for PIF and 13.45% and 8.82% for PIRF. These observations are summarized in Table 5.
The average accuracy of PIF improved by 3.35% with the PIRF system, and its average FN rate improved by 30.01%. Also, the average number of selected principal components in the PIRF algorithm (7.75) is less than that of the PIF algorithm (9.75). In another classification study, three sets of subimages of different sizes were extracted from MIAS mammographic images, where each set consists of 330 subimages. The reported results were 65.71%, 59.36%, and 82.22% for the three sets using the ICA-Rough algorithm and 81.9%, 88.57%, and 69.27% using the PCA-Rough algorithm.
The proposed CAD system uses several parameters that impact performance accuracy such as number of the principal components in the PCA algorithm, learning rate and alpha in the ICA algorithm, threshold in the Reduct process, and mapping range.
Number of PCs Selected
Reducing data dimensionality using PCA module affects PIRF algorithm accuracy. When large number of principal components is selected, extracted features will have redundant information and therefore will degrade the performance accuracy. However, if a small number is selected, extracted features cannot be estimated precisely and the fuzzy classifier performance will also be degraded.
Table 6 shows the highest accuracy for four testing sets using different numbers of the selected principal components while the other parameters are kept constant. Results also show that selecting less than 9 principal components achieves best results in all cases which means that less than 0.65% of the image features are selected for the subimages and less than 0.4% of the image features are selected for the subimages. This is in agreement with all reported literature that used PCA algorithm for dimensionality reduction [14, 15].
On the other hand, Figure 3 shows the Receiver operating characteristic (ROC) plot for a different number of selected principal components for testing set number 4. This figure is generated by plotting true positive rates against false positive rates. As the figure indicates, selecting five principal components produces the largest area under the curve which means that it produces the highest average accuracy.
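An ROC curve and its area can be computed as sketched below, assuming that each testing subimage has a continuous score (for example, the difference between its abnormal and normal activations) and that both classes are present. The function names are hypothetical, and tied scores are not treated specially.

```python
import numpy as np


def roc_points(scores, labels):
    """False- and true-positive rates as the decision threshold is lowered."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")     # highest score first
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels == 1)            # abnormal (positive) so far
    fps = np.cumsum(sorted_labels == 0)            # normal (negative) so far
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

Comparing this area across parameter settings is one way to rank them, as done in the ROC figures.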
The estimation of the separating matrix and the source matrix is affected by the learning rate, which determines the speed and accuracy of convergence to the optimal value. Since the optimal values of these matrices are unknown and data dependent, the optimal value of the learning rate cannot be estimated adaptively. Also, since the learning rate represents the step size, choosing a small value ensures accuracy but reduces the speed of convergence. The impact of the learning rate on the four testing sets is shown in Figures 4, 5, 6, and 7, where all parameters were kept fixed except for the learning rate. Figure 8 shows the ROC plot for different values of the learning rate for testing set number 4. As the figure indicates, the smallest value of the learning rate (0.001) produces the largest area under the curve, which means that it produces the highest average accuracy.
Momentum Method Constant
This constant determines the fraction of the previous update of the separating matrix that is added to the current update in order to increase the convergence speed. Since the update uses the natural gradient to find the direction toward a minimum point, adding its previous value to its current value pushes it toward the minimum point faster but does not change its direction.
In investigating the mapping range values’ effect on the accuracy of the results, we found that mapping the data into a limited range results in accuracy loss but simplifies computational complexity and processing time. Figures 9, 10, 11, and 12 show accuracy results versus mapping range for four testing sets while other parameters are kept constant. Figure 13 shows the ROC plot for different values of the mapping range for testing set number 4. As the figure indicates, choosing a mapping range in the interval produces the largest area under the curve which correlates with the highest average accuracy.
A threshold value is necessary in (26) as a criterion to stop the Reduct procedure. This determines the number of selected features and consequently affects the classifier accuracy. Table 7 shows the impact of the threshold on the results of test set number 1. These results indicate that selecting a threshold equal to 1 achieves the highest performance. The optimum value is the one at which the Reduct attributes are complete, that is, at which the number of inconsistent rows equals that of the decision matrix. Furthermore, the cropped size impacts the accuracy of the results as shown in Table 3. As the table shows, the larger subimages resulted in the highest accuracy.
6. Concluding Remarks
A computer-aided detection system has been developed and implemented by integrating PCA, ICA, RSM, and a fuzzy classifier. Its performance is compared against the performance of PCA-ICA-Fuzzy, PCA-Fuzzy, PCA-Rough-Fuzzy, ICA-Fuzzy, and ICA-Rough-Fuzzy algorithms.
Results from Tables 3 and 4 indicate that the PCA algorithm should be used in order to reduce FN rates at the expense of FP rates. It is shown that integrating RSM and PCA in one algorithm allows a lower number of principal components to be selected while maintaining the performance accuracy, as opposed to using PCA without RSM. Using the ICA model and fuzzy classifier produced a CAD system with poor performance unless PCA was used for dimensionality reduction. RSM is used for further feature reduction in order to reduce data inconsistency and consequently improve classifier performance. Results also indicate that the PCA algorithm should be followed by the ICA algorithm instead of RSM. Results of Table 3 indicate that the proposed PIRF algorithm is robust in comparison with the other algorithms. Finally, the proposed CAD algorithm reduces the FN rates considerably, which is the main concern of CAD systems.
Parameter values as well as block size play a vital role in the system’s performance and an investigation of this relation and perhaps automation of their selection is needed to further improve system’s robustness. Although cumulants offer simple computations, they are sensitive to outliers (large values within the set). Therefore, an alternative route that may be worthwhile to investigate is to use a learning rule of the ICA algorithm that is based on negentropy instead of cumulants.
The authors would like to acknowledge Western Michigan University for its support and contributions to the Information Technology and Image Analysis (ITIA) Center, funded by the National Science Foundation Grant (MRI-0215356).
National Cancer Institute, U.S. National Institute of Health, May 2009, http://www.cancer.gov/cancertopics/types/breast.
J. Billingsley, “Radiologists' mammogram accuracy varies widely,” June 2005, http://news.healingwell.com/index.php?p=news1&id=526229.
A. E. Hassanien and J. M. H. Ali, “Enhanced rough sets rule reduction algorithm for classification digital mammography,” Journal of Intelligent Systems, vol. 13, no. 2, pp. 151–171, 2004.
X. Hu, T. Y. Lin, and J. Han, “A new rough sets model based on database systems,” Fundamenta Informaticae, vol. 59, no. 2-3, pp. 135–152, 2004.
C. Cornelis and R. Jensen, “A noise-tolerant approach to fuzzy-rough feature selection,” in Proceedings of the 17th IEEE International Conference on Fuzzy Systems, pp. 1598–1605, 2008.
A.-M. Yang, Y.-X. Yang, and S.-Y. Jiang, “Approaches of individual classifier generation and classifier set selection for fuzzy classifier ensemble,” in Proceedings of the 5th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '08), vol. 1, pp. 519–524, 2008.
A. Çelikyilmaz, I. B. Türksen, R. Aktas, M. Mete Doganay, and N. Basak Ceylan, “A new classifier design with fuzzy functions,” in Proceedings of the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, vol. 4482 of Lecture Notes in Computer Science, pp. 136–143, Toronto, Canada, May 2007.
R. Swiniarski, H. K. Lim, J. H. Shin, and A. Skowron, “Independent component analysis, principal component analysis and rough sets in hybrid mammogram classification,” in Proceedings of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV '06), vol. 2, pp. 640–645, Las Vegas, Nev, USA, June 2006.
A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.
G.-J. Jang and T.-W. Lee, “A maximum likelihood approach to single-channel source separation,” Journal of Machine Learning Research, vol. 4, no. 7-8, pp. 1365–1392, 2003.
M. Girolami and C. Fyfe, “Negentropy and kurtosis as projection pursuit indices provide generalized ICA algorithms,” in Advances in Neural Information Processing Systems, Blind Signal Separation Workshop, pp. 249–266, Aspen, Colo, USA, December 1996.
A. Riid and E. Rustern, “Neuro-fuzzy extraction of interpretable fuzzy rules from data,” in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '04), vol. 3, pp. 2266–2271, The Hague, The Netherlands, October 2004.
S.-I. Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.