Abstract

Breast cancer is the most prevalent cancer among women across the globe. Automatic detection of breast cancer using a Computer Aided Diagnosis (CAD) system suffers from false positives (FPs). Thus, reduction of FPs is one of the challenging tasks in improving the performance of diagnosis systems. In the present work, a new FP reduction technique is proposed for breast cancer diagnosis. It is based on the appropriate integration of preprocessing, Self-organizing map (SOM) clustering, region of interest (ROI) extraction, and FP reduction. In preprocessing, contrast enhancement of mammograms is achieved using a Local Entropy Maximization algorithm. The unsupervised SOM clusters an image into a number of segments to identify the cancerous region and extracts tumor regions (i.e., ROIs). However, it also detects some FPs, which affects the efficiency of the algorithm. Therefore, to reduce the FPs, the output of the SOM is passed to the FP reduction step, which aims to classify the extracted ROIs into normal and abnormal classes. FP reduction consists of feature mining from the ROIs using the proposed local sparse curvelet coefficients, followed by classification using an artificial neural network (ANN). The performance of the proposed algorithm has been validated using a local dataset, TMCH (Tata Memorial Cancer Hospital), and the publicly available MIAS (Suckling et al., 1994) and DDSM (Heath et al., 2000) databases. The proposed technique reduces FPs from 0.85 to 0.02 FP/image for MIAS, 4.81 to 0.16 FP/image for DDSM, and 2.32 to 0.05 FP/image for TMCH, reflecting a large improvement in the classification of mammograms.

1. Introduction

Breast cancer is the most common cancer among women worldwide. It is the leading cause of death for women suffering from cancer in India. It is estimated that breast cancer cases in India would reach as high as 1,797,900 by 2020 [1]. The rising incidence can cause high mortality, owing to a lack of awareness about breast screening, late reporting, and insufficient medical access [2]. This makes early-stage screening for breast cancer necessary to improve survival. Among all techniques, namely, mammography, tomosynthesis, ultrasonography, computed tomography, and magnetic resonance, mammography is the modality most relied on and accepted by radiologists for preliminary examination of breast cancer due to its cost benefits and accessibility [3–5]. The diagnosis of breast cancer from mammograms varies from expert to expert, as symptoms are misinterpreted or overlooked due to the tedious task of screening mammograms. Studies reveal that 10% to 30% of the visible cancers on mammograms are overlooked, and only 20% to 30% of biopsies are positive [6–8]. Biopsies are traumatic and costly; therefore, computer aided detection and diagnosis (CAD) systems combined with expert radiologists’ experience would provide a more comprehensive diagnosis [9]. A detailed survey of research on the design of CAD systems is given in the next section.

2. Literature Survey

The design and development of CAD systems is an important and progressing area of research, covering contrast enhancement for better visualization and clarification [10–12], pectoral muscle removal, segmentation for better delineation of the region of interest (ROI), extraction of features, and classification [13, 14]. Segmentation methods are classified as region-based, contour-based, and clustering methods [15]. The region- and contour-based methods are popular among researchers. Görgel et al. [16] developed the Local Seed Region Growing-Spherical Wavelet Transform (LSRG–SWT) algorithm and reported classification accuracies of 94% and 91.67% on a local dataset and MIAS [17], respectively. Pereira et al. [18] presented segmentation and detection of masses in mammograms using the wavelet transform and a genetic algorithm, which gave an FP rate of 1.35 FP/image and a sensitivity of 95% on DDSM [19]. Rouhi et al. [20] studied segmentation using region growing, Cellular Neural Network (CNN), and ANN. The classification results varied from 80 to 96%, which is the main weakness of their study. Berber et al. [21] proposed the Breast Mass Contour Segmentation (BMCS) approach and showed 6 FPR for a local dataset. A hybrid level set segmentation method [22], based on a combination of region growing and level sets, was used to segment tumors. The results showed that the sensitivity varied from 78 to 100% due to the presence of artifacts in the MIAS database. The difficulty in region- and contour-based segmentation methods is the appropriate initialization of the seed point and contour position.

Several researchers have implemented clustering methods such as K-means and Fuzzy C-means (FCM) for breast abnormality segmentation [3, 23]. However, these have limitations in terms of learning ability. Learning-based techniques such as the Self-organizing map (SOM) [24] have been used successfully in medical image segmentation [25], which motivated the choice of SOM for mammogram segmentation in this work. Often, the segmented tumor regions are not abnormal tissue (cancerous regions); these are known as false positives (FPs). FPs consume much of the radiologists’ time and result in unnecessary biopsies. Reducing FPs is thus an open research problem, and various researchers have proposed FP reduction algorithms to improve the specificity of CAD systems [5, 9, 23, 26–31]. Usually, the FP reduction algorithm is a postprocessing step of a CAD system with two stages, namely, feature extraction and classification. Various feature extraction methods have been developed based on wavelets [8, 18, 32], curvelets [33, 34], Gabor filters [35, 36], morphological descriptors [20], textural analysis [26, 27, 30, 32], histograms [4, 5, 7, 29, 37–40], etc. Segmentation errors can reduce the performance of morphological descriptors. When the Gray Level Co-occurrence Matrix (GLCM) of normal and abnormal regions in a dense mammogram is the same, the texture descriptors overlap, which leads to a larger number of FPs [37]. Ojala et al. proposed local binary patterns (LBPs) [41] for textural feature extraction, which work well compared with morphological descriptors and GLCM-based textural descriptors. The LBP descriptor can be considered as capturing local microstructures, namely, edges, flat areas, spots, etc. Variants of LBP have been proposed by various researchers to achieve rotation- and intensity-invariant features. LBP is also computationally efficient and extracts robust features; therefore, LBP descriptors have been widely applied in FP reduction and classification methods for mammogram images [29, 37, 39, 40]. However, the LBP descriptor does not provide the directional information of local micropatterns. Therefore, a transform technique such as the curvelet transform has been combined with LBP to extract features. Various curvelet-based approaches have been proposed in the literature [8, 33, 34, 42], which conclude that the curvelet transform outperforms the wavelet transform.

In this work, a novel method is presented that extracts sparse curvelet subband coefficients by incorporating knowledge of the irregular shape of masses, as they appear in a sparse matrix, and then calculates LBP features. The paper therefore presents the following scheme:
(1)Preprocessing of the mammogram image for contrast enhancement using a local entropy maximization-based image fusion algorithm and removal of background noise
(2)Cluster-based segmentation of mammograms using SOM and extraction of tumor regions (i.e., ROIs)
(3)FP reduction: extraction of sparse curvelet subband coefficients and computation of the LBP descriptor to classify true positives and false positives, improving the performance of the CAD system on the MIAS [17], DDSM [19], and Tata Memorial Cancer Hospital (TMCH) datasets

The organization of paper is as follows: Sections 1 and 2 illustrate the introduction and literature review on automatic segmentation and extraction of abnormal masses (i.e., tumor region) as well as FP reduction methods. Section 3 presents the proposed methodology for SOM based segmentation of mammograms followed by novel false positive reduction in detail. Section 4 depicts the experimental results and discussions on three benchmark datasets. Finally, Section 5 concludes the proposed approach for accurate extraction of abnormal masses (i.e., tumor region) by excluding the FPs.

3. Methodology

The block schematic of proposed integrated method for automatic detection of breast cancer using sparse curvelet coefficient-based LBP descriptor has been shown in Figure 1.

3.1. Preprocessing

Mammograms are low-dose X-ray images, so they have poor contrast and suffer from noise. Figures 2(a)–2(d) show the preprocessing of a mammogram, and Figures 2(e)–2(g) show SOM clustering and ROI extraction.

3.1.1. Local Entropy Maximization-Based Image Fusion: Contrast Enhancement

The contrast of the mammogram is enhanced using local entropy maximization [12] for better segmentation. The original image is first passed to the contrast limited adaptive histogram equalization (CLAHE) algorithm to obtain the second input to the image fusion algorithm. The original image and its CLAHE-enhanced version are then given to the image fusion algorithm, whose procedure is listed in Algorithm 1. Local entropy is used as the fusion rule, given by the following equation:

$$E = -\sum_{i} p_i \log_2 p_i,$$

where $E$ is the local entropy and $p_i$ is the probability of pixel value $i$ within a 5 × 5 sliding window [12]. The high frequency components of the original mammogram and the CLAHE mammogram are fused using the maximum entropy criterion. Figure 3(b) presents the contrast-enhanced mammogram obtained using local entropy maximization-based image fusion.
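As a minimal sketch of the fusion rule, the following Python snippet computes the Shannon entropy of a window and keeps the coefficient whose window has the larger entropy. The function names, the flat-list window representation, and the tie-breaking rule (ties keep the first input) are assumptions made for illustration; they are not taken from [12].

```python
import math
from collections import Counter

def local_entropy(window):
    """Shannon entropy (in bits) of the values in a flat list of window samples."""
    counts = Counter(window)
    total = len(window)
    # p * log2(1 / p) is the same as -p * log2(p) and avoids a negative zero.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def fuse_by_max_entropy(coef1, coef2, win1, win2):
    """Keep the coefficient whose surrounding window has the higher entropy.

    Assumption: on a tie the coefficient from the first image is kept.
    """
    return coef1 if local_entropy(win1) >= local_entropy(win2) else coef2

flat = [7] * 25                # constant 5 x 5 window: entropy is 0
mixed = [0] * 12 + [1] * 13    # two values, almost balanced: close to 1 bit
print(local_entropy(flat), round(local_entropy(mixed), 3))
print(fuse_by_max_entropy(0.2, 0.9, flat, mixed))
```

In this toy case the second window carries more information, so its coefficient (0.9) is the one retained.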

(1)Load input image (img1)
(2)Apply CLAHE algorithm and obtain enhanced image (img2)
(3)Decompose img1 and img2 up to 3 levels using the Discrete Wavelet Transform (DWT)
(4)Use the maximum local entropy rule to fuse the high frequency subbands of img1 and img2
(5)Take the inverse DWT to obtain the fused image

3.1.2. Pectoral Muscle Removal

Pectoral muscle suppression is performed by defining a rectangle as suggested in [14] (Figure 3(c)). The figure shows the rectangle (ABDC); the points G and H are fixed at the intensity variation and joined to suppress the pectoral muscle. Figure 3(d) shows the image after pectoral muscle removal, which avoids discrepancies in the algorithm caused by the similar intensities of the pectoral muscle and masses.

3.2. SOM Clustering

SOM is a special type of neural network designed to map the pixels of an input image to M clusters based on their characteristic features [25]. For SOM, the image $I$ is converted into feature vectors $F \in \mathbb{R}^{m}$, one per pixel, where m is the number of features. In this experiment, we trained the SOM with M = 4 clusters using neighbourhood features: given a centre pixel $x_c$ in the image, the neighbourhood features are computed as given in the following equation:

$$F(x_c) = [x_1, x_2, \ldots, x_n],$$

where n is the number of neighbourhood pixels (3 × 3 window), $x_i$ are the neighbourhood pixels, and $F$ is the feature vector corresponding to the centre pixel $x_c$. The 3 × 3 window is chosen following [43] to capture local details.
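The neighbourhood feature extraction can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the centre pixel is included in the 9-value vector, and borders are handled by clamping indices (edge replication), none of which is specified in the text.

```python
def neighbourhood_features(image, row, col):
    """Return the 3 x 3 neighbourhood of (row, col) as a flat list of 9 values.

    Assumption: border pixels are handled by clamping indices to the image
    (edge replication); the border policy is not stated in the paper.
    """
    height, width = len(image), len(image[0])
    features = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r = min(max(row + dr, 0), height - 1)
            c = min(max(col + dc, 0), width - 1)
            features.append(image[r][c])
    return features

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(neighbourhood_features(image, 1, 1))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(neighbourhood_features(image, 0, 0))  # [1, 1, 2, 1, 1, 2, 4, 4, 5]
```

One such vector per pixel would then form the training set for the SOM.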

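At the start, the weight vectors are random; they are updated as the network learns. The node whose weight vector has the minimum Euclidean distance to the input is the best matching component, or winner node, and is described as

$$c = \arg\min_{i} \lVert F - w_i \rVert,$$

where $F$ is the input feature vector and $w_i$ is the weight vector of node i.

The weight vectors of the winning output neuron and its neighbouring neurons are updated as

$$w_i(t+1) = w_i(t) + h_{ci}(t)\,[F(t) - w_i(t)],$$

where $t$ is the time coordinate. The function $h_{ci}(t)$ is the neighbourhood kernel function, expressed as

$$h_{ci}(t) = \alpha(t)\,\exp\!\left(-\frac{\lVert r_c - r_i \rVert^{2}}{2\sigma^{2}(t)}\right),$$

where $\alpha(t)$ is the learning rate, $\sigma(t)$ is the width of the kernel that corresponds to the neighbourhood of neurons around node c, and $r_c$ and $r_i$ are the location vectors of nodes c and i.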
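A single SOM training step, under the standard Kohonen formulation with a Gaussian neighbourhood kernel, might look like the sketch below. The one-dimensional map layout, the initial weights, and the values of the learning rate and kernel width are hypothetical and chosen only to keep the example small.

```python
import math

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x (Euclidean)."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))

def som_update(weights, positions, x, alpha, sigma):
    """One SOM step: find the winner c, then pull every node towards x,
    scaled by a Gaussian kernel of its map distance to c."""
    c = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        grid_sq = sum((a - b) ** 2 for a, b in zip(positions[c], positions[i]))
        h = alpha * math.exp(-grid_sq / (2.0 * sigma ** 2))
        weights[i] = [wi + h * (xi - wi) for wi, xi in zip(w, x)]
    return c

# Hypothetical 1-D map with M = 4 nodes and 2-D inputs.
weights = [[0.0, 0.0], [0.3, 0.3], [0.6, 0.6], [1.0, 1.0]]
positions = [(0,), (1,), (2,), (3,)]
winner = som_update(weights, positions, [0.9, 0.9], alpha=0.5, sigma=1.0)
print(winner, weights[3])
```

Here node 3 wins and moves halfway towards the input, while its map neighbours move by a smaller, distance-dependent amount. In practice the learning rate and kernel width are decreased over time.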

Figures 4(a) and 4(b) show the cluster map and the cluster boundaries marked on the mammogram. After several observations on known areas, it was empirically found that regions containing an abnormality have pixel counts in the range 450 to 31,500 for MIAS, 16,000 to 200,000 for DDSM, and 4,000 to 200,000 for TMCH; these ranges, verified by the expert, are used as the pixel level threshold (PLT, based on the pixel count in a TP). The tumor size range differs between databases because the mammogram size differs: 1024 × 1024 pixels for MIAS and 4096 × 3328 or 2294 × 1914 pixels for TMCH, with DDSM image sizes varying from case to case. Therefore, cluster regions below or above the specified threshold are discarded, and the remaining region is marked as true positive (TP), as shown in Figure 4.

We can see that there are many FPs along with TP (marked by pink color) which are reduced using pixel level threshold (PLT based on pixel count in TP) as explained above. Figure 4(c) shows the filtered result using PLT.
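The PLT filtering step amounts to a range check on the pixel count of each clustered region. The sketch below uses the ranges quoted in the text; the dictionary layout, the function name, the inclusive bounds, and the example region sizes are assumptions.

```python
# Pixel-count ranges quoted in the text; the dictionary layout is an assumption.
PLT_RANGES = {
    "MIAS": (450, 31_500),
    "DDSM": (16_000, 200_000),
    "TMCH": (4_000, 200_000),
}

def filter_regions_by_plt(region_sizes, dataset):
    """Keep only regions whose pixel count lies inside the dataset's PLT range.

    region_sizes maps a region label to its pixel count.
    Assumption: both bounds are inclusive.
    """
    low, high = PLT_RANGES[dataset]
    return {label: size for label, size in region_sizes.items()
            if low <= size <= high}

# Hypothetical region sizes for one MIAS mammogram.
regions = {1: 120, 2: 5_200, 3: 48_000}
print(filter_regions_by_plt(regions, "MIAS"))  # {2: 5200}
```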

3.3. ROI Extraction

After SOM clustering (initial segmentation), the next step is to classify the detected regions into TP and FP using the proposed local sparse curvelet features (LSCF) followed by an ANN classifier. To do so, we first extracted ROIs from the regions detected by SOM clustering and manually categorized them into TP and FP. The ROIs were collected from the three datasets by cropping each connected component to its maximum height and maximum width (e.g., the region marked in Figure 4(c)). Their patch sizes therefore differ, as shown in Figure 5 for the MIAS, DDSM, and TMCH datasets. These extracted patches are then used to train the ANN for the task of FP reduction.
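Cropping each connected component to its maximum height and width can be sketched with a breadth-first search over a binary mask. The use of 4-connectivity and the helper names are assumptions; the paper does not state which connectivity was used.

```python
from collections import deque

def bounding_boxes(mask):
    """Bounding box (top, left, bottom, right) of each 4-connected region of 1s."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            top, left, bottom, right = r0, c0, r0, c0
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes

def crop(image, box):
    """Cut the patch covered by an inclusive bounding box out of the image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]

mask = [[0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 1]]
print(bounding_boxes(mask))  # [(0, 1, 1, 2), (1, 3, 2, 4)]
```

Because each box follows the extent of its own region, the resulting patches naturally differ in size, as described above.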

3.4. False-Positive (FP) Reduction

After ROI extraction, FP reduction algorithm performs computation of proposed local sparse curvelet features (LSCF) followed by ANN classifier.

3.4.1. Proposed Algorithm

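LBP [43] was proposed as a descriptor computed on a circular neighbourhood; its uniform variant is called the uniform LBP (ULBP) descriptor. The LBP code is expressed as

$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\,2^{p},$$

where

$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0, \end{cases}$$

$g_c$ is the grey value of the centre pixel, and $g_p$ ($p = 0, \ldots, P-1$) are the grey values of the P neighbours on a circle of radius R.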

The computation of LBP based on the actual shape of the mass, according to the sparse matrix, is shown in Figure 6: pixels related to the shape of the mass are called foreground pixels and are kept, while the other pixels are called background pixels and are rejected. The proposed algorithm uses only foreground pixels for LBP computation, which reduces the number of pixels involved in LBP computations. Identification of foreground and background pixels is therefore an important step, and it is performed using a lookup table approach. The identification is based on the number of nonzero pixels in the lookup table: if the count of nonzero pixels in the sliding window is greater than 2, count(p(i, j)) > 2, the pixel is identified as foreground and its LBP is estimated. On the other hand, if the count of nonzero pixels in the sliding window is less than 2, count(p(i, j)) < 2, the pixel is identified as background, its LBP is not estimated, and it is rejected from the lookup table. Nonzero pixels provide the actual shape of the mass and are taken for LBP computations. A graphical representation of the proposed LBP descriptor computation using foreground pixels is given in Figure 7, and the algorithm is described in Algorithm 2.

 Input: I(m, n); m = no. of rows and n = no. of columns
 Output: LBP features
 Initialize: Radius R = 1 and neighborhood pixels P = 8
    Mask = [1 2 4 8 16 32 64 128 0]
    Sliding window coordinates: k = −1 : 1
    count = 1 //running index of foreground pixels
 for i = 1 to m do
    for j = 1 to n do
       //prepare local circular window
       I_local = I(i + k, j + k)
       center_pixel = I(i, j)
       //arrange local neighborhood of I(i, j) in a row, col = 1 : 9
       Lookup_table(i, col) = reshape(I_local, [1, 9])
       //count number of window pixels greater than zero
       a = length(find(Lookup_table(i, :) > 0))
       //select pixel position from lookup table for computation of LBP
       if a > 2
          LBP_code(count, :) = I_local > center_pixel
          count = count + 1
       end
    end
 end
 //compute histogram of LBP codes and normalize it (scale invariant)
 LBP_descriptor = LBP_descriptor/count
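For readers who prefer runnable code, the following Python sketch mirrors the idea of Algorithm 2 on a small 2-D list. It is not the authors' implementation: it skips border pixels, counts nonzero window values, uses the strict comparison written in Algorithm 2, and returns a plain 256-bin histogram rather than the 58 uniform-pattern bins used in the paper. The function name and the `min_nonzero` parameter are hypothetical.

```python
def sparse_lbp_histogram(image, min_nonzero=3):
    """Normalised 256-bin LBP histogram over foreground pixels only.

    A pixel is foreground when its 3 x 3 window holds at least min_nonzero
    non-zero values (that is, more than 2, as in Algorithm 2).
    Assumptions: border pixels are skipped, and a neighbour sets its bit
    when it is strictly greater than the centre.
    """
    # Neighbour offsets in clockwise order, paired with weights 1, 2, ..., 128.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    used = 0
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            if sum(1 for v in window if v != 0) < min_nonzero:
                continue  # background: no LBP code is computed
            centre = image[r][c]
            code = sum(1 << bit
                       for bit, (dr, dc) in enumerate(offsets)
                       if image[r + dr][c + dc] > centre)
            hist[code] += 1
            used += 1
    if used:
        hist = [h / used for h in hist]
    return hist, used

roi = [[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 5, 9, 0],
       [0, 0, 7, 8, 0],
       [0, 0, 0, 0, 0]]
hist, used = sparse_lbp_histogram(roi)
print(used)  # 4 of the 9 interior pixels are treated as foreground
```

In this toy ROI, five of the nine interior pixels are skipped as background, which is the source of the computational saving described above.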
3.4.2. The Fast Discrete Curvelet Transform (FDCT)

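The authors of [44] introduced the computationally simple and efficient Fast Discrete Curvelet Transform (FDCT). We prefer the wrapping-based FDCT approach in the proposed work, as it is faster. The curvelet coefficients, indexed by scale j, angle l, and spatial location k, can be written as

$$c(j, l, k) = \sum_{t_1, t_2} f[t_1, t_2]\,\overline{\varphi_{j,l,k}[t_1, t_2]},$$

where $f[t_1, t_2]$ is the input image (here, the ROI) and $\varphi_{j,l,k}$ is the digital curvelet waveform.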

Figure 8 illustrates LBP code computation based on sparse curvelet coefficients. Each ROI is decomposed using the curvelet transform with 16 orientations and 2 scales, as the minimum ROI size in the database is 25 × 22 pixels. This decomposition produces 17 subbands. The coefficients of each curvelet subband are then represented in a lookup table using a sliding window, and if a row in the lookup table identifies a foreground coefficient, LBP is computed with radius R = 1 and P = 8 neighbouring pixels as shown in Algorithm 2; in total, 58 LBP features are obtained from the foreground coefficients of each curvelet subband. Therefore, a total of 986 LBP features (17 × 58) are extracted from the 17 curvelet subbands. As can be observed from Figure 8, the curvelet subbands also provide the shape of the mass in 16 different directions, so that directional information can be associated with the LBP features. Kanadam et al. [3] used the concept of a sparse ROI; we have extended it to sparse curvelet subbands and LBP feature computation.
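The feature count quoted above can be checked with a few lines of Python. An 8-bit pattern is uniform when it has at most two 0/1 transitions read circularly, and there are exactly 58 such patterns. The split of the 17 subbands into one coarse subband plus 16 oriented ones is an assumption made here for illustration.

```python
def transitions(pattern, bits=8):
    """Number of 0/1 changes when the bit pattern is read circularly."""
    return sum(((pattern >> i) & 1) != ((pattern >> ((i + 1) % bits)) & 1)
               for i in range(bits))

uniform_patterns = [p for p in range(256) if transitions(p) <= 2]
subbands = 1 + 16  # assumption: one coarse subband plus 16 oriented subbands
features_per_roi = len(uniform_patterns) * subbands
print(len(uniform_patterns), features_per_roi)  # 58 986
```

This matches the 986-element feature vector used as the ANN input.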

3.5. Classification

In this work, we analyzed the ROIs extracted from mammograms for the normal-abnormal, benign-malignant, and normal-malignant class pairs with ANN, SVM, and KNN classifiers. A detailed description of the ANN classifier is given in [45, 46]. To evaluate the performance of the proposed system, we used 3-fold cross validation, where the database is randomly divided into three sets and the accuracy is calculated for each set. The final accuracy of the system is the average of the accuracies of the three sets. However, it would not be fair to compare the 3-fold cross validation results of the SVM and KNN classifiers with an ANN that is tested on only one split of the images (33% for training, 33% for testing, and 33% for validation). Thus, for a fair comparison, we trained the ANN, with an input layer of 986 neurons, over the same three sets used for SVM and KNN and calculated its average accuracy. The proposed false positive reduction algorithm is illustrated in Figures 9(a)–9(c). Algorithm 3 summarizes the flow of the proposed method for FP reduction in mammograms.
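The evaluation protocol can be sketched as a generic 3-fold loop in which every classifier sees the same three splits. The function names, the fixed random seed, and the toy majority-vote classifier below are placeholders for illustration; they stand in for the ANN, SVM, and KNN training routines, which are not reproduced here.

```python
import random

def three_fold_accuracy(samples, labels, train_and_score, seed=0):
    """Average accuracy over 3 folds; each fold is the test set exactly once.

    train_and_score(train_x, train_y, test_x, test_y) must return an accuracy
    in [0, 1]; it stands in for the ANN, SVM, or KNN training routine.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::3] for k in range(3)]
    accuracies = []
    for k in range(3):
        test_idx = set(folds[k])
        train = [i for i in indices if i not in test_idx]
        accuracies.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in folds[k]], [labels[i] for i in folds[k]]))
    return sum(accuracies) / 3.0

def majority_baseline(train_x, train_y, test_x, test_y):
    """Toy stand-in classifier: always predict the most common training label."""
    guess = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == guess) / len(test_y)

labels = [1] * 6 + [0] * 3
samples = list(range(9))
print(three_fold_accuracy(samples, labels, majority_baseline))
```

Passing the same seed for every classifier keeps the three splits identical, which is the condition for the fair comparison described above.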

(1)Load input image (img1)
(2)Apply CLAHE algorithm and obtain enhanced image (img2)
(3)Process img1 and img2 and obtain enhanced image using procedure given in Algorithm 1
(4)Remove pectoral muscle using proposed approach (Section 3.1.2)
(5)Extract neighbourhood features for each pixel and apply SOM clustering
(6)Obtain clustered image and separate out the tumorous cluster
(7)Extract detected regions, i.e., ROIs, from the clustered result
(8)Extract Sparse Curvelet Coefficients (Subband) up to 2 levels from each ROI
(9)Extract Sparse LBP code for each subband and obtain a combined feature vector for each ROI
(10)Classify each ROI into tumorous and nontumorous classes, i.e., TP and FP, respectively
(11)Map each TP region on original mammogram (img1)
(12)end

4. Experimental Results and Discussions

The proposed method has been tested and validated using three classifiers and three clinical mammographic image datasets.

4.1. Data Sets
4.1.1. Mammographic Image Analysis Society (MIAS) Database

The mini-MIAS [17] database consists of 322 mammograms, each having 1024 × 1024 pixels and annotated like background tissue character, class, severity, center of abnormality, and radius of circle for abnormality. This database includes 64 benign, 51 malignant, and 207 normal cases, which have been taken for experimentation.

4.1.2. Digital Database for Screening Mammography (DDSM)

The DDSM [19] dataset consists of 2500 studies and is composed of cranial-caudal (CC) and mediolateral-oblique (MLO) views of mammographic image for left and right breast, annotated with ACR breast density, type of abnormality, and ground truth. Randomly selected 150 abnormal and 100 normal cases from both HOWTEK and LUMISYS scanner of 12 bits per pixel resolution have been subjected for experimentation.

4.1.3. The Tata Memorial Cancer Hospital (TMCH)

This dataset [47] contains 360 full-field digital mammograms (FFDMs) comprising 180 CC views and 180 MLO views of the right and left breast, acquired from 90 randomly selected patients. It is composed of 180 verified malignant and 180 normal breast images. It uses the pathological data of biopsy-proven breast cancer patients, approved by the Institutional Research Ethics Committee of Tata Memorial Centre Hospital (TMCH), Mumbai, India. The ground truth on each abnormal mammogram was marked manually using the Histopathological Reports (HPR) of the respective patients and by an expert radiologist from TMCH, Mumbai. Approximately 35 patients were examined using the “Hologic Selenia System” (Scanner1), which gives 16-bit images.

The remaining 55 patients were examined with the “GE Medical Senograph System” (Scanner2), which provides 8-bit true color mammogram images in DICOM format of 4096 × 3328 or 2294 × 1914 pixels, with a pixel size of 50 × 50 μm².

4.2. Segmentation Evaluation and ROI Extraction

SOM segmentation detects suspicious regions; a detected mass region is considered a TP, whereas a detected nonmass region is taken as an FP. From Table 1, a total of 381 suspicious ROIs (TP and FP together) for MIAS, 1343 for DDSM, and 1009 for TMCH were taken to evaluate the proposed FP reduction algorithm.

Among the extracted ROIs, the minimum patch size is 25 × 22 pixels, whereas the maximum size is 1152 × 1356 pixels. Tables 2 and 3 list the curvelet coefficients of the 17 subbands together with the reduced coefficients, selected by the lookup table approach, that are used to calculate LBP features. It was observed during experimentation that the curvelet coefficients used for sparse LBP are reduced on average by 14%, 32%, 33%, and 34% for MIAS, DDSM, TMCH: Scanner1, and TMCH: Scanner2, respectively. The reduction in curvelet coefficients is not fixed for every ROI; it depends entirely on the shape of the ROI as captured by the sparse matrix. Tables 2 and 3 do not give the exact reduction for the complete database; they show the reduction for sample mammograms.

4.3. Classifier Evaluation and False-Positive Reduction

From Figures 10–13, the best classification accuracy of 98.57% is obtained for MIAS in benign versus malignant classification, whereas accuracies of 98.70% for DDSM, 98.30% for TMCH: Scanner1, and 100% for TMCH: Scanner2 are obtained in normal versus malignant classification. The classification performance of ANN is 6% to 43% better than that of the KNN classifier across the databases, whereas there is a modest improvement of about 7% over the SVM classifier. The performances of the proposed sparse LBP and of LBP computed on full curvelet subbands are nearly the same; therefore, the proposed algorithm can be implemented efficiently in a CAD system with fewer curvelet coefficients.

Data augmentation has been used for some classes to maintain balance between two classes, to improve performance, and to learn more powerful model. Table 4 explains the FP reduction with the use of curvelet-based LBP features and ANN. It has been observed that FP reduced from 0.85 to 0.02 FP/image in MIAS, 4.81 to 0.02 FP/image in DDSM and 2.32 to 0.13 FP/image in TMCH.

Similarly, Table 5 shows the reduction in FPs as 0.85 to 0.01 FP/image for MIAS, 4.81 to 0.03 FP/image for DDSM, and 2.32 to 0.00 FP/image for TMCH using sparse curvelet coefficient-based LBP features. The results show the effectiveness of sparse curvelet coefficient-based LBP and ANN. From Table 6, the best value of AUC = 0.99 is obtained in benign versus malignant classification for MIAS, AUC = 0.98 in benign versus malignant in case of DDSM, AUC = 0.94 in normal versus malignant in case of TMCH: Scanner1, and AUC = 0.96 in normal versus malignant classification in TMCH: Scanner2 using ANN and curvelet subband-based LBP features. The worst performance of AUC = 0.53 for MIAS is obtained with the proposed algorithm using KNN classifier as shown in Table 7. Similarly, from Table 7, the best value of AUC = 0.98 is obtained in TMCH: Scanner1, AUC = 1 is obtained in TMCH: Scanner2 database for normal versus malignant classification, AUC = 0.98 in benign versus malignant classification is attained in MIAS database, and AUC = 0.98 is achieved for normal versus malignant classification in DDSM database using ANN classifier for sparse curvelet subband-based LBP features.
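For reference, the AUC values reported above measure how often a classifier ranks an abnormal ROI above a normal one. A minimal way to compute it from classifier scores is the pairwise-comparison form shown below; the function name and the example scores are hypothetical and are not taken from the paper's experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores for four ROIs (1 = abnormal, 0 = normal).
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

An AUC of 1 means every abnormal ROI is scored above every normal ROI, and a value near 0.5 corresponds to chance-level ranking.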

However, from Table 7, it should be noted that the performance of proposed algorithm is the best using ANN classifier. Figure 14 represents automated CAD system for breast cancer diagnosis with sample mammograms.

Table 8 provides a comparative study of methods developed for breast tissue classification. The proposed method provides the best results in terms of AUC and reduction in the number of FPs: from 0.85 to 0.01 FP/image for MIAS, 4.81 to 0.03 FP/image for DDSM, and 2.32 to 0.00 FP/image for TMCH. Earlier reported work uses a fixed patch size-based approach, which limits the scope of an automatic CAD system, whereas the proposed system provides a complete CAD solution, from automatic tumor patch segmentation to FP reduction and final representation of the mammogram with the TP marked on it. This can substantially reduce the radiologist’s work by locating the tumor directly on the mammogram.

5. Conclusion

A fully automatic CAD system, which can accurately locate the tumor on a mammogram and reduces FPs, has been proposed. The developed CAD system consists of preprocessing, SOM clustering, ROI extraction, sparse LBP feature computation based on sparse Curvelet coefficients, and finally, FP reduction using ANN classifier.

The proposed algorithm presents a novel concept: curvelet coefficients are extracted according to the irregular shape of the mass (these are called sparse curvelet coefficients), and LBP is computed on them. The analysis shows that the FPs are reduced significantly, from 0.85 to 0.01 FP/image for MIAS, 4.81 to 0.03 FP/image for DDSM, and 2.32 to 0.00 FP/image for TMCH. The ANN classifier gave the best results compared with the SVM and KNN classifiers: AUC = 0.98 and accuracy = 98.57% for MIAS in benign-malignant classification, AUC = 0.98 and accuracy = 98.70% for DDSM in normal-malignant classification, and AUC = 0.98 and accuracy = 98.30% for TMCH: Scanner1 and AUC = 1 and accuracy = 100% for TMCH: Scanner2 in normal-malignant classification. The performance of LBP features and of LBP features based on sparse curvelet coefficients is nearly the same, which shows that the proposed algorithm is suitable for breast cancer tissue diagnosis.

In future, the reduced curvelet coefficients can be used to extract local ternary patterns, local directional patterns, and other local descriptors. The present work deals with mammograms containing a single mass; it can be extended to multiple-mass cases with multiple LBP features based on sparse curvelet coefficients.

Data Availability

In this research, we have used two publicly available datasets MIAS and DDSM. These datasets can be found here in [17] and [19]. The third database is collected from the local hospital Tata Memorial Cancer Hospital, Mumbai, which can be found at http://eureka.sveri.ac.in/ or available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The TMCH database for this work was given by Department of Radiodiagnosis, Tata Memorial Cancer Hospital, Mumbai.