Abstract

Historically, cancer classification by doctors and radiologists relied on morphological and clinical features and had limited diagnostic ability. The arrival of DNA microarray technology has made it possible to monitor thousands of gene expressions concurrently on a single chip, which has stimulated progress in cancer classification. In this paper, we propose a hybrid approach for microarray data classification based on k-nearest neighbor (KNN), naive Bayes, and support vector machine (SVM) classifiers. Feature selection prior to classification plays a vital role, and a feature selection technique that combines the discrete wavelet transform (DWT) and a moving window technique (MWT) is used. The performance of the proposed method is compared with that of the conventional classifiers: support vector machine, nearest neighbor, and naive Bayes. Experiments have been conducted on both real and benchmark datasets, and the results indicate that the ensemble approach produces higher classification accuracy than the conventional classifiers. The proposed approach can serve as an automated system for cancer classification that doctors can apply in real cases, which would benefit the medical community. It also reduces the misclassification of cancers, which is highly undesirable in cancer detection.

1. Introduction

Despite the skill of doctors and radiologists, cancers such as breast, lung, and colon cancer can be missed when imaging modalities are used for detection. The reasons include technical issues in capturing the images, unobservable abnormalities, and misinterpretation of abnormalities. We therefore propose an automated classification system that uses microarray data to reduce diagnostic error. The proposed system can distinguish between normal and abnormal cases of different cancers based on the discrete wavelet transform (DWT).

A microarray is a technology for measuring the expression levels of tens of thousands of genes in parallel on a single chip [13]. Each chip is about 2 cm by 2 cm, and microarrays contain up to 6000 spots. Microarray technologies include serial analysis of gene expression (SAGE), nylon membrane, and Illumina bead array [4]. Microarrays thus offer an efficient method of gathering data that can be used to determine the expression pattern of thousands of genes. Gene expression data is represented as a matrix in which rows represent genes and columns represent samples or observations. The high dimensionality of gene expression data is a major challenge in most classification problems. A large number of features (genes) against a small sample size and redundancy in the expression data are the two main causes of poor classification accuracy.

Several classification techniques have been employed in the past to deal with microarray data. Reference [5] used a weighted voting scheme, [6] applied support vector machines, and [7] explored several support vector machine techniques, nearest neighbor classifier, and probabilistic neural networks. It has been found that no classification algorithm performs well on all datasets and hence the exploration of several classifiers is useful [8].

Feature selection prior to classification is an essential task. Feature selection methods [9] remove irrelevant and redundant features to improve classification accuracy. Transformation methods such as independent component analysis [10] and wavelet analysis [11] have also been applied in the past to reduce the dimension of the data. In [12], a feature extraction method based on the discrete wavelet transform (DWT) is proposed. The approximation coefficients of the DWT, together with some useful features from the high frequency coefficients selected by the maximum modulus method, are used as features. A novel way to think of microarray data is as a set of signals. The number of genes is the length of the signals, and hence signal processing techniques such as the wavelet transform can be used to perform microarray data analysis.

This paper deals with the classification of microarray data into normal or abnormal based on the discrete wavelet transform (DWT). DWT is an important multiresolution analysis tool that has been commonly applied to signal processing, image analysis, and various classification systems [13]. A novel moving window technique (MWT) is applied for feature extraction, and a hybrid classifier based on nearest neighbor (NN), naive Bayes, and support vector machine (SVM) classifiers is used for classification. The proposed methods are implemented in MATLAB and their performance is analyzed.

The rest of the paper is organized as follows. The theoretical framework of microarrays and wavelet transforms is discussed in Section 2. Section 3 describes the development methodology for the classification of microarray data. The classification is achieved by extracting wavelet features based on the proposed MWT. The classifiers used in the proposed method are NN, naive Bayes, and SVM. The datasets are described in Section 4. Section 5 presents and discusses the experimental results of the proposed method, and we draw our conclusions in Section 6.

2. Background

2.1. Microarray

A technique called microarray can generate a large amount of data useful for solving many biological problems. A microarray measures the level of activity of thousands of genes concurrently. If a gene is overexpressed, too much protein is produced, which suggests that the gene is abnormal. Microarrays can detect much smaller changes than karyotypes. In recent years, microarrays have been applied to disease classification. Gene expression data is data rich and information poor. Public microarray databases include NCBI, Genbank, Array Express, Gene Expression Omnibus, and Stanford Microarray [14]. Microarray platforms include Agilent, Affymetrix, and Illumina bead array. Table 1 depicts the gene expression matrix, in which each cell represents the normalized gene expression level.

2.1.1. Steps in Microarray

Listed below is the protocol for microarray technology.
(1) Collect samples from healthy and cancer patients and keep the samples in two separate tubes.
(2) Isolate RNA, followed by mRNA isolation, from the healthy and cancerous samples.
(3) Convert to complementary DNA (cDNA), with healthy cDNA labeled green and cancerous cDNA labeled red.
(4) Apply cDNA from both samples to the microarray.
(5) Scan the microarray to reveal the difference between infected and uninfected cells.
The red spots represent genes that are turned up and the green spots represent genes that are turned down. More mRNA is produced in red spots and less mRNA is produced in green spots. Red spots represent cancer cells, green spots represent normal cells, and yellow spots represent genes that are expressed in both cancer and healthy patients.

2.2. Discrete Wavelet Transform

The discrete wavelet transform is a signal processing method by which gene expression data is processed. The wavelet transform is used in gene expression analysis because of its multiresolution approach to signal processing. In this approach, microarray data is transformed into the time-scale domain and the result is used as classification features. Gene expression data is represented by a matrix in which rows represent genes and columns represent samples. Since each sample contains thousands of genes, the number of genes can be viewed as the length of the signals. Hence signal processing techniques can be used for microarray data analysis. Wavelets are a family of basis functions. Symlet, coiflet, Daubechies, biorthogonal, and reverse biorthogonal wavelets are the wavelet families already in use [11]. They differ in basic properties such as compactness, smoothness, fast implementation, and orthonormality. One of the key advantages of wavelets is their ability to adapt spatially to features of a function such as discontinuities and varying frequency behavior. Compact support means that wavelets are localized; that is, a region of the data can be processed without affecting the data outside this region.
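As an illustration of the multiresolution idea described above, the following is a minimal Python sketch of a multilevel Haar transform applied to one sample treated as a signal. The Haar wavelet, the function names `haar_step` and `haar_dwt`, and the eight-value expression profile are assumptions made for illustration only; the paper itself is implemented in MATLAB and uses other wavelet families.

```python
import math


def haar_step(signal):
    """One Haar level: return (approximation, detail), each half the input length."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multilevel Haar DWT: final approximation plus detail bands, coarsest first."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)
    return approx, details


# Hypothetical expression profile of one sample across 8 genes.
sample = [2.0, 4.0, 6.0, 8.0, 1.0, 1.0, 5.0, 3.0]
approx, details = haar_dwt(sample, levels=2)
print("approximation:", approx)
print("details (coarsest first):", details)
```

Each level halves the length of the approximation, so the same sample is described at progressively coarser scales. Because the Haar transform is orthonormal, the sum of squared coefficients equals the sum of squared input values.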

3. The Proposed Method

The main objective of the proposed classification system is to distinguish between the normal and abnormal microarray data of different types of cancer. Five real world microarray datasets used by many researchers are taken to evaluate this study. The two different stages involved in the proposed classification system are feature extraction and classification. They are discussed in the following sections. Figure 1 shows the block diagram of the proposed system.

3.1. Feature Selection

A novel feature extraction technique based on DWT and MWT is proposed. Feature extraction reduces the amount of resources required to describe a large set of data accurately. DWT can be used for high-dimensional data analyses, such as image processing and image data analysis. The proposed method rearranges the data by applying a threshold to the wavelet coefficients obtained with DWT and then calculates the approximate value of the raw data by applying an inverse function to the transformed data. After the t-test, the feature selection method is applied to the selected features at the same approximation value. After the wavelet coefficients inside the window are ranked, the top ranked wavelet coefficient is selected as the dominant feature of that window. The same process is applied to all the windows placed on the given microarray data, and the resulting top ranked coefficients are stored in the database for further classification. The DWT-based algorithm used to perform effective feature selection [9] is shown in Algorithm 1.
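The per-window ranking described above can be sketched in Python as follows. This is a minimal sketch under several assumptions that the text does not specify: windows are taken as non-overlapping, the ranking score is the absolute value of Welch's two-sample t statistic, and the names `t_statistic` and `top_feature_per_window` as well as the small coefficient tables are hypothetical.

```python
import math
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Welch's two-sample t statistic (assumed variant of the t-test)."""
    var_a, var_b = variance(group_a), variance(group_b)
    denom = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    if denom == 0.0:
        return 0.0
    return (mean(group_a) - mean(group_b)) / denom


def top_feature_per_window(normal, cancer, window_size):
    """Return, for each non-overlapping window, the index of the top ranked coefficient.

    normal, cancer: lists of samples; each sample is a list of wavelet coefficients.
    """
    n_coeffs = len(normal[0])
    selected = []
    for start in range(0, n_coeffs - window_size + 1, window_size):
        scores = []
        for j in range(start, start + window_size):
            t = t_statistic([s[j] for s in normal], [s[j] for s in cancer])
            scores.append((abs(t), j))
        # The coefficient with the largest |t| is the dominant feature of the window.
        selected.append(max(scores)[1])
    return selected


# Hypothetical coefficients: 3 normal and 3 cancer samples, 4 coefficients each.
normal = [[1.0, 5.0, 2.0, 0.1], [1.1, 5.2, 2.1, 0.2], [0.9, 4.9, 1.9, 0.3]]
cancer = [[1.0, 9.0, 2.0, 3.1], [1.2, 9.1, 2.2, 3.0], [0.8, 8.8, 1.8, 3.3]]
selected = top_feature_per_window(normal, cancer, window_size=2)
print("selected coefficient indices:", selected)
```

With one coefficient kept per window, the number of features is the number of windows, which is how a larger window size leads to a smaller feature set.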

Algorithm 1: DWT-based feature selection.
Input: dataset
Output: A subset (selected features by filter method) of the dataset
Method:
(1) Decomposition Step (a: array[1,…, n] of reals)
(2) for i = 1 to n/2;
(3) s[i] = (a[2i − 1] + a[2i])/sqrt(2); //scaling coefficient
(4) d[i] = (a[2i − 1] − a[2i])/sqrt(2); //detail coefficient
(5) end for
(6)
(7) Threshold application
(8) if w[i] <= 0.5 then w[i] = 0; // w: wavelet coefficient, threshold value = 0.5
(9) else w[i] = w[i];
(10) Approximation value using the inverse function
Wavelet inverse = new wavelet(“scaling coef”, “detail coef”);
(11) Feature selection using the filter method
t-test; //rank of t-test

The algorithm works as follows. First, the window size is defined. Then, for each window, the wavelet transform is applied and the level of decomposition is defined. The transform divides the window into two subbands, namely, the scaling coefficients and the detail coefficients, which are together called wavelet coefficients. The data is rearranged by applying a threshold to the wavelet coefficients, and the approximate value of the raw data is then calculated by applying an inverse function to the transformed data. A t-test is applied to select the top ranking features (genes). The complete procedure is given in Algorithm 1.
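The decomposition, thresholding, and inverse steps described above can be sketched in Python for a single window. This is an illustrative sketch rather than the authors' MATLAB implementation: it assumes a one-level Haar transform on an even-length window, applies the 0.5 threshold to the magnitude of the detail coefficients only, and uses hypothetical function names and window values.

```python
import math

ROOT2 = math.sqrt(2.0)


def haar_forward(values):
    """One-level Haar decomposition of an even-length window."""
    scaling = [(values[i] + values[i + 1]) / ROOT2 for i in range(0, len(values) - 1, 2)]
    detail = [(values[i] - values[i + 1]) / ROOT2 for i in range(0, len(values) - 1, 2)]
    return scaling, detail


def hard_threshold(coeffs, threshold=0.5):
    """Zero every coefficient whose magnitude is <= threshold (magnitude is an assumption)."""
    return [0.0 if abs(c) <= threshold else c for c in coeffs]


def haar_inverse(scaling, detail):
    """Rebuild the window from its scaling and detail coefficients."""
    values = []
    for s, d in zip(scaling, detail):
        values.append((s + d) / ROOT2)
        values.append((s - d) / ROOT2)
    return values


# Hypothetical window of four expression values.
window = [4.0, 4.2, 7.0, 1.0]
scaling, detail = haar_forward(window)
approx = haar_inverse(scaling, hard_threshold(detail))
print("approximate values:", approx)
```

In this example the small difference between the first two values falls below the threshold and is smoothed to their average, while the large difference between the last two values is preserved. The t-test ranking would then operate on the resulting values.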

3.2. Classification

Microarray data is classified into normal and abnormal by a hybrid classifier based on NN, naive Bayes, and SVM. In the classification stage, the same kinds of features are extracted and compared with the references obtained in the training stage. Cross-validation is used to estimate how a machine learning algorithm will perform when faced with unfamiliar data. It is intended to reduce the error associated with one of the pitfalls of machine learning, in which a hypothesis is tested on the same data that was used to form it. In K-fold cross-validation the data is randomly divided into K partitions. Data in one partition is used for testing and the remaining partitions are used for training. This means that training must be repeated K times, once for each partition that is tested. To evaluate the robustness of the proposed system, threefold cross-validation is used. Each fold is used to test the accuracy of one classifier. The classification accuracy obtained for each classifier is used as that classifier's weight when the classifiers are hybridized.
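The fold construction and the accuracy-weighted combination described above can be sketched in Python as follows. The text does not state how the weighted classifiers are combined, so weighted voting is an assumption here, as are the function names `k_fold_indices` and `weighted_vote` and the weight values, which are hypothetical and not taken from the paper's results.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split the sample indices into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def weighted_vote(predictions, weights):
    """Combine per-classifier labels; each vote counts with that classifier's weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Threefold split of 62 samples (the size of the colon dataset).
folds = k_fold_indices(n_samples=62, k=3)

# Hypothetical fold accuracies used as weights for NN, naive Bayes, and SVM.
weights = [0.6, 0.7, 0.8]
label = weighted_vote(["normal", "cancer", "cancer"], weights)
print("fold sizes:", [len(f) for f in folds])
print("hybrid label:", label)
```

Under this assumed scheme a classifier that performed better on its test fold has more influence on the final label, and a single strong classifier can be outvoted only when the combined weight of the others exceeds its own.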

4. Datasets

To demonstrate the efficiency of our method, the proposed method is evaluated on five gene expression datasets used in the literature.
(1) Colon dataset: the colon dataset [15] is derived from colon cancer patient samples. It consists of the expression levels of 1909 genes of 62 patients, among which 40 are colon cancer cases and 22 are normal cases.
(2) Ovarian dataset: the ovarian dataset [16] often serves as a benchmark for microarray data analysis in the literature. The dataset includes 91 controls (normal) and 162 ovarian cancers. There are a total of 15,154 genes.
(3) CNS dataset: the CNS dataset [15] contains 60 patient samples, of which 39 are normal cases and 21 are cancer cases. There are 7129 genes in the dataset.
(4) Leukemia dataset: the leukemia dataset [6] is another dataset widely used in the literature and is taken as our benchmark dataset. It contains the expression levels of 7129 genes taken from 72 samples. Labels indicate that there are 47 cancer cases and 25 normal cases.
(5) Breast dataset: the breast dataset [17] consists of 97 samples, of which 51 are normal cases and 46 are cancer cases. A large number of features against a small sample size is the trademark of the breast cancer dataset.
Table 2 shows the summary of these datasets.
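The figures quoted above can be collected into a small Python structure to make the imbalance between the number of genes and the number of samples explicit. The dictionary layout and the name `summarize` are illustrative choices; the gene count of the breast dataset is not stated in the text and is therefore left as `None`.

```python
# Counts copied from the dataset descriptions in this section.
DATASETS = {
    "colon": {"genes": 1909, "cancer": 40, "normal": 22},
    "ovarian": {"genes": 15154, "cancer": 162, "normal": 91},
    "CNS": {"genes": 7129, "cancer": 21, "normal": 39},
    "leukemia": {"genes": 7129, "cancer": 47, "normal": 25},
    "breast": {"genes": None, "cancer": 46, "normal": 51},  # gene count not stated
}


def summarize(datasets):
    """Compute the sample count and the genes-per-sample ratio of each dataset."""
    rows = {}
    for name, info in datasets.items():
        samples = info["cancer"] + info["normal"]
        ratio = None if info["genes"] is None else info["genes"] / samples
        rows[name] = {"samples": samples, "genes_per_sample": ratio}
    return rows


summary = summarize(DATASETS)
for name, row in summary.items():
    print(name, row)
```

Every dataset with a known gene count has many more genes than samples, which is the high-dimensionality problem that motivates feature selection before classification.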

5. Results and Discussion

In this section, the experimental results and their implications are discussed, and the performance of the proposed classification system for cancerous microarray data based on DWT and the hybrid classifier is explained. To evaluate the performance of the proposed system, computer simulations and experiments with microarray data are performed. The system is implemented in MATLAB version 7.6. Training and testing are run on a standard PC (1.66 GHz Intel processor, 1 GB of RAM) running Windows XP. The metric used to analyze the performance of the proposed system is classification accuracy. In this study, four datasets with a large number of genes and one dataset with a smaller number of genes are considered.

Because the microarray datasets have different numbers of samples, 60% of the cases in each dataset are selected for training the classifier and the remaining 40% are used for classification with the hybrid classifier. The robustness of the proposed system is evaluated by changing three parameters, namely, the type of wavelet, the decomposition level of the DWT, and the window size of the proposed MWT. The wavelets used are Daubechies 7 (db7), coiflet 2 (coif2), biorthogonal 2.2 (bior2.2), Symlet 2 (sym2), and reverse biorthogonal 2.2 (rbio2.2) (Coifman and Wickerhauser, 1992). The Daubechies family bases are chosen because of their properties of compact support and orthonormality. The biorthogonal wavelets are chosen for their property of exact reconstruction. The window sizes used in the proposed MWT are 32, 64, 128, 256, 512, and 1024. The performance analysis runs from one-level decomposition up to the maximum level of decomposition for the window size using the given wavelet.
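The parameter sweep and the 60/40 split described above can be sketched in Python as follows. Two assumptions are made that the text does not confirm: the maximum decomposition level of a window is taken as the number of times it can be halved (log2 of the window size), and the training count is obtained by rounding 60% of the samples to the nearest integer. The function names are hypothetical.

```python
import math
from itertools import product

WAVELETS = ["db7", "coif2", "bior2.2", "sym2", "rbio2.2"]
WINDOW_SIZES = [32, 64, 128, 256, 512, 1024]


def max_level(window_size):
    """Assumed maximum level: how many times the window can be halved."""
    return int(math.log2(window_size))


def parameter_grid():
    """Yield every (wavelet, window size, decomposition level) combination."""
    for wavelet, window in product(WAVELETS, WINDOW_SIZES):
        for level in range(1, max_level(window) + 1):
            yield wavelet, window, level


def split_counts(n_samples, train_fraction=0.6):
    """Return (training count, testing count); rounding is an assumption."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train


grid = list(parameter_grid())
print("number of configurations:", len(grid))
print("colon train/test counts:", split_counts(62))
```

Enumerating the grid this way makes it straightforward to record, for each dataset, the wavelet, window size, and level at which the highest accuracy is obtained.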

Tables 3, 4, 5, 6, and 7 show the window sizes and the various levels in which maximum accuracy is achieved for the proposed system using different types of wavelets for the five benchmark datasets used in our study.

Table 3 compares the classification performance of the individual classifiers with that of the hybrid classifier for the colon dataset using db7. The statistically significant results are in bold face. It is observed that KNN, Bayes, and SVM produce accuracies below 76%, whereas the proposed hybrid classifier reaches 100% accuracy on the same dataset with a minimal number of features (four) at a window size of 512.

For the breast dataset, the individual classifiers perform considerably worse than our hybrid classifier. The maximum accuracy of an individual classifier is 64.44%, obtained by KNN at level 1 with a window size of 128. Our ensemble approach produces 100% accuracy with just 24 features at level 4 using a window size of 1024. Tables 5 to 7 likewise show that the hybrid classifier gives the maximum accuracy compared with the individual classifiers.

Table 8 shows the parameter value at which the proposed MWT produces the maximum classification accuracy without any misclassification. Figures 2, 3, 4, 5, and 6 show the classification accuracy of the proposed system for breast, colon, ovarian, CNS, and leukemia microarray dataset using the best window size that achieves 100% classification accuracy.

From the figures and Table 8, it is observed that the proposed MWT produces better results, with no misclassification, while using a minimum number of features for the classification.

6. Conclusion

In this paper we have proposed a new technique for the classification of microarray data based on DWT. The multiresolutional representation of microarray data is achieved by DWT inside a window of predefined size as test pattern. The performance of the proposed system is assessed by means of extensive computational tests concerning the classification of five cancer microarray datasets: breast, colon, ovarian, CNS, and leukemia. Experimental results show that the proposed method is successful in classifying the microarray data and the successful classification rate is 100% for all five microarray datasets. It is observed from the results that DWT emerges as a potentially dominant feature extraction technique for microarray data classification.

Disclosure

Chilambuchelvan Arul Ganaprakasam and Kannan Arputharaj are coauthors of this paper.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.