Advances in Fuzzy Systems

Special Issue

Forefront of Fuzzy Logic in Data Mining: Theory, Algorithms, and Applications


Research Article | Open Access

Volume 2016 | Article ID 6134736 | 19 pages | https://doi.org/10.1155/2016/6134736

An Improved Fuzzy Based Missing Value Estimation in DNA Microarray Validated by Gene Ranking

Academic Editor: Gözde Ulutagay
Received 22 Mar 2016
Accepted 16 Jun 2016
Published 18 Jul 2016

Abstract

Most gene expression data analysis algorithms require the entire gene expression matrix without any missing values. Hence, it is necessary to devise methods that impute missing data values accurately. A number of imputation algorithms exist to estimate those missing values. This work starts with a microarray dataset containing multiple missing values. We first apply a modified version of the existing fuzzy theory based method LRFDVImpute to impute multiple missing values of time series gene expression data and then validate the result of imputation by a genetic algorithm (GA) based gene ranking methodology along with some regular statistical validation techniques, such as the RMSE method. To the best of our knowledge, gene ranking has not yet been used to validate the result of missing value estimation. Firstly, the proposed method has been tested on the very popular Spellman dataset, and the results show that error margins have been drastically reduced compared to some previous works, which indirectly validates the statistical significance of the proposed method. Then it has been applied to four other 2-class benchmark datasets, namely, the Colorectal Cancer tumours dataset (GDS4382), Breast Cancer dataset (GSE349-350), Prostate Cancer dataset, and DLBCL-FL (Leukaemia), for both missing value estimation and gene ranking, and the results show that the proposed method can reach 100% classification accuracy with very few dominant genes, which indirectly validates the biological significance of the proposed method.

1. Introduction

Microarray expression analysis is a widely used technique for profiling mRNA expression. The mRNA carries genetic information from DNA to the ribosome, where it specifies the amino acid sequence of the protein products of gene expression. Microarray datasets often contain missing values, which may occur for various reasons, including imperfections in data preparation steps (e.g., poor hybridization and chip contamination by dust and scratches) that create erroneous and low-quality values; these are usually discarded and referred to as missing. It is common for gene expression data to contain at least 5% missing values [1]. Most microarray data analysis algorithms, such as gene clustering, disease (experiment) classification, and gene network design, require complete information, that is, the entire gene expression matrix without any missing values. Hence, imputation techniques that accurately impute multiple missing data values are needed. Numerous imputation algorithms have been proposed to estimate the missing values. We first apply a modified version of our existing imputation technique LRFDVImpute [2], which first finds a subset of similar genes using the fuzzy difference vector (FDV) algorithm used in [3], where gene expression profiles are treated as continuous time series curves, and then uses linear regression on that subset to estimate the missing value. We have considered estimating only those genes with one, two, or three missing values, since these genes constitute 5–10% of the entire dataset. The absolute error is calculated as the difference between the original value and the estimated value. The Root Mean Square Error (RMSE) of those absolute errors is then determined.

The workflow for the first phase has been shown in Figure 1.

After that, we rank those genes to find the top ranked genes [4]. We have used a hypothesis test, the Wilcoxon rank sum test [5], to sort the features (genes), rank them in order of their values, and select the top pop_size genes from them, thereby reducing the dimensionality, where pop_size is the population size used later for the GA. The reduced set of genes is then ranked by our GA method. The two ranks, one by the Wilcoxon method and the other by our GA method, are then compared. The top $n'$ genes (the value of $n'$ is defined by the user) selected by our method are then used for classification using support vector machine (SVM) classifiers. The performance of classification justifies the efficiency of the ranking method used. Figure 2 shows the workflow for this phase.

Once this is done, we then forcibly make some cells missing in the top ranked genes and again estimate them using the same missing value estimation technique. Finally, we rank them once more to find the top ranked genes. Results show that most of the top ranked genes remain the same, which validates the proposed missing value estimation technique biologically as far as the estimation is concerned.

2. Present State of the Art

As discussed earlier, various statistical and analytical methods used for gene expression analysis are not robust to missing values and require the complete gene expression matrix to provide accurate results. Hence, it is necessary to devise accurate methods that impute data values when they are missing. Many imputation methods have been proposed. The earliest methods, row averaging and filling with zeroes, fill in the gaps for the missing values in the gene dataset with the row average or with zeroes, respectively.
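As a point of reference for the baseline mentioned above, row averaging can be sketched in a few lines. This is only an illustration: the function name `row_average_impute`, the use of `None` to mark a missing cell, and the fallback to zero for a fully missing row are assumptions made for this example.

```python
def row_average_impute(matrix):
    """Fill each missing cell (None) with the mean of the observed cells in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumption for this sketch: a row with no observed value falls back to 0.0.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


genes = [[1.0, None, 3.0], [None, None, None]]
print(row_average_impute(genes))
```

Such a baseline ignores any similarity between genes, which is the gap the neighbour-based methods below try to close.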

The KNNImpute method proposed in [1] selects genes with expression profiles similar to the gene of interest to impute missing values. After experimenting with a number of metrics to calculate the gene similarity, such as Pearson correlation, Euclidean distance, and variance minimization, it was found that Euclidean distance was a sufficiently accurate norm.

The SVDImpute method, proposed in [1], uses Singular Value Decomposition of matrices to estimate the missing values of a DNA microarray. This method works by decomposing the gene data matrix into a set of mutually orthogonal expression patterns that can be linearly combined to approximate the expression of all genes in the dataset. These patterns, which in this case are identical to the principal components of the gene expression matrix, are further referred to as eigengenes [6, 7].

Another method, named LLSImpute [8], represents a target gene with missing values as a linear combination of similar genes. The similar genes are chosen as the $k$-nearest neighbours or as coherent genes that have large absolute values of correlation coefficients, followed by least squares regression and estimation.

BPCAImpute method, proposed in [9], uses a Bayesian estimation algorithm to predict missing values. BPCA suggests using the number of samples minus 1 as the number of principal axes. Since BPCA uses an EM-like repetitive algorithm to estimate missing values, it needs intensive computations to impute missing values.

Another algorithm for time series gene expression analysis is presented in [10] that permits the principled estimation of unobserved time points, clustering, and dataset alignment. Each expression profile is modelled as a cubic spline (piecewise polynomial) that is estimated from the observed data and every time point influences the overall smooth expression curve. The alignment algorithm uses the same spline representation of continuous time series gene expression profiles.

The FDVImpute method, proposed in [11], incorporates fuzziness to estimate the missing values of a DNA microarray. The first step selects the nearest (most similar) genes to the target gene (the gene with a missing component) using the fuzzy difference vector algorithm. In the second step, the missing cell is estimated using a least squares fit on the selected genes.

FDVSplineImpute, presented in [3], takes into account the time series nature of gene expression data and permits the estimation of missing observations using B-splines of similar genes from fuzzy difference vectors.

Another method, LRFDVImpute, proposed in [2], estimates multiple missing observations by first finding the genes most similar to the target gene and then applying linear regression on those similar genes. This approach works in two stages. At the first stage, it estimates the real missing cells of the SPELLMAN_COMBINED dataset. At the later stage, it deliberately marks some cells of the same dataset as missing and then, using the estimated results from the first stage, estimates those cells with the same approach. The absolute error is calculated as the difference between the original value and the estimated value. The Root Mean Square Error (RMSE) of those absolute errors is then determined.

Extracting relevant information from microarray data is also difficult because of the inherent characteristics of the datasets, where there are thousands of variables (genes) and very few samples. Finding the set of significant genes or, in other words, the most differentially expressed genes, by studying data from tissues affected or unaffected by cancer cells, is an important task. This problem can be termed gene selection. Several techniques have been used to rank genes and find the most significant ones.

In [12], the algorithm used discriminant partial least squares (DPLS) and fuzzy clustering methods to interpret the gene expression patterns of acute leukemia and identify leukemia subtypes.

In [13], the proposed method used the Mann-Whitney test and the $k$-sample Kruskal-Wallis ANOVA test to rank genes. Dimension reduction was done using $k$-means clustering and PCA, and classification was performed using an ANN trained during 8-fold cross-validation with recursive feature elimination (RFE) and leave-one-out testing.

In [14], the algorithm proposed a gene selection method based on Wilcoxon rank sum test and SVM. Wilcoxon rank sum test was used to select a subset of genes and then each selected gene is trained and tested using SVM classifier with linear kernel separately, and genes with high testing accuracy rates were chosen to form the final reduced gene subset. Classification was performed on two datasets: Breast Cancer [15] and ALL/AML Leukemia [16] using leave-one-out cross-validation (LOOCV).

A hybrid GA/SVM approach is proposed for gene selection in [17], where a fuzzy logic based preprocessing tool is used to reduce dimensionality, a GA is used to find the most frequent genes, and an SVM classifier is used for classification. Experiments were performed on two well-known cancer datasets, Leukemia [16] and Colon [18], and results were compared with six other methods.

A multiobjective genetic approach is proposed in [19] for simultaneous clustering and gene ranking where a method to simultaneously optimize the feature ranking and clustering has been used. NSGA-II (Nondominated Sorting Genetic Algorithm-II) [20] has been used as a multiobjective evolutionary algorithm to optimize the chromosomes.

In [21], the proposed algorithm uses feature selection method based on genetic algorithms (GAs) and classification methods focusing on constructive neural networks (CNNs), C-Mantec. Several comparison results on six public cancer databases are provided using other feature selection strategy (Stepwise Forward Selection method) and different classification techniques (LDA, SVM, and Naive Bayes).

A PSO based graph theoretic approach, proposed in [22], is used for identifying the nonredundant gene markers from microarray gene expression data. The microarray data is first converted into a weighted undirected complete feature graph where the nodes represent the genes having gene’s relevance as node weights and the edges are weighted in order of correlation among the genes. The densest subgraph having minimum average edge weight (similarity) and maximum average node weight (relevance) is then identified from the original feature graph. Binary particle swarm optimization is then applied for minimizing the average edge weight (correlation) and maximizing the average node weight (gene relevance) through a single objective function.

A web based tool DWFS, proposed in [23], is used to select significant features for a variety of problems efficiently. The search strategy is implemented using Parallel Genetic Algorithm. DWFS also applies various filtering methods as a preprocessing step in the feature selection process. It also uses three classifiers, like KNN classifier, Naive Bayes Classifier, and the combination of these two. Experiments using datasets taken from different biomedical applications show the efficiency of DWFS and lead to a significant reduction of the number of features without sacrificing performance as compared to several widely used existing methods.

3. Proposed Method

3.1. Missing Value Estimation Using Linear Regression

This phase of the work modifies an existing method, LRFDVImpute, for estimating missing values present in the microarray dataset using linear regression. The earlier version of LRFDVImpute inserts the newly estimated gene into the training data after each target gene is estimated, so that the newly estimated gene is taken into consideration while estimating the next target gene. This process risks increasing the error in subsequent genes, since the estimation error is compounded from one target gene to the next. To overcome this problem, modified LRFDVImpute does not add the target gene to the training data after it has been estimated. This way, the training gene set size remains constant, and as the membership value $\theta$ increases, the size of the training data reduces. The effects of the modifications have been studied and the results are shown in the experimental results section. In our problem, the genes with missing values in the gene expression dataset (a matrix with one row per gene and one column per sample) are to be estimated. The method of finding a similar gene with the fuzzy difference vector (FDV) algorithm, as used in [3], is described below.

Target Row/Testing Data. The row whose missing value is being estimated: a target row may have multiple missing values but in a single run, a single value is estimated.

Similar Rows/Training Data. The rows that are similar to the target row: in this case only those rows that have no missing values are selected. Before the similarity measures are applied, all columns that correspond to missing values in the target row are removed from the complete matrix.

Let $G = \{g_1, g_2, \ldots\}$ be the set of genes in the dataset, and let the $t$th gene $g_t$ be the target gene, that is, the gene with missing values. We remove the columns having missing values in the target gene from the entire dataset. Let the resultant matrix contain $p$ columns. Each target gene is compared with each of the similar rows in the dataset. For the $i$th gene $g_i = (g_i^1, \ldots, g_i^p)$, the difference vector $V_i$ of $g_i$ is calculated as follows:

$$V_i = (v_i^1, \ldots, v_i^{p-1}), \qquad v_i^k = g_i^{k+1} - g_i^k, \quad k = 1, \ldots, p-1. \tag{1}$$

Once the difference vectors are calculated for each of the target rows and the similar rows, say $V_t$ (for the target row) and $V_i$ (for a similar row), we then calculate $M_{ti}$, the number of matches between the difference vectors $V_t$ and $V_i$, for each target gene $g_t$. A match in the $k$th component of the vectors $V_t$ and $V_i$ is determined by whether the signs of $v_t^k$ and $v_i^k$ are the same or not. $M_{ti}$ defines the degree of match between the distribution of the target gene and the similar gene. We then define a membership grade for $g_i$ as follows:

$$\mu(g_i) = \frac{M_{ti}}{p-1}. \tag{2}$$

The genes in the training data that have a membership value greater than a chosen membership grade $\theta$ are considered to be part of the similar genes.
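The similarity step can be illustrated with a small sketch. It rests on two assumptions that are not spelled out in the surviving text: the difference vector is taken as the differences between consecutive expression values, and the membership grade is taken as the fraction of components whose signs agree. The names `difference_vector`, `membership_grade`, and `similar_genes` are hypothetical.

```python
def difference_vector(profile):
    """Differences between consecutive expression values (assumed definition)."""
    return [b - a for a, b in zip(profile, profile[1:])]


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def membership_grade(target, candidate):
    """Fraction of difference-vector components whose signs agree (assumed definition)."""
    dt, dc = difference_vector(target), difference_vector(candidate)
    if not dt:
        return 0.0
    matches = sum(1 for a, b in zip(dt, dc) if sign(a) == sign(b))
    return matches / len(dt)


def similar_genes(target, training_rows, theta):
    """Indices of training rows whose membership grade exceeds the threshold theta."""
    return [i for i, row in enumerate(training_rows)
            if membership_grade(target, row) > theta]


# Columns that are missing in the target are assumed to be removed already.
target = [1.0, 2.0, 3.0, 2.0]
training = [[0.0, 5.0, 6.0, 1.0], [4.0, 3.0, 2.0, 1.0]]
print(similar_genes(target, training, theta=0.55))
```

Only the direction of change between time points matters here, so two genes with very different magnitudes can still be judged similar.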

The steps for estimation can be summarized as follows:
(1) Load the dataset with missing values.
(2) Count the missing columns for each gene and start with the first row having the least number of missing values (for our dataset it is 1).
(3) Compute the corresponding membership grade for the target gene from the training data using the FDV algorithm as shown above.
(4) Estimate the missing value using linear regression.
(5) Obtain the coefficients of the regression from the linear model object lmObj.
(6) Add a bias of 1 at the beginning of the target row to allow for the bias parameter.
(7) Perform a vector multiplication between the modified target row and the coefficients of regression and add the elements of the resulting vector together to get the estimated value.
(8) Replace the missing value with the estimated value.
(9) Go to step (2) and repeat the above steps to fill the missing values as long as the "least number of missing values" mentioned in step (2) is less than or equal to 3.
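The regression steps can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: the response is assumed to be the similar genes' values in the missing column, the predictors are assumed to be their values in the remaining columns, and the fit is done through the normal equations, which requires at least as many similar genes as coefficients and a nonsingular system. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(rows, y):
    """Least-squares coefficients [bias, w1, ..., wp] from the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


def estimate_missing(target_known, similar_known, similar_at_missing):
    """Prepend a bias of 1 to the target row and take its dot product with the coefficients."""
    coef = fit_linear(similar_known, similar_at_missing)
    augmented = [1.0] + list(target_known)
    return sum(c * v for c, v in zip(coef, augmented))


# Toy data generated from y = 1 + 2*x1 + 3*x2.
rows = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [1, 3, 4, 6]
print(estimate_missing([2, 2], rows, y))
```

With real microarray data the number of columns often exceeds the number of similar genes, in which case a rank-aware solver would be needed instead of the plain normal equations used in this sketch.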

Although the steps above describe filling the missing values continuously up to a stopping point, in practice the filling process is paused at intermediate points in order to assess the algorithm.

After we have filled in all the missing values corresponding to rows with single missing values we select a particular collection of row-column positions corresponding to rows that did not have missing values initially and deliberately treat the values at these positions as missing and use the exact same process to estimate the values.

The same collection of row-column positions is used again when the algorithm has filled all the rows with up to two missing values, and again when it has filled the rows with up to three missing values.
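The assessment described above (hide known cells, re-estimate them, and summarize the errors by RMSE) can be sketched as follows. The `estimator` argument is a hypothetical callable standing in for any imputation routine; it must not read the hidden cell itself. The toy estimator used in the example is a simple row mean, not LRFDVImpute.

```python
import math


def rmse(originals, estimates):
    """Root mean square of the absolute errors between original and estimated values."""
    errors = [abs(o - e) for o, e in zip(originals, estimates)]
    return math.sqrt(sum(err * err for err in errors) / len(errors))


def evaluate(matrix, positions, estimator):
    """Treat the cells at `positions` as missing, re-estimate them, and return the RMSE."""
    originals, estimates = [], []
    for r, c in positions:
        originals.append(matrix[r][c])
        estimates.append(estimator(matrix, r, c))
    return rmse(originals, estimates)


def row_mean_of_others(matrix, r, c):
    """Toy estimator for this example: mean of the other cells in the same row."""
    others = [v for j, v in enumerate(matrix[r]) if j != c]
    return sum(others) / len(others)


data = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
print(evaluate(data, [(0, 1), (1, 0)], row_mean_of_others))
```

Reusing the same list of positions at each stage, as the text describes, keeps the RMSE values comparable across stages.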

3.2. Gene Ranking Using Genetic Algorithm

In phase 2 of the proposed work, the result of the missing value estimation procedure carried out in phase 1 is biologically validated by ranking the genes using GA. Since a characteristic of gene expression microarray data is that the number of variables (genes) far exceeds the number of samples, we must reduce its dimension. Executing GA on the original dataset is quite impractical and time consuming. As a preprocessing step, we have reduced the dimension using the Wilcoxon rank sum test.

3.2.1. Dimension Reduction Using Wilcoxon Rank Sum Test (WRST)

The inputs to the Wilcoxon rank sum test function are the two gene sets, the diseased set and the normal set, both of which have individually undergone the missing value estimation procedure (if there were any missing values). The two gene sets may have different numbers of samples. Both sets are gene expression matrices with one row per gene and one column per sample; they contain the same genes, but the number of samples in the diseased set may differ from that in the normal set. The Wilcoxon rank sum function processes the two datasets in order to find out for which genes the null hypothesis is accepted or rejected. It returns two values, the $h$ value and the $p$-value. The null hypothesis for our problem is that the genes are not differentially expressed; that is, either all the samples have come from diseased patients or they have come from normal patients. The alternative hypothesis is that the genes are differentially expressed. We record the $h$ values and $p$-values for each gene.

In the next step, we consider only those genes for which the alternative hypothesis holds ($h = 1$) at the significance level alpha and sort the genes according to their $p$-values, thereby ranking the genes. We then select the topmost pop_size genes, where pop_size is the population size used later for the GA. Thus, we have two reduced populations, one representing diseased and the other representing normal tissues. Both contain the same reduced set of pop_size genes, while the number of samples in the diseased set may differ from the number of samples in the normal set.
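The filtering step can be sketched with a self-contained rank sum test. The sketch below uses the normal approximation without tie or continuity correction, which is a simplification chosen for this example; a statistics library routine may use an exact method or corrections and return slightly different values. Sorting by ascending p-value and the names `rank_sum_p_value` and `select_top_genes` are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided p-value of the rank sum statistic under the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def select_top_genes(diseased, normal, alpha, pop_size):
    """Keep genes whose null hypothesis is rejected, ordered by ascending p-value."""
    scored = []
    for g, (d_row, n_row) in enumerate(zip(diseased, normal)):
        p = rank_sum_p_value(d_row, n_row)
        if p < alpha:  # corresponds to rejecting the null hypothesis
            scored.append((p, g))
    scored.sort()
    return [g for _, g in scored[:pop_size]]


diseased = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9]]
normal = [[6, 7, 8, 9, 10], [2, 4, 6, 8, 10]]
print(select_top_genes(diseased, normal, alpha=0.05, pop_size=10))
```

The two input matrices may have different numbers of columns, since the test compares the two groups of samples gene by gene.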

3.2.2. Chromosome Representation and Initial Population for GA

The reduced diseased and normal gene sets serve as the initial population for the genetic algorithm step. They contain pop_size genes, a number preselected by the user. We use real value encoding to represent each chromosome; that is, each chromosome is the vector of measurements recorded for one gene, whose $j$th entry is the measurement for that gene in the $j$th sample of the corresponding population.

3.2.3. Fitness Calculation

The fitness for each gene in the reduced gene sets is again calculated by a method similar to that used in [14] where gene expression profiles have been considered as continuous time series curves.

In our problem, we have two populations, one for the diseased tissues and the other for the normal tissues. The two populations contain the same number of genes but may have different number of samples. In that case, we consider the minimum of the two and extract the same number of samples from each set.

Let $G'$ be the reduced set of genes in each population. If $g_i \in G'$, then for each population, the difference vector of $g_i$ is calculated using (1). Once the difference vectors are calculated for each of the two populations, say $V_i^{D}$ (for diseased) and $V_i^{N}$ (for normal), the number of matches between the difference vectors and the membership grade for $g_i$ are computed using (2).

The fitness of gene $g_i$ is the reciprocal of its membership grade $\mu(g_i)$ and is calculated as

$$\mathrm{fitness}(g_i) = \frac{1}{\mu(g_i)}.$$

This signifies that the more similar the distributions of gene $g_i$ in the two populations are, the less differentially expressed the gene is, and vice versa. Thus, a fitter gene will have different distributions in the two populations. We then rank the genes in order of their fitness.
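A short sketch of the fitness ranking follows. It takes the membership grades as given inputs. The small floor `eps`, which avoids division by zero when no component matches, is an assumption added for this example and is not described in the text; the function names are hypothetical.

```python
def fitness(mu, eps=1e-12):
    """Reciprocal of the membership grade; eps guards against a grade of zero (assumption)."""
    return 1.0 / max(mu, eps)


def rank_by_fitness(memberships):
    """Gene indices ordered from fittest (least similar across populations) to least fit."""
    return sorted(range(len(memberships)),
                  key=lambda g: fitness(memberships[g]),
                  reverse=True)


print(rank_by_fitness([1.0, 0.25, 0.5]))
```

A gene whose two profiles move in the same direction at every step gets the lowest fitness, matching the intuition that it is not differentially expressed.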

3.2.4. Elitism

We have used an elitist version of GA where the best chromosomes are carried forward to the next generation unchanged; that is, the crossover and mutation operators are not applied on the best chromosomes. This technique ensures faster convergence of the process by keeping track of the best solutions.

3.2.5. Selection

For selection, we have used a roulette wheel technique where genes are selected based on their relative fitness values. The better the chromosomes are, the greater their chances of being selected. Let count be the number of elite children. We construct a roulette wheel as follows [22]:
(i) Calculate the fitness value $eval(v_i)$ for each chromosome $v_i$, $i = 1, \ldots, pop\_size$.
(ii) Find the total fitness of the population, $F = \sum_{i=1}^{pop\_size} eval(v_i)$.
(iii) Calculate the probability of selection $p_i$ for each chromosome $v_i$, $i = 1, \ldots, pop\_size$: $p_i = eval(v_i) / F$.
(iv) Calculate a cumulative probability $q_i$ for each chromosome $v_i$, $i = 1, \ldots, pop\_size$: $q_i = \sum_{j=1}^{i} p_j$.
We now spin the wheel (pop_size − count) times and each time select a single chromosome as follows:
(i) Generate a random number (float) $r$ between 0 and 1.
(ii) If $r < q_1$, we select the first chromosome $v_1$; otherwise, select the $i$th chromosome $v_i$ ($2 \le i \le pop\_size$) such that $q_{i-1} < r \le q_i$.
Some chromosomes get selected more than once. According to the Schema Theorem [24], the best chromosomes get more copies, the average stay even, and the worst die off.
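Roulette wheel selection as described above can be sketched with cumulative probabilities and a binary search. The function names are hypothetical, and the `rng` parameter is added only so that the example can be seeded.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def cumulative_probabilities(fitnesses):
    """Running sum of each chromosome's share of the total fitness."""
    total = sum(fitnesses)
    return list(accumulate(f / total for f in fitnesses))


def spin(cumulative, r):
    """Index of the first chromosome whose cumulative probability is at least r."""
    # min() guards against r exceeding the last entry because of rounding.
    return min(bisect_left(cumulative, r), len(cumulative) - 1)


def roulette_select(fitnesses, n_select, rng=random):
    """Spin the wheel n_select times; chromosomes may be chosen more than once."""
    q = cumulative_probabilities(fitnesses)
    return [spin(q, rng.random()) for _ in range(n_select)]


# With pop_size = 3 and count = 1 elite child, the wheel is spun twice.
print(roulette_select([1.0, 1.0, 2.0], n_select=2, rng=random.Random(0)))
```

Each chromosome occupies a slice of the unit interval proportional to its fitness, so fitter chromosomes are hit more often.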

3.2.6. Crossover

For crossover, we proceed as follows.

For each chromosome in the population,
(i) generate a random number (float) $r$ between 0 and 1,
(ii) if $r < p_c$ (crossover probability), we select the given chromosome for crossover.
We have used single point crossover, where the crossover site is also generated randomly along the length of the chromosome, which equals the number of samples. Thus after crossover, a pair of parent chromosomes generates a pair of offspring chromosomes [25]. The new population obtained after crossover contains the new generation produced by crossover as well as the elite children that did not undergo crossover. This new population is used in the mutation process.
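Single point crossover can be sketched as below. The exact range of the crossover site is not recoverable from the text, so the sketch assumes a site between 1 and the chromosome length minus 1, which guarantees that both parents contribute to each child. The function names and the `rng` parameter are hypothetical.

```python
import random


def choose_for_crossover(population, p_c, rng=random):
    """Indices of chromosomes whose random draw falls below the crossover probability."""
    return [i for i in range(len(population)) if rng.random() < p_c]


def random_site(length, rng=random):
    """Crossover site in 1 .. length - 1 (assumed range)."""
    return rng.randint(1, length - 1)


def single_point_crossover(parent_a, parent_b, site):
    """Swap the tails of two parents after the crossover site."""
    child_a = parent_a[:site] + parent_b[site:]
    child_b = parent_b[:site] + parent_a[site:]
    return child_a, child_b


a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
print(single_point_crossover(a, b, site=1))
```

Elite chromosomes would simply be excluded from the list passed to `choose_for_crossover`, so they reach the next generation unchanged.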

3.2.7. Mutation

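A nonuniform mutation operator as proposed in the literature [25] has been used here. The operator is defined as follows:

(i) A random experiment is carried out which produces an outcome that is either 0 or 1.
(ii) Another random number pos is generated within the length of the chromosome, which equals the number of samples, to select the mutation site.
(iii) Let $s^t = \langle v_1, \ldots, v_n \rangle$ be the chromosome at generation $t$, where $n$ is the number of samples, and let $v_{pos}$ be selected for mutation. The domain of $v_{pos}$ is $[l, u]$; the resultant vector is $s^{t+1} = \langle v_1, \ldots, v'_{pos}, \ldots, v_n \rangle$, with

$$v'_{pos} = \begin{cases} v_{pos} + \Delta(t, u - v_{pos}) & \text{if the random outcome is } 0, \\ v_{pos} - \Delta(t, v_{pos} - l) & \text{if the random outcome is } 1, \end{cases}$$

where $t$ is the generation number and the function $\Delta(t, y)$ returns a value in the range $[0, y]$ such that the probability of $\Delta(t, y)$ being close to 0 increases as $t$ increases. This property causes this operator to search the space uniformly initially (when $t$ is small) and very locally at later stages.

$\Delta(t, y)$ is calculated as

$$\Delta(t, y) = y \left(1 - r^{\left(1 - t/T\right)^{b}}\right),$$

where $r$ is a random number in the range $[0, 1]$, $T$ is the maximum number of generations preselected by the user, and $b$ is a system parameter determining the degree of uniformity. We have used a fixed value of $b$ for our experiment.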
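The nonuniform mutation step can be sketched as follows, assuming the standard form of the operator from the genetic algorithm literature, in which the step size shrinks as the generation number approaches the maximum. The parameter value `b=5` in the example is purely illustrative and is not the value used in the experiments; all function and parameter names are hypothetical.

```python
import random


def delta(t, y, r, max_gen, b):
    """Step size in [0, y] that shrinks toward 0 as generation t approaches max_gen."""
    return y * (1.0 - r ** ((1.0 - t / max_gen) ** b))


def nonuniform_mutate(chrom, pos, lower, upper, t, max_gen, b, rng=random):
    """Move the value at pos toward the upper or lower bound by a nonuniform step."""
    mutated = list(chrom)
    v = chrom[pos]
    r = rng.random()
    if rng.randint(0, 1) == 0:  # random outcome 0: move toward the upper bound
        mutated[pos] = v + delta(t, upper - v, r, max_gen, b)
    else:                       # random outcome 1: move toward the lower bound
        mutated[pos] = v - delta(t, v - lower, r, max_gen, b)
    return mutated


chrom = [0.2, -0.4, 1.1]
print(nonuniform_mutate(chrom, pos=1, lower=-3.0, upper=3.0,
                        t=10, max_gen=100, b=5, rng=random.Random(0)))
```

Because the step is bounded by the distance to the chosen bound, the mutated value always stays inside its domain.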

The entire genetic transformation has been performed on one population with respect to the other. We made the diseased gene set undergo genetic transformation, while fitness evaluation has been made with respect to the normal gene set. The opposite transformation will produce similar results.

Once the genetic transformations are done, we obtain a final population set (here, the genetically transformed diseased gene set), which has been ranked in order of fitness. We compare the two ranks, one by the Wilcoxon method and the other by our GA method. A threshold of 2 has been considered while comparing the two ranks. Results show that there is a good percentage of matches in the two ranks. Moreover, we find the top ranked genes produced by both methods, and the significant genes produced by the two methods are also similar. This also validates the result of the missing value estimation method carried out in phase 1.

3.3. Gene Classification Using SVM

In order to demonstrate the significance of ranking by our GA method, we perform classification. The top $n'$ genes, as ranked by our GA method, are used for the purpose. We use $k$-fold LOO cross-validations, where $k$ is varied from one dataset to another depending on the number of samples. For cross-validation, we have divided our dataset into two sets, a training set and a testing set, in an 80 : 20 ratio. This ratio was chosen because 80 : 20 is a commonly occurring ratio, often referred to as the Pareto Principle. So, if there are $0.8N$ samples in the training set and $0.2N$ samples in the test set, where $N$ is the total number of samples, the training set is divided into $k$ equal sized subsets. Of the $k$ subsets, one subset is retained for validation and the remaining $k - 1$ subsets are used as training data. Thus, SVM classifiers with linear kernel are trained using the training subsets. The classification accuracy rates are recorded and the classifier with the best accuracy rate is used to test the test samples.
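The data partitioning described above can be sketched independently of the classifier. The sketch only produces sample indices; training the linear SVM on each split is omitted. The function names, the shuffling, and the way uneven fold sizes are handled (sizes differ by at most one) are assumptions.

```python
import random


def train_test_split(n_samples, train_fraction=0.8, rng=random):
    """Shuffle sample indices and split them into training and test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]


def k_folds(train_indices, k):
    """Yield (training, validation) index lists, holding out one subset at a time."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [s for j, f in enumerate(folds) if j != i for s in f]
        yield training, validation


train, test = train_test_split(10, rng=random.Random(0))
for training, validation in k_folds(train, k=4):
    print(sorted(training), sorted(validation))
```

One classifier would be fitted per fold, and the one with the best validation accuracy would then be applied to the held-out test indices.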

4. Experimental Results

4.1. Datasets Used

The missing value estimation part of the proposed modified LRFDVImpute technique has been evaluated on the publicly available yeast cell cycle time series dataset from Spellman et al. [26] described in Table 1.


Table 1: Yeast Saccharomyces cerevisiae dataset of Spellman et al. [26].

| Dataset | Start | End | Sampling | Complete genes |
|---|---|---|---|---|
| alpha | 0 min | 119 min | Every 7 min | 4489 |
| cdc15 | 10 min | 290 min | Every 20 min for 1 hr, 10 min for 3 hr, and 20 min for final hr | 4381 |
| cdc28 | 0 min | 160 min | Every 10 min | 1383 |
| elu | 0 min | 390 min | Every 30 min | 5766 |

Source: http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt.
Organism: yeast.

After the experiments on Spellman dataset are done, the combined gene ranking and classification portion of the proposed method are evaluated on four publicly available datasets: Colorectal Cancer tumours dataset (GDS4382), Breast Cancer dataset (GSE349-350), Prostate Cancer dataset, and Leukaemia Cancer dataset (DLBCL-FL).

4.2. Platform Used

All algorithms have been implemented using MATLAB R2013a in Windows 8.1.

4.3. Results
4.3.1. Results of Missing Value Estimation Part

We perform the initial estimation using the modified version of LRFDVImpute with a chosen membership grade $\theta$. After the initial estimation is over, we forcibly treat cells at specified locations as missing and estimate them using different values of the membership grade $\theta$ with both the earlier and the modified versions. This has been carried out only once, after estimating rows with single missing values, and the corresponding RMSE values have been recorded. We have performed our experiments only on the alpha, cdc15, and elu data of the Spellman dataset. The number of missing values is too large for cdc28, which is why we ignore that segment. The results for the alpha, cdc15, and elu datasets using both methods are shown in Tables 2–4. Figures 3–5 show the corresponding plots of RMSE versus membership grade for each of the three datasets.


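Table 2: RMSE for different membership grades θ (alpha).

| RMSE | θ = 0.4 | θ = 0.45 | θ = 0.5 | θ = 0.55 | θ = 0.6 | θ = 0.65 | θ = 0.7 |
|---|---|---|---|---|---|---|---|
| Original LRFDVImpute | 0.012405344 | 0.012488181 | 0.012562782 | 0.012562782 | 0.012690904 | 0.012197374 | 0.012638865 |
| Modified LRFDVImpute | 0.012439936 | 0.012439366 | 0.012389872 | 0.012389872 | 0.012645466 | 0.011988263 | 0.013268721 |

Table 3: RMSE for different membership grades θ (cdc15).

| RMSE | θ = 0.4 | θ = 0.45 | θ = 0.5 | θ = 0.55 | θ = 0.6 | θ = 0.65 | θ = 0.7 |
|---|---|---|---|---|---|---|---|
| Original LRFDVImpute | 0.016832094 | 0.016760968 | 0.016706119 | 0.016682418 | 0.016768837 | 0.016733642 | 0.049482242 |
| Modified LRFDVImpute | 0.016781257 | 0.016710318 | 0.016736613 | 0.016723349 | 0.016637753 | 0.017023671 | 0.057225615 |

Table 4: RMSE for different membership grades θ (elu).

| RMSE | θ = 0.4 | θ = 0.45 | θ = 0.5 | θ = 0.55 | θ = 0.6 | θ = 0.65 | θ = 0.7 |
|---|---|---|---|---|---|---|---|
| Original LRFDVImpute | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Modified LRFDVImpute | 0 | 0 | 0 | 0 | 0 | 0 | 0 |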

Table 5 compares the performance of both versions of LRFDVImpute method to that of some other existing methods, like SVDImpute, LLSImpute, FDVLLSImpute, FDVSPLINEImpute, and so forth, and the results show that modified version of LRFDVImpute outperforms the other existing methods as far as RMSE value is concerned.


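Table 5: RMSE of both versions of LRFDVImpute compared with existing methods.

| Dataset | SVDImpute | LLSImpute | FDVLLSImpute | FDVSPLINEImpute | FDVLRImpute (Original LRFDVImpute) | FDVLRImpute (Modified LRFDVImpute) |
|---|---|---|---|---|---|---|
| alpha | 0.03395 | 0.07853 | 0.096 | 0.063 | 0.012562782 | 0.012389872 |
| cdc15 | 0.05055 | 0.1208 | 0.258 | 0.127 | 0.016682418 | 0.016723349 |
| elu | 0.01585 | 0.0033 | 0.044 | 0.019 | 0 | 0 |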

4.3.2. Combined Results

We test the significance of our proposed missing value estimation technique using the gene ranking method. We have not found any state-of-the-art work on gene ranking so far in which the Spellman dataset is used. That is why we use four more publicly available real-life gene expression datasets, namely, the Colorectal Cancer dataset (GDS4382), Breast Cancer dataset (GSE349-350), Prostate Cancer dataset, and Leukaemia Cancer dataset (DLBCL-FL) [4, 27–32], to perform steps such as missing value estimation and gene ranking and to analyze the results. We start with the microarray dataset containing missing values and apply our proposed missing value estimation technique to estimate the genes with missing values (if any). We rank them using the proposed gene ranking method and find the top ranked genes. We then forcibly insert missing values in the top ranked genes and again estimate them using the same missing value estimation technique. Finally, we rank them once more to find the top ranked genes. Results show that most of the top ranked genes remain the same, which implies that the proposed missing value estimation technique has been accurate in estimating the unknown values. We have normalized most of the datasets using the $z$-score normalization method in order to bring the data values to a common scale.
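The normalization mentioned above can be sketched as below, assuming it refers to the standard z-score (subtract the mean, divide by the standard deviation). Whether it is applied per gene or per sample is not stated, so the sketch simply normalizes one vector of at least two values; the use of the sample standard deviation and the zero output for a constant vector are assumptions.

```python
import math


def z_score(values):
    """Center the values on their mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        # Assumption for this sketch: a constant vector maps to all zeros.
        return [0.0] * n
    return [(v - mean) / sd for v in values]


print(z_score([1.0, 2.0, 3.0]))
```

After this transformation, expression values from different datasets sit on a comparable scale, typically within a few units of zero.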

Tables 6, 8, 10, and 13 show the estimated values for the four datasets, Tables 7, 9, 11, and 14 show the common gene indices before and after the estimation, and Tables 12 and 15 compare the performance of the proposed approach with two state-of-the-art methods [22, 23] for the Prostate and Leukaemia datasets on the basis of the accuracy, sensitivity, specificity, $F$-score, and $G$-mean metrics. We have found that Prostate and Leukaemia are the common datasets on which both existing methods have been evaluated. The results show that, as far as those metrics are concerned, the proposed gene ranking approach performs far better than those existing approaches, where one is a PSO based graph theoretic approach [22] and the other is the web based tool DWFS, which uses KNN and NBC classifiers [23].
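The comparison metrics can be computed from the four counts of a two-class confusion matrix. The sketch below uses the usual textbook definitions, with the F-score as the harmonic mean of precision and sensitivity and the G-mean as the geometric mean of sensitivity and specificity; that these are the exact variants used in the comparison is an assumption. The function assumes that each class has at least one sample and that at least one positive prediction was made.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard two-class metrics from true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
        "g_mean": math.sqrt(sensitivity * specificity),
    }


print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
```

A classifier that reaches 100% accuracy scores 1.0 on all five metrics, which is why the additional metrics matter mainly when accuracy is below that level.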


Table 6: Colorectal Cancer GDS4382. Missing values inserted (row, column), old values, and estimated values with mem = 0.55 for the Cancer and Normal sets.

| Cancer: row | Cancer: column | Cancer: old value | Cancer: estimated value | Normal: row | Normal: column | Normal: old value | Normal: estimated value |
|---|---|---|---|---|---|---|---|
| 714 | 12 | 0.677252397 | 0.703080766 | 714 | 12 | 0.140108496 | −0.163286572 |
| 1245 | 5 | 1.56686079 | 1.755118642 | 1245 | 5 | 1.212949296 | 1.148663192 |
| 1578 | 3 | 0.787510895 | 1.003134829 | 1578 | 3 | 0.256998387 | 0.22381735 |
| 1763 | 11 | 1.024714768 | 0.737414862 | 1763 | 11 | 0.20681001 | 0.056394235 |
| 2792 | 9 | −0.861395162 | −0.865043164 | 2792 | 9 | −1.064597763 | −1.005338727 |
| 4025 | 1 | 0.781326343 | 1.137532185 | 4025 | 1 | −0.31562626 | −0.297767551 |
| 4134 | 15 | 0.892338308 | 0.958542974 | 4134 | 15 | 0.329755342 | 0.247541595 |
| 5082 | 2 | −0.006360431 | 0.32763543 | 5082 | 2 | −0.480000425 | −0.370717546 |
| 8426 | 13 | 1.210288879 | 0.790345743 | 8426 | 13 | −0.082823022 | 0.157094029 |
| 9979 | 6 | 2.932068401 | 2.953449691 | 9979 | 6 | 2.387826603 | 2.246418246 |
| 10083 | 11 | 1.369142269 | 1.258410425 | 10083 | 11 | 1.915272483 | 1.809541605 |
| 10145 | 10 | 1.940467541 | 1.852338203 | 10145 | 10 | 2.750210569 | 2.834220655 |
| 10208 | 3 | 1.868505436 | 1.778084682 | 10208 | 3 | 2.224446282 | 2.122262802 |
| 10280 | 6 | 1.424951861 | 1.521769713 | 10280 | 6 | 0.98680627 | 1.139738659 |
| 10323 | 1 | 0.895845032 | 0.963530648 | 10323 | 1 | 0.378278185 | 0.376819137 |
| 10725 | 14 | 0.562534293 | 0.903674258 | 10725 | 14 | 1.263242627 | 1.285858157 |
| 10789 | 4 | 1.582210172 | 1.834961289 | 10789 | 4 | 0.507667085 | 0.540477481 |
| 10855 | 10 | 3.17769843 | 3.162360832 | 10855 | 10 | 2.562212194 | 2.546399932 |
| 11050 | 3 | 2.676634486 | 2.657786027 | 11050 | 3 | 1.872943693 | 1.894214081 |
| 11055 | 8 | 2.112261939 | 2.105431748 | 11055 | 8 | 1.680697142 | 1.713521495 |
| 11100 | 16 | 2.522783399 | 2.455429314 | 11100 | 16 | 3.208462284 | 3.064140274 |
| 11465 | 1 | 2.454481056 | 2.18211664 | 11465 | 1 | 1.232460868 | 1.356633277 |
| 11485 | 12 | −0.701537989 | −0.49772711 | 11485 | 12 | 0.609015246 | 0.316407836 |
| 11650 | 6 | 1.470662615 | 1.35686403 | 11650 | 6 | 1.940624638 | 2.047625027 |
| 11677 | 4 | 2.591635801 | 2.602068714 | 11677 | 4 | 2.903830552 | 2.934334312 |


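Ranking

| Rank | Gene indices prior to missing value insertion | Gene indices after missing value insertion |
| --- | --- | --- |
| 1 | 714 | 714 |
| 2 | 1245 | 1245 |
| 3 | 1578 | 1578 |
| 4 | 1763 | 1763 |
| 5 | 2792 | 2792 |
| 6 | 4025 | 4025 |
| 7 | 4134 | 4134 |
| 8 | 5082 | 5082 |
| 9 | 8426 | 8426 |
| 10 | 9979 | 9979 |
| 11 | 10083 | 10083 |
| 12 | 10145 | 10145 |
| 13 | 10208 | 10208 |
| 14 | 10280 | 10280 |
| 15 | 10323 | 10323 |
| 16 | 10725 | 10725 |
| 17 | 10789 | 10789 |
| 18 | 10855 | 10855 |
| 19 | 11050 | 11050 |
| 20 | 11055 | 11055 |
| 21 | 11100 | 11100 |
| 22 | 11465 | 11465 |
| 23 | 11485 | 11485 |
| 24 | 11650 | 11650 |
| 25 | 11677 | 11677 |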

Number of common genes in top 25 positions = 25
% of common genes = 100


Breast Cancer: missing values inserted (row, column), old value, and estimated value with mem = 0.55

| Cancer: at row | Cancer: at column | Cancer: Old_value | Cancer: estimated value | Normal: at row | Normal: at column | Normal: Old_value | Normal: estimated value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 272 | 12 | −0.354039943 | −0.419709522 | 272 | 8 | −0.407057103 | −0.396601672 |
| 329 | 5 | −0.176687651 | −0.295980517 | 329 | 2 | −0.021584122 | 0.092981006 |
| 491 | 3 | 0.222486126 | 0.302600154 | 491 | 10 | 0.140530363 | 0.067353238 |
| 869 | 11 | 0.79646566 | 0.852100043 | 869 | 5 | 0.345360946 | 0.279331888 |
| 1143 | 9 | 0.017956445 | 0.256958128 | 1143 | 4 | 0.054582833 | −0.145618319 |
| 1937 | 1 | 0.22672464 | 0.33943153 | 1937 | 7 | −0.128257731 | 0.05338273 |
| 2825 | 8 | 7.552441375 | 8.108907291 | 2825 | 3 | 6.080319682 | 5.969522121 |
| 3004 | 2 | 0.128475064 | 0.04404869 | 3004 | 6 | −0.124353092 | −0.113116385 |
| 4911 | 4 | −0.550612392 | −0.372472073 | 4911 | 9 | −0.090276567 | −0.152051028 |
| 5328 | 6 | 10.57228289 | 9.556257324 | 5328 | 5 | 8.316072021 | 7.887609502 |
| 6184 | 11 | −0.493537621 | −0.504575199 | 6184 | 8 | 0.035382884 | −0.132574115 |
| 7941 | 10 | 0.233974208 | 0.14984931 | 7941 | 6 | 0.077199916 | 0.101298927 |
| 8452 | 14 | −0.312716903 | −0.32992237 | 8452 | 3 | −0.071344506 | −0.092371778 |
| 9076 | 6 | 0.708996621 | 0.154489296 | 9076 | 1 | 0.060554625 | 0.036078623 |
| 9267 | 13 | −0.270092957 | −0.33842598 | 9267 | 10 | −0.034956721 | 0.30142636 |
| 9574 | 7 | 0.093118778 | 0.074772718 | 9574 | 7 | −0.227882015 | −0.159726559 |
| 9723 | 4 | −0.018883445 | −0.279787171 | 9723 | 5 | 0.407076237 | 0.815509208 |
| 9753 | 10 | −0.23265185 | −0.264909557 | 9753 | 8 | −0.228475796 | −0.265794544 |
| 9905 | 3 | −0.31537376 | −0.417539213 | 9905 | 3 | −0.286431974 | −0.199211435 |
| 10319 | 8 | −0.049692673 | −0.038561161 | 10319 | 9 | −0.218073382 | −0.242557019 |
| 10614 | 9 | −0.511734814 | −0.430792872 | 10614 | 2 | −0.268390734 | −0.1922339 |
| 11377 | 1 | −0.055809992 | 0.083662727 | 11377 | 7 | −0.268295765 | −0.144527083 |
| 11737 | 12 | −0.34275829 | −0.269422135 | 11737 | 4 | −0.387083296 | −0.093221982 |
| 11976 | 6 | 0.030992978 | −0.133189906 | 11976 | 9 | −0.270598103 | −0.272899078 |
| 12053 | 4 | −0.374640827 | −0.328380483 | 12053 | 10 | −0.360077886 | −0.244255658 |


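Ranking

| Rank | Gene indices prior to missing value insertion | Gene indices after missing value insertion |
| --- | --- | --- |
| 1 | 3004 | 1143 |
| 2 | 7941 | 3004 |
| 3 | 10319 | 7941 |
| 4 | 869 | 10319 |
| 5 | 5328 | 11737 |
| 6 | 9723 | 491 |
| 7 | 491 | 9753 |
| 8 | 9574 | 869 |
| 9 | 9905 | 5328 |
| 10 | 11737 | 9723 |
| 11 | 2825 | 12053 |
| 12 | 4911 | 2825 |
| 13 | 8452 | 9574 |
| 14 | 272 | 9905 |
| 15 | 329 | 4911 |
| 16 | 6184 | 8452 |
| 17 | 9076 | 9076 |
| 18 | 9267 | 329 |
| 19 | 9753 | 6184 |
| 20 | 10614 | 9267 |
| 21 | 11377 | 11976 |
| 22 | 11976 | 2218 |
| 23 | 12053 | 2459 |
| 24 | 1143 | 2995 |
| 25 | 1937 | 4200 |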

Number of common genes in top 25 positions = 21
% of common genes = 84
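
The count of common genes is a set intersection of the two top-25 lists. The sketch below recomputes it for the Breast Cancer ranking above. The gene indices are as read from the flattened ranking table, and the function name `top_k_overlap` is hypothetical.

```python
def top_k_overlap(before, after, k=25):
    """Return (number, percentage) of genes present in both top-k lists."""
    common = set(before[:k]) & set(after[:k])
    return len(common), 100.0 * len(common) / k

# Top 25 gene indices prior to missing value insertion (Breast Cancer).
before = [3004, 7941, 10319, 869, 5328, 9723, 491, 9574, 9905, 11737,
          2825, 4911, 8452, 272, 329, 6184, 9076, 9267, 9753, 10614,
          11377, 11976, 12053, 1143, 1937]
# Top 25 gene indices after missing value insertion and re-estimation.
after = [1143, 3004, 7941, 10319, 11737, 491, 9753, 869, 5328, 9723,
         12053, 2825, 9574, 9905, 4911, 8452, 9076, 329, 6184, 9267,
         11976, 2218, 2459, 2995, 4200]

count, percent = top_k_overlap(before, after)
print(count, percent)  # 21 84.0
```

The measure ignores rank position: a gene counts as common if it appears anywhere in both top-25 lists, even when its rank changes.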


Prostate Cancer: missing values inserted (row, column), old value, and estimated value with mem = 0.55

| Cancer: at row | Cancer: at column | Cancer: Old_value | Cancer: estimated value | Normal: at row | Normal: at column | Normal: Old_value | Normal: estimated value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 205 | 45 | 7.890085103 | 9.434976702 | 205 | 45 | 5.168380524 | 5.416011322 |
| 2839 | 5 | −0.176203136 | 0.004447962 | 2839 | 5 | −0.173585261 | −0.183852307 |
| 3649 | 10 | 0.296457522 | 0.046654036 | 3649 | 10 | 0.329958724 | 0.617420849 |
| 3794 | 8 | −0.241993865 | −0.216817937 | 3794 | 8 | −0.106322335 | −0.109294537 |
| 4365 | 17 | 0.059874409 | 2.443242853 | 4365 | 17 | −0.122967971 | −0.099879376 |
| 5757 | 14 | 0.090277183 | 0.70723886 | 5757 | 14 | −0.196702226 | −0.067191045 |
| 5944 | 22 | −0.180239863 | −0.186977867 | 5944 | 22 | −0.137411352 | −0.217659599 |
| 6185 | 36 | 1.557345019 | 1.548681099 | 6185 | 36 | −0.010802226 | −0.181261903 |
| 6462 | 28 | −0.16997265 | −0.175762613 | 6462 | 28 | 0.202288499 | 0.191383577 |
| 7247 | 32 | 0.859643204 | 0.956551639 | 7247 | 32 | 0.932414697 | 2.21258573 |
| 7520 | 18 | 0.270631813 | 0.179429504 | 7520 | 18 | 0.064953651 | 0.155169008 |
| 7557 | 11 | 0.393712194 | 0.375941042 | 7557 | 11 | 1.450975133 | 1.084180805 |
| 7768 | 3 | 0.042090452 | 0.115633217 | 7768 | 3 | 0.59875936 | 0.576608917 |
| 8123 | 47 | 0.216968028 | 0.283330867 | 8123 | 47 | 0.128003038 | 0.165883246 |
| 8554 | 4 | 0.015161421 | 0.083268762 | 8554 | 4 | 0.40771258 | −0.051156112 |
| 8768 | 26 | −0.135452978 | −0.04501778 | 8768 | 26 | −0.138712043 | −0.188517852 |
| 8850 | 17 | 2.708006948 | 1.916254815 | 8850 | 17 | −0.364936796 | −0.228740529 |
| 9034 | 29 | −0.190217605 | −0.178815842 | 9034 | 29 | −0.194249889 | −0.047431811 |
| 9050 | 34 | 0.351634344 | 0.174924563 | 9050 | 34 | 0.284389403 | 0.469260339 |
| 9172 | 42 | 2.277059356 | 2.347704101 | 9172 | 42 | 1.727648712 | 1.562259352 |
| 9850 | 50 | −0.046243874 | −0.290856086 | 9850 | 50 | 1.371476261 | 2.134924482 |
| 10138 | 14 | 0.599428228 | 0.628396963 | 10138 | 14 | 0.765017083 | 1.001711406 |
| 10494 | 22 | −0.189181177 | −0.122575727 | 10494 | 22 | 0.220022109 | 0.365466432 |
| 10537 | 48 | 0.139373755 | 0.406906442 | 10537 | 48 | 0.035059438 | 0.011805144 |
| 10956 | 7 | 0.337716243 | 0.155279221 | 10956 | 7 | 0.146282156 | 0.169249764 |


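Ranking

| Rank | Gene indices prior to missing value insertion | Gene indices after missing value insertion |
| --- | --- | --- |
| 1 | 6185 | 6185 |
| 2 | 10494 | 10494 |
| 3 | 9850 | 4365 |
| 4 | 4365 | 9850 |
| 5 | 10138 | 9034 |
| 6 | 9172 | 10138 |
| 7 | 9034 | 5944 |
| 8 | 5944 | 9172 |
| 9 | 3649 | 3649 |
| 10 | 8554 | 2839 |
| 11 | 2839 | 7557 |
| 12 | 7557 | 10956 |
| 13 | 205 | 9050 |
| 14 | 3794 | 7520 |
| 15 | 10956 | 3794 |
| 16 | 8850 | 205 |
| 17 | 7520 | 8850 |
| 18 | 9050 | 10537 |
| 19 | 10537 | 5757 |
| 20 | 5757 | 8554 |
| 21 | 8123 | 8768 |
| 22 | 6462 | 8123 |
| 23 | 8768 | 6462 |
| 24 | 7247 | 7247 |
| 25 | 7768 | 9093 |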

Number of common genes in top 25 positions = 24
% of common genes = 96


Dataset name: Prostate

| Number of genes | Number of samples | Number of samples in Cancer dataset | Number of samples in normal dataset | Number of samples in training dataset | Number of samples in testing dataset | SVM kernel used | Number of folds for LOOCV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12600 | 102 | 52 | 50 | 82 (42 cancer, 40 normal) | 20 (10 cancer, 10 normal) | Linear | 41 |

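| Algorithm | % accuracy | Sensitivity/recall | Specificity | F1-score | G-mean |
| --- | --- | --- | --- | --- | --- |
| Proposed approach | 100 (with top 5 genes) | 1 | 1 | 1 | 1 |
| PSO based graph theoretic approach | 91 | 0.91 | 0.92 | 0.91 | 0.91 |
| DWFS using KNN classifier | 86 | 0.87 | 0.85 | 0.86 | 0.86 |
| DWFS using NBC classifier | 80 | 0.76 | 0.85 | 0.78 | 0.80 |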


Leukaemia (DLBCL-FL): missing values inserted (row, column), old value, and estimated value with mem = 0.55

| Cancer: at row | Cancer: at column | Cancer: Old_value | Cancer: estimated value | Normal: at row | Normal: at column | Normal: Old_value | Normal: estimated value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 28 | 50 | −0.188603483 | −0.203477278 | 28 | 2 | 3.886811534 | 0.84848513 |
| 447 | 45 | 0.103798973 | 0.304241026 | 447 | 18 | −0.146274929 | 0.009235547 |
| 546 | 8 | 1.055798216 | 1.246420135 | 546 | 8 | 0.140697423 | 0.320415944 |
| 640 | 52 | 0.052220359 | −0.311252744 | 640 | 7 | 0.12552933 | 0.364757746 |
| 913 | 5 | 0.15759485 | 0.180411915 | 913 | 2 | −0.135855848 | −0.165390479 |
| 1129 | 4 | 1.051658222 | 0.598949762 | 1129 | 10 | 0.04903928 | 0.096097027 |
| 1142 | 31 | −0.249901255 | −0.311776993 | 1142 | 1 | −0.249625598 | −0.196558516 |
| 1293 | 34 | 1.91037315 | 2.330011528 | 1293 | 3 | 0.27829269 | 0.224522592 |
| 1553 | 48 | −0.218067263 | −0.107003318 | 1553 | 9 | −0.00577268 | 0.602953445 |
| 1731 | 26 | 0.555312243 | 0.39207635 | 1731 | 7 | 0.448099319 | 0.524633075 |
| 2062 | 22 | 2.366047008 | 4.45272374 | 2062 | 13 | 0.201613611 | 0.849245039 |
| 2929 | 17 | 0.072886107 | 0.125837311 | 2929 | 11 | −0.007410482 | 0.004016071 |
| 3965 | 56 | 0.137340738 | −0.905551574 | 3965 | 16 | −0.133906259 | −0.083702381 |
| 3969 | 36 | 2.328255836 | 2.60971631 | 3969 | 6 | 0.520731724 | 0.584352541 |
| 4124 | 17 | 0.077170048 | 0.183068221 | 4124 | 5 | −0.157303536 | −0.036843328 |
| 4135 | 10 | −0.35554626 | −0.339641516 | 4135 | 10 | −0.232750407 | −0.161598404 |
| 4143 | 14 | 0.185986189 | −0.024609518 | 4143 | 7 | −0.032712551 | 0.024485269 |
| 4233 | 18 | 3.990305253 | 1.582890447 | 4233 | 8 | 0.770838584 | 0.706726475 |
| 4313 | 11 | 0.453186971 | 0.126025314 | 4313 | 15 | −0.129336617 | −0.079142717 |
| 4510 | 22 | −0.167456263 | −0.135667934 | 4510 | 17 | −0.046588554 | −0.100490086 |
| 5327 | 46 | −0.415159893 | −0.324486992 | 5327 | 12 | −0.146934163 | −0.22684469 |
| 6120 | 3 | 4.432535499 | 4.497409395 | 6120 | 3 | 0.919305283 | 0.723423487 |
| 6417 | 7 | −0.253214481 | −0.312493286 | 6417 | 10 | −0.141004928 | −0.007636336 |
| 6434 | 42 | −0.220525385 | −0.208472395 | 6434 | 14 | −0.001659417 | 0.063668162 |
| 6756 | 55 | 2.0189449 | 1.307321937 | 6756 | 4 | −0.009878439 | 0.570901233 |


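Ranking

| Rank | Gene indices prior to missing value insertion | Gene indices after missing value insertion |
| --- | --- | --- |
| 1 | 447 | 447 |
| 2 | 913 | 4135 |
| 3 | 4135 | 913 |
| 4 | 546 | 640 |
| 5 | 2929 | 2929 |
| 6 | 640 | 1142 |
| 7 | 4510 | 546 |
| 8 | 3969 | 3969 |
| 9 | 5327 | 4510 |
| 10 | 6756 | 4233 |
| 11 | 4233 | 5327 |
| 12 | 4313 | 4313 |
| 13 | 6120 | 6756 |
| 14 | 1142 | 6120 |
| 15 | 1129 | 1553 |
| 16 | 1731 | 1129 |
| 17 | 4124 | 6417 |
| 18 | 3965 | 4124 |
| 19 | 1293 | 1731 |
| 20 | 6434 | 1293 |
| 21 | 28 | 6434 |
| 22 | 4143 | 28 |
| 23 | 2062 | 2062 |
| 24 | 1553 | 4094 |
| 25 | 6417 | 1984 |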

Number of common genes in top 25 positions = 23
% of common genes = 92


Dataset name: DLBCL-FL

| Number of genes | Number of samples | Number of samples in Cancer dataset | Number of samples in normal dataset | Number of samples in training dataset | Number of samples in testing dataset | SVM kernel used | Number of folds for LOOCV |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7070 | 77 | 58 | 19 | 62 (47 DLBCL, 15 FL) | 15 (11 DLBCL, 4 FL) | Linear | 31 |

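| Algorithm | % accuracy | Sensitivity/recall | Specificity | F1-score | G-mean |
| --- | --- | --- | --- | --- | --- |
| Proposed approach | 100 (with top 5, 10, 15, and 20 genes) | 1 | 1 | 1 | 1 |
| PSO based graph theoretic approach | 94 | 0.95 | 0.94 | 0.89 | 0.94 |
| DWFS using KNN classifier | 91 | 0.97 | 0.85 | 0.94 | 0.9 |
| DWFS using NBC classifier | 96 | 1 | 0.9 | 0.98 | 0.94 |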

5. Conclusion and Future Scope

The proposed modified version of the LRFDVImpute technique has been tested on the dataset from Spellman et al. [26], where it outperforms some state-of-the-art methods. The plots of RMSE versus membership grade show that the modified version is equivalent to or better than the earlier version for the alpha and cdc15 datasets. For the cdc28 dataset, however, the earlier version gives better results, and for the elu dataset both versions reach zero error. For both versions, a membership grade between 0.55 and 0.65 produces the minimum error, and any value in this range can be used as a threshold in fresh experiments.
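
As a reminder of the error measure, the sketch below computes a plain root mean squared error between original and estimated values. It is an illustration only: the exact RMSE definition used in the experiments is given earlier in the paper and may include a normalization step that is not reproduced here. The three value pairs are taken from the first Cancer rows of the Colorectal table (mem = 0.55), as read from the flattened source, and the function name is hypothetical.

```python
import math

def rmse(actual, estimated):
    """Plain root mean squared error between two equal-length sequences."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Old values and estimated values for three inserted missing entries.
old_values = [0.677252397, 1.56686079, 0.787510895]
estimated_values = [0.703080766, 1.755118642, 1.003134829]

error = rmse(old_values, estimated_values)
print(round(error, 4))
```

Repeating this computation for each candidate membership grade and keeping the grade with the smallest error is one way to reproduce an RMSE versus membership grade curve.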

The validation of the missing value estimation shows that most of the top ranked genes remain the same, before and after imputation, which implies that the proposed modified LRFDVImpute technique has been accurate in estimating the unknown values.

As future work, we would like to analyze the effect of using quadratic regression to estimate missing values and of applying data cleaning techniques before imputation, which may remove outliers (if any) and further reduce the error margin. For gene ranking, we wish to analyze the effects of different GA parameter settings, observe the ranking and classification results obtained using SVM with other kernels, and compare them with results reported in the literature. We would also like to modify our algorithms to make the ranking more efficient and to find the most significant genes that correctly identify the subtypes of a particular type of cancer. For the Leukemia dataset [16], this could mean identifying the B-cell and T-cell lineages of the acute lymphoblastic leukemia (ALL) samples.

Competing Interests

The authors declare that they have no competing interests.

References

  1. O. Troyanskaya, M. Cantor, G. Sherlock et al., “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001.
  2. S. Saha, P. K. Singh, and K. N. Dey, “Missing value estimation in DNA microarrays using linear regression and fuzzy approach,” in Proceedings of the 4th International Conference on Advances in Computer Science and Application (CSA '15), pp. 62–70, World Scientific, Thiruvananthapuram, India, October 2015.
  3. S. Saha, K. N. Dey, R. Dasgupta, A. Ghose, and K. Mullick, “Missing value estimation in DNA microarrays using B-splines,” Journal of Medical and Bioengineering, vol. 2, no. 2, pp. 88–92, 2013.
  4. L. C. Crossman, M. Mori, Y.-C. Hsieh et al., “In chronic myeloid leukemia white cells from cytogenetic responders and non-responders to imatinib have very similar gene expression signatures,” Haematologica, vol. 90, no. 4, pp. 459–464, 2005.
  5. Graham Hole Research Skills, The Wilcoxon Test, Version 1.0, 2011.
  6. O. Alter, P. O. Brown, and D. Botstein, “Singular value decomposition for genome-wide expression data processing and modeling,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 18, pp. 10101–10106, 2000.
  7. G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
  8. H. Kim, G. H. Golub, and H. Park, “Missing value estimation for DNA microarray gene expression data: local least squares imputation,” Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005.
  9. S. Oba, M.-A. Sato, I. Takemasa, M. Monden, K.-I. Matsubara, and S. Ishii, “A Bayesian missing value estimation method for gene expression profile data,” Bioinformatics, vol. 19, no. 16, pp. 2088–2096, 2003.
  10. Z. Bar-Joseph, G. K. Gerber, D. K. Gifford, T. S. Jaakkola, and I. Simon, “Continuous representations of time-series gene expression data,” Journal of Computational Biology, vol. 10, no. 3-4, pp. 341–356, 2003.
  11. S. Chakraborty, S. Saha, and K. Dey, “Missing value estimation in DNA microarray: a fuzzy approach,” International Journal of Artificial Intelligence and Neural Networks (IJAINN), vol. 2, no. 1, 2012.
  12. C. Yoo, I. B. Lee, and P. A. Vanrolleghem, “Interpreting patterns and analysis of acute leukemia gene expression data by multivariate fuzzy statistical analysis,” Computers & Chemical Engineering, vol. 29, no. 6, pp. 1345–1356, 2005.
  13. L. E. Peterson and M. A. Coleman, “Comparison of gene identification based on artificial neural network pre-processing with k-means cluster and principal component analysis,” in Fuzzy Logic and Applications, I. Bloch, A. Petrosino, and A. G. B. Tettamanzi, Eds., vol. 3849 of Lecture Notes in Computer Science, pp. 267–276, 2006.
  14. C. Liao, S. Li, and Z. Luo, “Gene selection using Wilcoxon rank sum test and support vector machine for cancer classification,” in Computational Intelligence and Security, Y. Wang, Y.-M. Cheung, and H. Liu, Eds., vol. 4456 of Lecture Notes in Computer Science, pp. 57–66, 2007.
  15. M. West, C. Blanchette, H. Dressman et al., “Predicting the clinical status of human breast cancer by using gene expression profiles,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 20, pp. 11462–11467, 2001.
  16. T. R. Golub, D. K. Slonim, P. Tamayo et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
  17. E. B. Huerta, B. Duval, and J.-K. Hao, “A hybrid GA/SVM approach for gene selection and classification of microarray data,” in Applications of Evolutionary Computing, F. Rothlauf, J. Branke, S. Cagnoni et al., Eds., vol. 3907 of Lecture Notes in Computer Science, pp. 34–44, Springer, Berlin, Germany, 2006.
  18. U. Alon, N. Barka, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
  19. K. C. Mondal, A. Mukhopadhyay, U. Maulik, S. Bandhyapadhyay, and N. Pasquier, “MOSCFRA: a multi-objective genetic approach for simultaneous clustering and gene ranking,” in Computational Intelligence Methods for Bioinformatics and Biostatistics, R. Rizzo and P. J. G. Lisboa, Eds., vol. 6685 of Lecture Notes in Computer Science, pp. 174–187, Springer, Berlin, Germany, 2011.
  20. K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
  21. R. M. Luque-Baena, D. Urda, J. L. Subirats, L. Franco, and J. M. Jerez, “Analysis of cancer microarray data using constructive neural networks and genetic algorithms,” in Proceedings of the 1st International Work-Conference on Bioinformatics and Biomedical Engineering-IWBBIO, Granada, Spain, March 2013.
  22. M. Mandal and A. Mukhopadhyay, “A novel PSO-based graph-theoretic approach for identifying most relevant and non-redundant gene markers from gene expression data,” International Journal of Parallel, Emergent and Distributed Systems, vol. 30, no. 3, pp. 175–192, 2015.
  23. O. Soufan, D. Kleftogiannis, P. Kalnis, and V. B. Bajic, “DWFS: a wrapper feature selection tool based on a parallel genetic algorithm,” PLoS ONE, vol. 10, no. 2, Article ID e0117988, 2015.
  24. J. H. Holland, Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, UK, 2nd edition, 1970.
  25. Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, New York, NY, USA, 3rd edition, 1996.
  26. P. T. Spellman, G. Sherlock, M. Q. Zhang et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular Biology of the Cell, vol. 9, no. 12, pp. 3273–3297, 1998.
  27. A. Khamas, T. Ishikawa, K. Shimokawa et al., “Screening for epigenetically masked genes in colorectal cancer using 5-aza-2′-deoxycytidine, microarray and gene expression profile,” Cancer Genomics and Proteomics, vol. 9, no. 2, pp. 67–75, 2012.
  28. T. Sato, A. Kaneda, S. Tsuji et al., “PRC2 overexpression and PRC2-target gene repression relating to poorer prognosis in small cell lung cancer,” Scientific Reports, vol. 3, article 1911, 2013.
  29. D. Singh, P. G. Febbo, K. Ross et al., “Gene expression correlates of clinical prostate cancer behavior,” Cancer Cell, vol. 1, no. 2, pp. 203–209, 2002.
  30. M. A. Shipp, K. N. Ross, P. Tamayo et al., “Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning,” Nature Medicine, vol. 8, no. 1, pp. 68–74, 2002.
  31. M. H. Cheok, W. Yang, C.-H. Pui et al., “Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells,” Nature Genetics, vol. 34, no. 1, pp. 85–90, 2003.
  32. J. C. Chang, E. C. Wooten, A. Tsimelzon et al., “Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer,” The Lancet, vol. 362, no. 9381, pp. 362–369, 2003.

Copyright © 2016 Sujay Saha et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

