Abstract

The naïve Bayes classifier (NBC) is one of the most popular classifiers for class prediction or pattern recognition from microarray gene expression data (MGED). However, with the classical estimates of its location and scale parameters, it is very sensitive to outliers, which is one of the most important drawbacks of the classical NBC for gene expression data analysis. Gene expression datasets are often contaminated by outliers due to the several steps involved in the data generating process, from hybridization of DNA samples to image analysis. Therefore, in this paper, an attempt is made to robustify the Gaussian NBC by the minimum β-divergence method. The role of the minimum β-divergence method in this article is to produce robust estimators of the location and scale parameters based on the training dataset and to detect and modify outliers in the test dataset. The performance of the proposed method depends on the tuning parameter β, and it reduces to the traditional naïve Bayes classifier when β → 0. We investigated the performance of the proposed beta naïve Bayes classifier (β-NBC) in comparison with some popular existing classifiers (NBC, KNN, SVM, and AdaBoost) using both simulated and real gene expression datasets. We observed that the proposed method improves the performance over the others in presence of outliers; otherwise, it keeps almost equal performance.

1. Introduction

Classification is a supervised learning approach for separating multivariate data into various source populations. It has been playing significant roles in bioinformatics through class prediction or pattern recognition from molecular OMICS datasets. Microarray gene expression data analysis is one of the most important OMICS research wings of bioinformatics [1]. Several classification and clustering approaches have been addressed previously for analyzing MGED [2–11]. The Gaussian linear Bayes classifier (LBC) is one of the most popular classifiers for class prediction or pattern recognition. However, it is not so popular for microarray gene expression data analysis, since it suffers from the inversion problem of its covariance matrix in presence of a large number of genes (p) with a small number of patients/samples (n) in the training dataset. The Gaussian naïve Bayes classifier (NBC) overcomes this difficulty of the Gaussian LBC by taking normality and independence assumptions on the variables. If these two assumptions are violated, then the nonparametric version of NBC is suggested in [12]. In this case the nonparametric classification methods work well, but they produce poor performance for small sample sizes or in presence of outliers. In MGED, small sample sizes are common because of cost and limited specimen availability [13]. There are some other versions of NBC as well [14, 15]. However, none of them are robust against outliers, which is one of the most important drawbacks of the existing NBC for gene expression data analysis. The gene expression dataset is often contaminated by outliers due to the several steps involved in the data generating process, from hybridization of DNA samples to image analysis. Therefore, in this paper, an attempt is made to robustify the Gaussian NBC by the minimum β-divergence method in two steps. At step 1, the minimum β-divergence method [16–18] is used to estimate the parameters of the Gaussian NBC based on the training dataset. At step 2, an attempt is made to detect outlying data vectors in the test dataset using the β-weight function. Then criteria are proposed to detect the outlying components in a test data vector and to modify the outlying components with reasonable values. It will be observed that the performance of the proposed method depends on the tuning parameter β and that it reduces to the traditional Gaussian NBC when β → 0. Therefore, we call the proposed classifier β-NBC.

An attempt is made to investigate the robustness performance of the proposed β-NBC in comparison with several versions of robust linear classifiers based on the M-estimator [19, 20], the MCD (Minimum Covariance Determinant) and MVE (Minimum Volume Ellipsoid) estimators [21, 22], the Orthogonalized Gnanadesikan-Kettenring (OGK) estimator including MCD-A, MCD-B, and MCD-C [23], and Feasible Solution Algorithm (FSA) classifiers [24–26]. We observed that the proposed β-NBC outperforms the existing robust linear classifiers mentioned earlier. Then we investigate the performance of the proposed method in comparison with some popular classifiers including Support Vector Machine (SVM), k-nearest neighbors (KNN), and AdaBoost, which are widely used in gene expression data analysis [27–29]. We observed that the proposed method improves the performance over the others in presence of outliers; otherwise, it keeps almost equal performance.

2. Methodology

2.1. Naïve Bayes Classifier

The naïve Bayes classifiers (NBCs) [30] are a family of probabilistic classifiers based on Bayes' theorem with independence and normality assumptions among the variables. The common rule of NBCs is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. Assume that we have a training sample of vectors x_{jk} = (x_{1jk}, x_{2jk}, \dots, x_{pjk})^T of size n_k for k = 1, 2, \dots, m, where x_{ijk} denotes the jth observation of the ith variable in the kth population/class \omega_k. Then the NBCs assign a class label to an unlabeled data vector x = (x_1, x_2, \dots, x_p)^T for some k as follows:

    \hat{y}(x) = \arg\max_{k \in \{1, 2, \dots, m\}} \pi_k f_k(x),    (1)

where \pi_k is the prior probability of class \omega_k. For the Gaussian NBC, the density function of the kth population/class can be written as

    f_k(x) = \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left\{-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right\},    (2)

where \theta_k = (\mu_k, \Sigma_k); here \mu_k = (\mu_{1k}, \mu_{2k}, \dots, \mu_{pk})^T is the mean vector and the diagonal covariance matrix is

    \Sigma_k = \mathrm{diag}(\sigma_{1k}^2, \sigma_{2k}^2, \dots, \sigma_{pk}^2).    (3)
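For concreteness, a minimal R sketch of the Gaussian NBC MAP rule (1)-(3) is given below; it is only an illustration under assumed inputs (the function name nbc_predict and the matrix layout are not taken from the paper's supplementary code).

    # Minimal sketch of the Gaussian NBC MAP rule (1)-(3); hypothetical input layout:
    # 'mu' and 'sigma2' are m x p matrices of class means and variances, 'prior' is a
    # length-m vector of prior probabilities, and 'x' is a length-p test vector.
    nbc_predict <- function(x, mu, sigma2, prior) {
      log_post <- sapply(seq_along(prior), function(k) {
        log(prior[k]) + sum(dnorm(x, mean = mu[k, ], sd = sqrt(sigma2[k, ]), log = TRUE))
      })
      which.max(log_post)  # MAP class label (the k maximizing pi_k * f_k(x))
    }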

2.2. Maximum Likelihood Estimators (MLEs) for the Gaussian NBC

We assume that the prior probabilities \pi_k are known and that the maximum likelihood estimators (MLEs) \hat{\mu}_k and \hat{\Sigma}_k of \mu_k and \Sigma_k are obtained based on the training dataset as follows:

    \hat{\mu}_{ik} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_{ijk},    (4)

    \hat{\sigma}_{ik}^2 = \frac{1}{n_k} \sum_{j=1}^{n_k} (x_{ijk} - \hat{\mu}_{ik})^2,    (5)

    \hat{\Sigma}_k = \mathrm{diag}(\hat{\sigma}_{1k}^2, \hat{\sigma}_{2k}^2, \dots, \hat{\sigma}_{pk}^2),    (6)

where i = 1, 2, \dots, p; j = 1, 2, \dots, n_k; and k = 1, 2, \dots, m.
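A short R sketch of these class-wise MLEs follows; it assumes a samples-by-variables matrix X and a label vector y, and the helper name nbc_mle is hypothetical.

    # Sketch of the MLEs (4)-(6): class-wise means and variances with divisor n_k.
    nbc_mle <- function(X, y) {
      classes <- sort(unique(y))
      mu <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))
      sigma2 <- t(sapply(classes, function(k) {
        Xk <- X[y == k, , drop = FALSE]
        colMeans(sweep(Xk, 2, colMeans(Xk))^2)  # MLE variance uses 1/n_k, not 1/(n_k - 1)
      }))
      list(classes = classes, mu = mu, sigma2 = sigma2)
    }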

It is obvious from (1)-(2) that the Gaussian NBC depends on the mean vectors (\mu_k) and the diagonal covariance matrices (\Sigma_k), which are estimated by the maximum likelihood estimators (MLEs) given in (4)–(6) based on the training dataset. Therefore, the MLE based Gaussian NBC produces misleading results in presence of outliers in the datasets. To get rid of this problem, an attempt is made to robustify the Gaussian NBC by the minimum β-divergence method [16–18].

2.3. Robustification of Gaussian NBC by the Minimum β-Divergence Method (Proposed)
2.3.1. Minimum β-Divergence Estimators for the Gaussian NBC

Let g(x) be the true density and f_{\theta_k}(x) be the model density for the kth population; then the β-divergence between the two p.d.f.s can be defined by

    D_\beta(g, f_{\theta_k}) = \int \left\{ \frac{1}{\beta}\left[g^\beta(x) - f_{\theta_k}^\beta(x)\right] g(x) - \frac{1}{\beta+1}\left[g^{\beta+1}(x) - f_{\theta_k}^{\beta+1}(x)\right] \right\} dx    (7)

for β > 0, with D_\beta(g, f_{\theta_k}) \ge 0. Equality holds if and only if g(x) = f_{\theta_k}(x) for all x. When β tends to zero, the β-divergence reduces to the Kullback-Leibler (K-L) divergence; that is,

    \lim_{\beta \to 0} D_\beta(g, f_{\theta_k}) = D_{KL}(g, f_{\theta_k}) = \int g(x) \log\frac{g(x)}{f_{\theta_k}(x)}\, dx.    (8)

The minimum β-divergence estimator is defined by

    \hat{\theta}_k = \arg\min_{\theta_k} D_\beta(\hat{g}, f_{\theta_k}),    (9)

where \hat{g} is the empirical distribution of the data. For the Gaussian density f_{\theta_k} with \theta_k = (\mu_k, \Sigma_k), the minimum β-divergence estimators \hat{\mu}_k and \hat{\Sigma}_k for the mean vector \mu_k and the diagonal covariance matrix \Sigma_k, respectively, are obtained iteratively as follows:

    \hat{\mu}_{ik}^{(t+1)} = \frac{\sum_{j=1}^{n_k} W_\beta\left(x_{ijk} \mid \hat{\mu}_{ik}^{(t)}, \hat{\sigma}_{ik}^{2(t)}\right) x_{ijk}}{\sum_{j=1}^{n_k} W_\beta\left(x_{ijk} \mid \hat{\mu}_{ik}^{(t)}, \hat{\sigma}_{ik}^{2(t)}\right)},    (10)

    \hat{\sigma}_{ik}^{2(t+1)} = \frac{\sum_{j=1}^{n_k} W_\beta\left(x_{ijk} \mid \hat{\mu}_{ik}^{(t)}, \hat{\sigma}_{ik}^{2(t)}\right) \left(x_{ijk} - \hat{\mu}_{ik}^{(t+1)}\right)^2}{(1+\beta)^{-1} \sum_{j=1}^{n_k} W_\beta\left(x_{ijk} \mid \hat{\mu}_{ik}^{(t)}, \hat{\sigma}_{ik}^{2(t)}\right)},    (11)

where

    W_\beta\left(x_{ijk} \mid \mu_{ik}, \sigma_{ik}^2\right) = \exp\left\{-\frac{\beta}{2} \frac{(x_{ijk} - \mu_{ik})^2}{\sigma_{ik}^2}\right\}.    (12)

The formulation of (10)–(12) is straightforward, as described in the previous works [17, 18]. The function W_\beta in (12) is called the β-weight function, which plays the key role in the robust estimation of the parameters. If β tends to 0, then (10) and (11) reduce to the classical noniterative estimates of the mean and diagonal covariance matrix given in (4) and (6), respectively. The performance of the proposed method depends on the value of the tuning parameter β and the initialization of the Gaussian parameters (\mu_k, \Sigma_k).
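As a rough illustration of this iterative scheme, the following R sketch estimates the mean and variance of a single variable within one class, taking the update forms (10)-(12) as written above; it is a sketch under those assumptions, not the authors' supplementary implementation.

    # beta-weight function (12) and iterative minimum beta-divergence estimates (10)-(11)
    # for one variable in one class; 'x' holds the training observations x_ijk over j.
    beta_weight <- function(x, mu, s2, beta) exp(-beta * (x - mu)^2 / (2 * s2))

    beta_estimate <- function(x, beta, mu0 = median(x), s20 = 1, tol = 1e-6, maxit = 100) {
      mu <- mu0; s2 <- s20
      for (it in seq_len(maxit)) {
        w <- beta_weight(x, mu, s2, beta)
        mu_new <- sum(w * x) / sum(w)                               # update (10)
        s2_new <- sum(w * (x - mu_new)^2) / (sum(w) / (1 + beta))   # update (11)
        if (abs(mu_new - mu) + abs(s2_new - s2) < tol) { mu <- mu_new; s2 <- s2_new; break }
        mu <- mu_new; s2 <- s2_new
      }
      list(mu = mu, sigma2 = s2)
    }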

2.3.2. Parameter Initialization and Breakdown Points of the Estimates

The mean vector is initialized by the median vector, since the mean and median coincide for the normal distribution and the median (Me) is highly robust against outliers, with a 50% breakdown point, for estimating the central value of the distribution. The median vector of the kth class/population is defined as

    \tilde{\mu}_k = \left(\mathrm{Me}_j(x_{1jk}), \mathrm{Me}_j(x_{2jk}), \dots, \mathrm{Me}_j(x_{pjk})\right)^T,    (13)

where \mathrm{Me}_j denotes the median over j = 1, 2, \dots, n_k. The diagonal covariance matrix is initialized by the identity matrix (I). The iterative procedure will converge to the optimal point of the parameters, since the initial mean vector belongs to the center of the dataset with a 50% breakdown point. The proposed estimators can resist more than 50% contamination if we can initialize the mean vector by a vector that belongs to the good part of the dataset and the variance-covariance matrix by the identity matrix (I). More discussion about high breakdown points for the minimum β-divergence estimators can be found in [18].
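Continuing the sketch above, the initialization of this section (median start for the mean, unit start for the variance) can be exercised as follows; the contamination pattern and the choice beta = 0.2 are arbitrary illustrative assumptions.

    # Illustrative use of the median / unit-variance initialization with a contaminated sample.
    set.seed(1)
    x_clean  <- rnorm(50, mean = 5, sd = 1)
    x_contam <- c(x_clean, rnorm(5, mean = 25, sd = 1))   # 5 gross outliers appended
    est <- beta_estimate(x_contam, beta = 0.2, mu0 = median(x_contam), s20 = 1)
    c(robust_mean = est$mu, classical_mean = mean(x_contam))  # robust estimate stays near 5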

2.3.3. β-Selection Using T-Fold Cross Validation (CV) for Parameter Estimation

To select an appropriate β by CV, we fix a grid of candidate values for the tuning parameter β. The computation steps for selecting an appropriate β by T-fold cross validation are given below.

Step 1. The dataset D is split into T disjoint subsets D_1, D_2, \dots, D_T, where D = \bigcup_{t=1}^{T} D_t and D_t \cap D_{t'} = \emptyset for t \ne t'.

Step 2. Let D_{-t} = D \setminus D_t for t = 1, 2, \dots, T.

Step 3. Estimate \hat{\mu}_k and \hat{\Sigma}_k iteratively by (10)-(11) based on the dataset D_{-t}.

Step 4. Compute CV(t) using the dataset D_t, for t = 1, 2, \dots, T.

Step 5. End.

A suitable β is then computed by

    \hat{\beta} = \arg\min_{\beta} \mathrm{CV}(\beta), \quad \text{where } \mathrm{CV}(\beta) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{CV}(t).    (14)

If the sample size (n_k) is small, then T = n (leave-one-out CV) can be used to select the appropriate β. More discussion about β selection can also be found in [16–18].
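A possible R implementation of this T-fold CV search over a grid of β values is sketched below; the grid, the use of the misclassification rate as CV(t), and the helpers train_beta_nbc()/predict_beta_nbc() (built from the estimators and decision rule above) are all assumptions for illustration.

    # Hedged sketch of beta-selection by T-fold cross validation (Steps 1-5 above).
    select_beta_cv <- function(X, y, beta_grid = c(0.05, 0.1, 0.2, 0.3, 0.5), T = 5) {
      folds <- sample(rep(seq_len(T), length.out = nrow(X)))        # Step 1: split into T subsets
      cv_err <- sapply(beta_grid, function(beta) {
        mean(sapply(seq_len(T), function(t) {
          train <- folds != t                                       # Step 2: D_{-t}
          fit  <- train_beta_nbc(X[train, , drop = FALSE], y[train], beta)  # Step 3: estimates
          pred <- predict_beta_nbc(fit, X[!train, , drop = FALSE])
          mean(pred != y[!train])                                   # Step 4: CV(t) on D_t
        }))
      })
      beta_grid[which.min(cv_err)]                                  # eq. (14): beta with minimum CV
    }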

2.3.4. Outlier Identification Using the β-Weight Function

The performance of the NBC for the classification of an unlabeled data vector x = (x_1, x_2, \dots, x_p)^T using (1) depends not only on the robust estimation of the parameters but also on whether x itself is contaminated or not. The data vector x is said to be contaminated if at least one component of x is contaminated by an outlier. To derive a criterion for whether the unlabeled data vector x is contaminated or not, we consider the β-weight function (12) and rewrite it for the data vector x as follows:

    W_\beta\left(x \mid \hat{\mu}_k, \hat{\Sigma}_k\right) = \exp\left\{-\frac{\beta}{2} \left(x - \hat{\mu}_k\right)^T \hat{\Sigma}_k^{-1} \left(x - \hat{\mu}_k\right)\right\}.    (15)

The values of this weight function lie between 0 and 1. This weight function produces a larger weight (but less than 1) if x is consistent with the kth population and a smaller weight (but greater than 0) if x is far from the kth population or contaminated by outliers. Therefore, the β-weight function (15) can be characterized as

    W_\beta\left(x \mid \hat{\mu}_k, \hat{\Sigma}_k\right) \approx 1 \quad \text{if } x \text{ is a usual data vector with respect to class } \omega_k,    (16)

    W_\beta\left(x \mid \hat{\mu}_k, \hat{\Sigma}_k\right) \approx 0 \quad \text{if } x \text{ is an unusual/outlying data vector with respect to class } \omega_k.    (17)

The threshold value \delta can be determined based on the empirical distribution of the β-weight function, as discussed in [31], and by the quantile values of W_\beta\left(x_{jk} \mid \hat{\mu}_k, \hat{\Sigma}_k\right), j = 1, 2, \dots, n_k, with probability

    \Pr\left[ W_\beta\left(x_{jk} \mid \hat{\mu}_k, \hat{\Sigma}_k\right) \le \delta \right] = \alpha,    (18)

where \alpha is the probability for selecting the cut-off value, and the value of \alpha should lie between 0.00 and 0.05. In this paper, we choose \alpha heuristically to fix the cut-off value for the detection of outlying data vectors using (18). This idea was first introduced in [31].

Then the criterion for whether the unlabeled data vector x is contaminated or not can be defined as follows: x is treated as usual if W_\beta\left(x \mid \hat{\mu}_k, \hat{\Sigma}_k\right) \ge \delta for at least one k \in \{1, 2, \dots, m\}, and contaminated otherwise, where \delta is the threshold value.

However, in this paper, we directly choose the threshold value \delta as follows:

    \delta = \alpha\text{-quantile of } \left\{ W_\beta\left(x \mid \hat{\mu}_k, \hat{\Sigma}_k\right) : x \in D^* \right\},    (19)

with \alpha chosen heuristically, where D^* is the training dataset including the unclassified data vector x. Equation (19) was also used in the previous works [16, 18] to choose the threshold value for outlier detection.
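The following R sketch illustrates the β-weight-based outlier flagging of (15)-(19) for one class; the diagonal-covariance weight form, the quantile level alpha, and the function names are assumptions.

    # beta-weight (15) for a whole data vector (diagonal covariance) and a quantile threshold (18)-(19).
    beta_weight_vec <- function(x, mu, sigma2, beta)
      exp(-(beta / 2) * sum((x - mu)^2 / sigma2))

    is_outlying <- function(x_new, X_train, mu, sigma2, beta, alpha = 0.05) {
      w_all <- apply(rbind(X_train, x_new), 1, beta_weight_vec,
                     mu = mu, sigma2 = sigma2, beta = beta)   # weights over D* (training + x_new)
      delta <- quantile(w_all, probs = alpha)                 # empirical alpha-quantile threshold
      unname(beta_weight_vec(x_new, mu, sigma2, beta) < delta)  # TRUE if flagged as outlying
    }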

2.3.5. Classification by the Proposed β-NBC

When the unlabeled data vector x is usual, the appropriate label/class of x can be determined using the minimum β-divergence estimators of (\mu_k, \Sigma_k) in the predicting equation (1). If the unlabeled data vector x is unusual/contaminated by outliers, then we propose a classification rule as follows. We compute the absolute difference between the outlying vector x and each of the mean vectors \hat{\mu}_k as

    d_k = \left( |x_1 - \hat{\mu}_{1k}|, |x_2 - \hat{\mu}_{2k}|, \dots, |x_p - \hat{\mu}_{pk}| \right)^T, \quad k = 1, 2, \dots, m.    (20)

We then compute the sum of the smallest r components of d_k as S_k, where r is obtained by rounding a prespecified fraction of p. Then the unlabeled test data vector x can be classified as

    x \in \omega_k \quad \text{if } S_k = \min_{1 \le l \le m} S_l.    (21)

If the outlying test vector x is classified into class \omega_k, then its ith component x_i is said to be outlying if |x_i - \hat{\mu}_{ik}| exceeds a suitable cut-off. Then we update x by replacing its outlying components with the corresponding mean components from the mean vector \hat{\mu}_k of the kth population. Let x^* be the updated vector of x. Then we use x^* instead of x to confirm the label/class of x using (1).
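A hedged R sketch of this rule for a test vector already flagged as outlying is given below; r = round(p/2) and the 3-standard-deviation componentwise cut-off are illustrative assumptions, and nbc_predict() is the decision-rule sketch from Section 2.1.

    # Sketch of (20)-(21): classify an outlying test vector, then repair its outlying components.
    classify_outlying <- function(x, mu, sigma2, prior, cutoff = 3) {
      p <- length(x)
      r <- round(p / 2)                                          # assumed fraction of components used
      d <- abs(sweep(mu, 2, x))                                  # |x_i - mu_ik| for every class k (eq. 20)
      S <- apply(d, 1, function(dk) sum(sort(dk)[seq_len(r)]))   # sum of the r smallest components
      k <- which.min(S)                                          # tentative class (eq. 21)
      outlying <- abs(x - mu[k, ]) > cutoff * sqrt(sigma2[k, ])  # assumed componentwise cut-off
      x_star <- ifelse(outlying, mu[k, ], x)                     # replace outlying components by means
      nbc_predict(x_star, mu, sigma2, prior)                     # confirm the class via (1)
    }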

3. Simulation Study

3.1. Simulated Dataset 1

To investigate the performance of our proposed classifier (β-NBC) in comparison with four popular classifiers (KNN, NBC, SVM, and AdaBoost), we generated both training and test datasets from multivariate normal distributions with different mean vectors (\mu_1, \mu_2) of length p but a common covariance matrix (\Sigma_1 = \Sigma_2 = \Sigma). In this simulation study, we generated n_1 samples from the first population and n_2 samples from the second population for both the training and test datasets. We computed the training and test error rates for all five classifiers using both the original and contaminated datasets with different mean vectors, where the other parameters remain the same for each dataset. For convenience of presentation, we distinguish the two mean vectors in such a way that the second mean vector is generated by adding a constant t to each of the components of the first mean vector.
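A small R generator in the spirit of this setup might look as follows; the sample sizes, dimension, shift t, and contamination mechanism are illustrative assumptions, not the exact configuration of the study.

    # Illustrative generator for simulated dataset 1: two Gaussian classes whose mean vectors
    # differ by a constant t, with an optional fraction of rows turned into outliers.
    library(MASS)  # for mvrnorm()
    make_sim1 <- function(n1 = 50, n2 = 50, p = 10, t = 1, outlier_frac = 0) {
      mu1 <- rep(0, p); mu2 <- mu1 + t; Sigma <- diag(p)
      X <- rbind(mvrnorm(n1, mu1, Sigma), mvrnorm(n2, mu2, Sigma))
      y <- rep(1:2, times = c(n1, n2))
      if (outlier_frac > 0) {
        idx <- sample(nrow(X), size = round(outlier_frac * nrow(X)))
        X[idx, ] <- X[idx, ] + 10  # crude mean-shift contamination
      }
      list(X = X, y = y)
    }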

3.2. Simulated Dataset 2

To investigate the performance of the proposed classifier (β-NBC) in comparison with the classical NBC for the classification of objects into two groups, let us consider a model for generating gene expression datasets as displayed in Table 1, which was also used by Nowak and Tibshirani [32]. In Table 1, the first column represents the gene expressions of normal individuals and the second column represents the gene expressions of patient individuals. The first row represents the genes from the first group and the second row represents the genes from the second group. To randomize the gene expressions, Gaussian noise is added. First we generate a training gene-set using the data generating model (Table 1), in which one set of genes is generated for the first group and another set for the second group, with normal individuals and patients (e.g., cancer or any other disease). Then we generate a test gene-set using the same model with the same parameters as before, where genes are again generated for the two groups with normal individuals and patients.

3.3. Simulated Dataset 3

To demonstrate the performance of the proposed classifier (β-NBC) in comparison with some other robust linear classifiers based on the robust estimators (MCD, MVE, OGK, MCD-A, MCD-B, MCD-C, and FSA) mentioned earlier for the classification of objects into different groups, we generated training and test datasets from m = 2 and 3 multivariate normal distributions with p = 10 and 5 variables, respectively. For the two-class case, we consider samples from two multivariate normal populations N_p(\mu_1, \Sigma) and N_p(\mu_2, \Sigma). Here the mean vectors are chosen such that the corresponding components of \mu_1 and \mu_2 differ by the scalar Ω for a subset of the variables and are equal otherwise, where Ω is the common difference between the two corresponding mean components of \mu_1 and \mu_2. Similarly, for the three-class case, we generate training and test datasets from three multivariate normal populations N_p(\mu_1, \Sigma), N_p(\mu_2, \Sigma), and N_p(\mu_3, \Sigma) with different means and a common variance-covariance matrix. In this case the mean vectors are chosen such that the corresponding mean components of \mu_1, \mu_2, and \mu_3 differ by the common scalar Ω for a subset of the variables and are equal otherwise.

3.4. Head and Neck Cancer Gene Expression Dataset

To demonstrate the performance of the proposed classifier (β-NBC) in comparison with four popular classifiers (KNN, NBC, SVM, and AdaBoost) on a real gene expression dataset, we considered the head and neck cancer (HNC) gene expression dataset from the previous work [33]. The term head and neck cancer denotes a group of biologically comparable cancers originating from the upper aerodigestive tract, including the following parts of the human body: lip, oral cavity (mouth), nasal cavity, pharynx, larynx, and paranasal sinuses. This microarray gene expression dataset contains 12626 genes, where 594 genes are differentially expressed and the rest of the genes are equally expressed.

4. Simulation and Real Data Analysis Results

4.1. Simulation Results of Dataset 1

We used simulated dataset 1 to compare the performance of the proposed method with that of the other popular classifiers, namely, the classical NBC, SVM, KNN, and AdaBoost. Figures 1(a)–1(f) represent the test error rates estimated by these five classifiers against the common mean differences in presence of 5%, 10%, 15%, 20%, and 25% outliers in the test dataset and in absence of outliers (original dataset), respectively. From Figure 1(f) it is evident that in absence of outliers every method produces almost the same result, whereas, in presence of different levels of outliers (see Figures 1(a)–1(e)), the proposed method outperformed the other methods by producing lower test error rates. Table 2 summarizes different performance measures (accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), prevalence, detection rate, detection prevalence, Matthews correlation coefficient (MCC), and misclassification error rate (MER)). All these performance measures are computed for the five methods (NBC, KNN, SVM, AdaBoost, and proposed).

From Table 2 we observed that the proposed method produces better results than the other classifiers (NBC, SVM, KNN, and AdaBoost), since it produces higher values of accuracy (>97%), sensitivity (>95%), specificity (>94%), PPV (>94%), NPV (>94%), and MCC (>94%) and lower values of prevalence and MER (<4%). The proportion test statistic [34] has been used to test the significance of the proportions produced by the five classifiers for each of the performance measures. Column 7 of Table 2 reports the p values of this test statistic. Since all the p values except that for MER are less than 0.01, we can conclude that the performance results are highly statistically significant. The MER (p value < 0.05) is also statistically significant at the 5% level of significance. So we may conclude from simulated dataset 1 that our proposed method performed better than the other classical methods for the contaminated dataset, while it keeps equal performance in absence of outliers for the original dataset.

4.2. Simulation Results of Dataset 2

To investigate the performance of the proposed classifier (β-NBC) in comparison with the classical NBC for the classification of objects into two groups, we considered simulated dataset 2. Figures 2(a) and 2(b) show the training and test datasets in absence of outliers, respectively. Here the genes are randomly allocated in the test dataset. Figures 3(a) and 3(b) show the results of the classified test dataset by the classical and proposed NBC, respectively.

From the classification results we observed that both the classical naïve Bayes procedure and the proposed method produce almost the same results, with low misclassification error rates, in absence of outliers. To investigate the robustness performance of our proposed method in comparison with the conventional naïve Bayes procedure for classification, we randomly contaminated 30% of the genes by outliers in the test gene-sets (Figures 4(a)–4(c)).

To classify samples into one of the groups using the contaminated test gene-set (Figure 4(a)), we calculated the misclassification error rates of the NBC and the proposed method. From Figure 4 we see that the traditional naïve Bayes procedure fails to achieve correct classification (Figure 4(b)), with a misclassification error rate of 34%. Then we classified the objects/patients using the proposed method, as shown in Figure 4(c). It is obvious from these figures that the classification performance of the proposed method is good, with a misclassification error rate of approximately 5% for the test gene datasets.

4.3. Simulation Results of Dataset 3

We also investigated the performance of the proposed robust naïve Bayes classifier in comparison with the classical naïve Bayes as well as robust linear classifiers based on the MVE, FSA, MCD, MCD-A, MCD-B, MCD-C, and OGK estimators of the mean vectors and covariance matrices. We computed different performance measures, such as the average true positive rate (TPR), false positive rate (FPR), area under the ROC curve (AUC), and partial AUC (pAUC), based on 50 replications of the dataset to measure the performance of all classifiers. A method is said to be better than the others if it produces larger values of TPR, AUC, and pAUC and smaller values of FPR and MER.
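As an aside, AUC and pAUC at FPR = 0.2 can be computed in R with the pROC package roughly as follows; the objects labels and scores are hypothetical inputs (true classes and classifier scores from any of the methods above).

    # Sketch of AUC and partial AUC (FPR <= 0.2, i.e., specificity in [0.8, 1]) with pROC.
    library(pROC)
    roc_obj <- roc(response = labels, predictor = scores, quiet = TRUE)
    auc_full <- auc(roc_obj)
    pauc_02  <- auc(roc_obj, partial.auc = c(1, 0.8), partial.auc.focus = "specificity")
    c(AUC = as.numeric(auc_full), pAUC = as.numeric(pauc_02))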

Table 3 shows the average values of AUC and pAUC at FPR = 0.2 based on the 50 replicated simulated datasets 3 with p = 10 variables for the two-class classification. The performance measures have been estimated by the classical, FSA, MCD, MVE, MCD-A, MCD-B, MCD-C, OGK, and proposed methods. They show the average estimates of AUC and pAUC for these classifiers using simulated dataset 3 in absence and presence of outliers. We observed that in absence of outliers all the classifiers produce almost similar results. The proposed classifier produced better results than the classical NBC and the other robust estimators in presence of different levels (5%, 10%, 15%, 20%, and 25%) of outliers. Also, MCD, MCD-A, MCD-B, and MCD-C show constant performance at a given outlier rate but vary across the different outlier rates. The ROC analysis, shown in Figures 5(a)–5(f), also supported these results, so we may conclude that the proposed method outperformed the others.

To investigate the performance of the proposed method in comparison with the other methods (classical, FSA, MCD, MVE, MCD-A, MCD-B, MCD-C, and OGK) for the multiclass classification problem, we generated simulated dataset 3 based on 50 replications with p = 5 variables. The performance measures were estimated for each of these methods. Table 4 shows the average and standard error of AUC and pAUC for the multiclass classification. It reveals that the proposed robust naïve Bayes classifier outperformed the classical and other robust linear classifiers in presence of outliers at a false positive rate of 0.2. The proposed method produces larger values of AUC and pAUC and lower values of MER and of the standard errors of the AUC and pAUC values. The performance measures for the different types of MCD estimators were constant at a given outlier rate but varied across the different levels of contamination.

4.4. Head and Neck Cancer Gene Expression Data Analysis

We also investigated the performance of the proposed method on a real microarray gene expression dataset. The normalized head and neck cancer (HNC) dataset is considered here [33]. RNA samples were extracted from 22 normal and 22 cancer tissues for generating the HNC dataset. The Affymetrix GeneChip platform was used for processing the RNA samples, yielding quantified CEL files. The Robust Multichip Average (RMA) and quantile normalization methods were used for processing the CEL files. The HNC dataset consisted of 12,642 probe sets, 44 samples, and 42 significantly differentially expressed probe sets. A detailed discussion of the preprocessing of the HNC dataset is given in [33]. We first selected the differentially expressed (DE) genes as those whose posterior probability exceeds 0.9, using the bridge R package [35]; the remaining genes were treated as equally expressed (EE). Figure 6 shows the 594 differentially expressed genes selected from the 12626 genes. We performed the Anderson-Darling (A-D) normality test [36, 37] on the HNC dataset. The results show that only a small number of DE genes (5%) for both the normal and cancer groups break the normality assumption at the 1% level of significance. We also checked the independence assumption of the DE genes using mutual information [38]. We found that the mutual information for the HNC dataset is 0.044, which is close to zero for both the normal and cancer groups. So we may conclude that the DE genes almost satisfy the independence assumption. Therefore, we may assume that the HNC dataset almost satisfies the normality and independence assumptions of the NBC for a given class/group.
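The per-gene normality screening described above could be carried out in R roughly as follows; the matrix expr (genes by samples) and the factor group are assumed inputs, and the DE selection with the bridge package is not reproduced here.

    # Sketch of the Anderson-Darling normality screening of the DE genes per group.
    library(nortest)  # provides ad.test()
    ad_pvalues <- apply(expr, 1, function(g)
      sapply(levels(group), function(cl) ad.test(g[group == cl])$p.value))
    rowMeans(ad_pvalues < 0.01)  # per-group proportion of genes violating normality at the 1% level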

For the classification problem, we considered half of the differentially expressed genes as the training gene-set and identified their groups using hierarchical clustering (HC). Figure 7 represents the dendrogram of the HC of half of the differentially expressed genes for the training data. The remaining 297 differentially expressed genes are considered as the test gene-set. Then we employed both the classical NBC and the robust NBC (β-NBC) on this dataset to classify cancer genes (see Figures 8(a)–8(d)). From Figure 8 we observed that the traditional naïve Bayes procedure cannot identify the gene groups properly, whereas our proposed method (β-NBC) performs better for identifying the gene groups in the HNC dataset. Figure 8(d) shows that the proposed classifier classifies the samples better than the classical method (Figure 8(c)).

We also computed different performance measures (accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), prevalence, detection rate, detection prevalence, Matthews correlation coefficient (MCC), and misclassification error rate) for the five classification methods (NBC, KNN, SVM, AdaBoost, and proposed) using the HNC dataset (Table 5). From Table 5 we observed that the proposed classifier produces better results than the other classifiers (NBC, SVM, KNN, and AdaBoost). The proportion test [34] shows p values < 0.01 for the different performance results, excluding MCC and MER, so we may say that they are highly statistically significant. The MCC and MER are statistically significant at the 5% level of significance because their p values are < 0.05. Hence, the performance of the proposed method on the real HNC data is better than that of the classical and other methods. Also, this dataset was reported to be contaminated by outliers in [31]. So we considered this dataset to investigate the performance of the proposed method in comparison with some popular existing classifiers, and we observed that the proposed method outperforms the others for this HNC dataset.

5. Discussion

In this paper, we discussed the robustification of the Gaussian NBC using the minimum β-divergence method in two steps. For both the simulated and real data analyses, at first, the mean vectors and the diagonal covariance matrices were computed by the minimum β-divergence estimators for the Gaussian NBC based on the training dataset. Then outlying test data vectors were detected in the test dataset using the β-weight function, and the outlying components in each such test data vector were replaced by the corresponding values of their estimated mean vectors. The modified test data vectors were then used as input data vectors in the proposed β-NBC for class prediction or pattern recognition. The rest of the data vectors from the test dataset were used directly as input data vectors in the proposed β-NBC for their class prediction or pattern recognition. We observed that the performance of the proposed method depends on the tuning parameter β and the initialization of the Gaussian parameters. Therefore, in this paper, we also discussed the initialization procedure for the Gaussian parameters and the β-selection procedure using cross validation in Sections 2.3.2 and 2.3.3, respectively. The classifier reduces to the traditional Gaussian NBC when β → 0. Therefore, we call the proposed classifier β-NBC. We investigated the robustness performance of the proposed β-NBC in comparison with several robust versions of linear classifiers based on the MCD, MVE, and OGK estimators, taking a smaller number of variables/genes (p) with a larger number of patients/samples (n) in the training dataset, since these types of robust classifiers also suffer from the inversion problem of the covariance matrix in presence of a large number of variables/genes (p) with a small number of patients/samples (n) in the training dataset. We observed that the proposed β-NBC outperforms the existing robust linear classifiers mentioned earlier in presence of outliers; otherwise, it keeps almost equal performance. Then we investigated the performance of the proposed method in comparison with some popular classifiers including Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and AdaBoost, which are widely used for gene expression data analysis [27–29]. In that comparison, we used both simulated and real gene expression datasets. We observed that the proposed method improves the performance over the others in presence of outliers; otherwise, it keeps almost equal performance as before. The main advantage of the proposed classifier over the others is that it works well under both conditions, (i) n > p and (ii) n ≤ p, and it can resist up to 50% contamination (breakdown point). If the dataset does not satisfy the normality assumptions, then the proposed method may show weaker performance than others in absence of outliers. However, a nonnormal dataset can be transformed toward normality by a suitable transformation such as the Box-Cox transformation [39]. Then the proposed method would be useful to tackle the outlying problems. The proposed method may also suffer from correlated observations. In that case, correlated observations can be transformed into uncorrelated ones using standard principal component analysis (PCA) or singular value decomposition (SVD) based PCA. Then the proposed method would again be useful to tackle the outlying problems.
However, in the current study reported in this paper, we investigated the performance of the proposed classifier (β-NBC) in comparison with some popular existing classifiers (NBC, KNN, SVM, and AdaBoost), including some robust linear classifiers (MCD, MVE, OGK, MCD-A, MCD-B, MCD-C, and FSA), using both simulated and real gene expression datasets, where the simulated datasets satisfied the normality and independence assumptions. We observed that the proposed method improved the performance over the others in presence of outliers; otherwise, it keeps almost equal performance. Gene expression datasets are often contaminated by outliers due to the several steps involved in the data generating process, from hybridization to image analysis. Therefore, the proposed method would be more suitable for gene expression data analysis.

6. Conclusion

Accurate sample class prediction or pattern recognition is one of the most significant issues for MGED analysis. The naïve Bayes classifier is an important and widely used method for class prediction in bioinformatics. However, this method suffers from outliers when estimating the location and scale parameters in MGED analysis. To overcome this, we proposed the β-NBC, which estimates robust location and scale parameters. In simulation studies 1 and 2, we showed that, in presence of outliers, the proposed β-NBC outperforms other popular classifiers when the datasets are generated from multivariate and univariate normal distributions, respectively, and that it keeps equal performance with the other classifiers in absence of outliers. We also investigated the robustness performance of the proposed β-NBC in comparison with linear classifiers based on some popular robust estimators in simulation study 3. From this simulation study we observed that the proposed β-NBC outperforms the existing robust linear classifiers. Finally, we applied it to the real HNC dataset, where our proposed β-NBC showed better performance than the other traditional classifiers. Therefore, we may conclude that, in presence of outliers, our proposed β-NBC outperforms the other methods on both simulated and real datasets.

Additional Points

Supplementary Materials. The source code is written in R and is available in the Supplementary Material online at https://doi.org/10.1155/2017/3020627.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by HEQEP Subproject (CP-3603, W2, R3), Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh.

Supplementary Materials

The R source code for the robust naive Bayes classifier (β-NBC). The robust classification of microarray gene expression data.
