BioMed Research International

Volume 2017 (2017), Article ID 3020627, 17 pages

https://doi.org/10.1155/2017/3020627

## Robustification of Naïve Bayes Classifier and Its Application for Microarray Gene Expression Data Analysis

^{1}Lab of Bioinformatics, Department of Statistics, University of Rajshahi, Rajshahi 6205, Bangladesh

^{2}Department of Statistics, Begum Rokeya University, Rangpur, Rangpur 5400, Bangladesh

Correspondence should be addressed to Md. Shakil Ahmed

Received 18 March 2017; Revised 10 June 2017; Accepted 14 June 2017; Published 7 August 2017

Academic Editor: Federico Ambrogi

Copyright © 2017 Md. Shakil Ahmed et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The naïve Bayes classifier (NBC) is one of the most popular classifiers for class prediction or pattern recognition from microarray gene expression data (MGED). However, with the classical estimates of the location and scale parameters it is very sensitive to outliers, which is one of its most important drawbacks for gene expression data analysis. Gene expression datasets are often contaminated by outliers introduced during the several steps of the data generating process, from hybridization of DNA samples to image analysis. Therefore, in this paper, an attempt is made to robustify the Gaussian NBC by the minimum $\beta$-divergence method. The role of the minimum $\beta$-divergence method in this article is to produce robust estimators of the location and scale parameters from the training dataset and to detect and modify outliers in the test dataset. The performance of the proposed method depends on the tuning parameter $\beta$, and it reduces to the traditional naïve Bayes classifier when $\beta \to 0$. We investigated the performance of the proposed beta naïve Bayes classifier ($\beta$-NBC) in comparison with some popular existing classifiers (NBC, KNN, SVM, and AdaBoost) using both simulated and real gene expression datasets. We observed that the proposed method improved on the other classifiers in the presence of outliers and otherwise performed almost equally.

#### 1. Introduction

Classification is a supervised learning approach for separating multivariate data into their source populations. It plays a significant role in bioinformatics through class prediction or pattern recognition from molecular OMICS datasets. Microarray gene expression data analysis is one of the most important OMICS research wings of bioinformatics [1]. Several classification and clustering approaches have been proposed for analyzing MGED [2–11]. The Gaussian linear Bayes classifier (LBC) is one of the most popular classifiers for class prediction or pattern recognition. However, it is not so popular for microarray gene expression data analysis, since its covariance matrix becomes singular and cannot be inverted when the number of genes (*p*) is large relative to the number of patients/samples (*n*) in the training dataset. The Gaussian naïve Bayes classifier (NBC) overcomes this difficulty of the Gaussian LBC by imposing normality and independence assumptions on the variables. If these two assumptions are violated, then the nonparametric version of NBC is suggested in [12]. In this case the nonparametric classification methods work well, but they perform poorly for small sample sizes or in the presence of outliers. In MGED studies, sample sizes are small because of cost and limited specimen availability [13]. There are some other versions of NBC as well [14, 15]. However, none of them is robust against outliers, which is one of the most important drawbacks of the existing NBC for gene expression data analysis.

Gene expression datasets are often contaminated by outliers introduced during the several steps of the data generating process, from hybridization of DNA samples to image analysis. Therefore, in this paper, an attempt is made to robustify the Gaussian NBC by the minimum $\beta$-divergence method in two steps. In step 1, the minimum $\beta$-divergence method [16–18] is used to estimate the parameters of the Gaussian NBC from the training dataset. In step 2, outlying data vectors in the test dataset are detected using the $\beta$-weight function. We then propose criteria for detecting the outlying components of a test data vector and for replacing those components with reasonable values. It will be observed that the performance of the proposed method depends on the tuning parameter $\beta$ and that it reduces to the traditional Gaussian NBC when $\beta \to 0$. Therefore, we call the proposed classifier $\beta$-NBC.

We investigate the robustness of the proposed $\beta$-NBC in comparison with several robust linear classifiers based on the M-estimator [19, 20], the MCD (Minimum Covariance Determinant) and MVE (Minimum Volume Ellipsoid) estimators [21, 22], the Orthogonalized Gnanadesikan-Kettenring (OGK) estimator including MCD-A, MCD-B, and MCD-C [23], and Feasible Solution Algorithm (FSA) classifiers [24–26]. We observed that the proposed $\beta$-NBC outperforms these existing robust linear classifiers. Then we investigate the performance of the proposed method in comparison with some popular classifiers that are widely used in gene expression data analysis [27–29], including Support Vector Machine (SVM), *k*-nearest neighbors (KNN), and AdaBoost. We observed that the proposed method improves on the other classifiers in the presence of outliers and otherwise performs almost equally.

#### 2. Methodology

##### 2.1. Naïve Bayes Classifier

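The naïve Bayes classifiers (NBCs) [30] are a family of probabilistic classifiers based on Bayes' theorem with independence and normality assumptions on the variables. The common rule of NBCs is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. Assume that we have a training sample of vectors $\{x_{jk} = (x_{1jk}, x_{2jk}, \ldots, x_{pjk})^T;\ j = 1, 2, \ldots, N_k\}$ of size $N_k$ for $k = 1, 2, \ldots, m$, where $x_{ijk}$ denotes the *j*th observation of the *i*th variable in the *k*th population/class $C_k$. Then the NBCs assign a class label $C_k$ to an unlabeled data vector $x = (x_1, x_2, \ldots, x_p)^T$ for some *k* as follows:

$$x \in C_k \quad \text{if} \quad k = \arg\max_{l \in \{1, \ldots, m\}} P(C_l)\, f_l(x \mid \theta_l), \tag{1}$$

where $P(C_l)$ is the prior probability of the *l*th class. For the Gaussian NBC, the density function of the *k*th population/class can be written as

$$f_k(x \mid \theta_k) = (2\pi)^{-p/2}\, |\Sigma_k|^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right\}, \tag{2}$$

where $\theta_k = \{\mu_k, \Sigma_k\}$ and here $\mu_k = (\mu_{1k}, \mu_{2k}, \ldots, \mu_{pk})^T$ is the mean vector and the diagonal covariance matrix is

$$\Sigma_k = \operatorname{diag}\left(\sigma_{1k}^2, \sigma_{2k}^2, \ldots, \sigma_{pk}^2\right). \tag{3}$$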

##### 2.2. Maximum Likelihood Estimators (MLEs) for the Gaussian NBC

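We assume that the prior probabilities $P(C_k)$ are known and the maximum likelihood estimators (MLEs) $\hat{\mu}_k$ and $\hat{\Sigma}_k$ of $\mu_k$ and $\Sigma_k$ are obtained based on the training dataset as follows:

$$\hat{\mu}_k = \frac{1}{N_k} \sum_{j=1}^{N_k} x_{jk}, \tag{4}$$

$$\hat{\sigma}_{ik}^2 = \frac{1}{N_k} \sum_{j=1}^{N_k} \left(x_{ijk} - \hat{\mu}_{ik}\right)^2, \tag{5}$$

$$\hat{\Sigma}_k = \operatorname{diag}\left(\hat{\sigma}_{1k}^2, \hat{\sigma}_{2k}^2, \ldots, \hat{\sigma}_{pk}^2\right), \tag{6}$$

where $\hat{\mu}_{ik}$ is the *i*th component of $\hat{\mu}_k$, $i = 1, 2, \ldots, p$, and $k = 1, 2, \ldots, m$.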

It is obvious from (1)-(2) that the Gaussian NBC depends on the mean vectors ($\mu_k$) and the diagonal covariance matrices ($\Sigma_k$), which are estimated by the maximum likelihood estimators (MLEs) given in (4)–(6) based on the training dataset. Because the MLEs are themselves sensitive to outliers, the MLE based Gaussian NBC produces misleading results in the presence of outliers in the datasets. To overcome this problem, an attempt is made to robustify the Gaussian NBC by the minimum $\beta$-divergence method [16–18].
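As a concrete illustration of the classical procedure described above, the following is a minimal sketch of a Gaussian NBC with MLE parameters and the MAP rule. The function names `fit_gaussian_nbc` and `predict_gaussian_nbc`, the toy data, and the equal priors are assumptions made for illustration only and do not come from the paper.

```python
import numpy as np

def fit_gaussian_nbc(X, y):
    """Per-class MLEs: mean vector and diagonal variances (divisor N_k)."""
    params = {}
    for k in np.unique(y).tolist():
        Xk = X[y == k]
        # np.var divides by N_k by default, which matches the MLE.
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0))
    return params

def predict_gaussian_nbc(x, params, priors):
    """MAP rule: pick the class maximizing log prior + log Gaussian density."""
    scores = {}
    for k, (mu, var) in params.items():
        log_density = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[k] = np.log(priors[k]) + log_density
    return max(scores, key=scores.get)

# Hypothetical toy data: two classes with means 0 and 3 in five dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(3.0, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)

params = fit_gaussian_nbc(X, y)
priors = {0: 0.5, 1: 0.5}  # assumed known, as in the text
label_near_class1 = predict_gaussian_nbc(np.full(5, 3.0), params, priors)
label_near_class0 = predict_gaussian_nbc(np.zeros(5), params, priors)
```

Working on the log scale avoids numerical underflow of the density product when the number of genes is large.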

##### 2.3. Robustification of Gaussian NBC by the Minimum $\beta$-Divergence Method (Proposed)

###### 2.3.1. Minimum $\beta$-Divergence Estimators for the Gaussian NBC

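Let $p_k(x)$ be the true density and $f_k(x \mid \theta_k)$ be the model density for the *k*th population; then the $\beta$-divergence between the two p.d.f.s can be defined by

$$D_\beta(p_k, f_k) = \int \left[\frac{1}{\beta}\left\{p_k^{\beta}(x) - f_k^{\beta}(x \mid \theta_k)\right\} p_k(x) - \frac{1}{\beta + 1}\left\{p_k^{\beta + 1}(x) - f_k^{\beta + 1}(x \mid \theta_k)\right\}\right] dx \tag{7}$$

for $\beta > 0$, and $D_\beta(p_k, f_k) \ge 0$. Equality holds if and only if $p_k(x) = f_k(x \mid \theta_k)$ for all $x$. When $\beta$ tends to zero, the $\beta$-divergence reduces to the Kullback-Leibler (K-L) divergence; that is,

$$\lim_{\beta \to 0} D_\beta(p_k, f_k) = \int p_k(x) \log \frac{p_k(x)}{f_k(x \mid \theta_k)}\, dx = D_{\mathrm{KL}}(p_k, f_k). \tag{8}$$

The minimum $\beta$-divergence estimator is defined by

$$\hat{\theta}_{k,\beta} = \arg\min_{\theta_k} D_\beta(p_k, f_k). \tag{9}$$

For the Gaussian density $f_k(x \mid \theta_k)$ with $\theta_k = \{\mu_k, \Sigma_k\}$, the minimum $\beta$-divergence estimators $\hat{\mu}_{k,\beta}$ and $\hat{\Sigma}_{k,\beta}$ for the mean vector $\mu_k$ and the diagonal covariance matrix $\Sigma_k$, respectively, are obtained iteratively as follows:

$$\mu_{k,t+1} = \frac{\sum_{j=1}^{N_k} \phi_\beta(x_{jk} \mid \theta_{k,t})\, x_{jk}}{\sum_{j=1}^{N_k} \phi_\beta(x_{jk} \mid \theta_{k,t})}, \tag{10}$$

$$\Sigma_{k,t+1} = (1 + \beta)\, \frac{\sum_{j=1}^{N_k} \phi_\beta(x_{jk} \mid \theta_{k,t})\, \operatorname{diag}\left\{(x_{jk} - \mu_{k,t})(x_{jk} - \mu_{k,t})^T\right\}}{\sum_{j=1}^{N_k} \phi_\beta(x_{jk} \mid \theta_{k,t})}, \tag{11}$$

where

$$\phi_\beta(x \mid \theta_k) = \exp\left\{-\frac{\beta}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right\}. \tag{12}$$

The formulation of (10)–(12) is straightforward, as described in the previous works [17, 18]. The function in (12) is called the $\beta$-weight function, and it plays the key role in robust estimation of the parameters. If $\beta$ tends to 0, then (10) and (11) reduce to the classical noniterative estimates of the mean vector and the diagonal covariance matrix given in (4) and (6), respectively. The performance of the proposed method depends on the value of the tuning parameter $\beta$ and on the initialization of the Gaussian parameters $\theta_k = \{\mu_k, \Sigma_k\}$.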
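The iterative scheme can be sketched in a few lines for a single class. This is only an illustration under stated assumptions: it takes the updates to be a $\beta$-weighted mean and a $(1+\beta)$-scaled $\beta$-weighted variance, as in the minimum $\beta$-divergence literature [17, 18], uses the median and unit variances as starting values, and the names `beta_weight` and `min_beta_estimates`, the value $\beta = 0.2$, and the toy contamination are all hypothetical.

```python
import numpy as np

def beta_weight(X, mu, var, beta):
    """beta-weight of each row of X under a Gaussian with diagonal covariance."""
    return np.exp(-0.5 * beta * np.sum((X - mu) ** 2 / var, axis=1))

def min_beta_estimates(X, beta, n_iter=100, tol=1e-8):
    """Fixed-point iteration for the robust mean and diagonal variances."""
    mu = np.median(X, axis=0)   # robust starting point for the mean
    var = np.ones(X.shape[1])   # identity matrix as starting covariance
    for _ in range(n_iter):
        w = beta_weight(X, mu, var, beta)
        mu_new = w @ X / w.sum()
        # The variance update uses the current mean and the (1 + beta) correction.
        var_new = (1.0 + beta) * (w @ (X - mu) ** 2) / w.sum()
        converged = (np.max(np.abs(mu_new - mu)) < tol
                     and np.max(np.abs(var_new - var)) < tol)
        mu, var = mu_new, var_new
        if converged:
            break
    return mu, var

# Hypothetical data: 95 clean points around 0 plus 5 gross outliers at 20.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 1.0, (95, 3)), np.full((5, 3), 20.0)])

mu_robust, var_robust = min_beta_estimates(X_demo, beta=0.2)
mu_classical = X_demo.mean(axis=0)  # pulled toward the outliers
```

The outliers receive weights that are practically zero, so they barely influence the robust estimates, whereas the classical mean is shifted by about 1 in every coordinate. For a very large number of variables the exponent in the weight can underflow, so a practical implementation would need to guard against all weights becoming zero.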

###### 2.3.2. Parameters Initialization and Breakdown Points of the Estimates

The mean vector $\mu_k$ is initialized by the median vector, since the mean and the median coincide for the normal distribution and the median (Me) is a highly robust estimator of the central value of a distribution, with a breakdown point of 50%. The median vector of the *k*th class/population is defined as

$$\mathrm{Me}_k = \left(\operatorname{median}_{j}(x_{1jk}),\ \operatorname{median}_{j}(x_{2jk}),\ \ldots,\ \operatorname{median}_{j}(x_{pjk})\right)^T. \tag{13}$$

The diagonal covariance matrix $\Sigma_k$ is initialized by the identity matrix (**I**). The iterative procedure will converge to the optimal point of the parameters, since the initial mean vector belongs to the center of the dataset with a breakdown point of 50%. The proposed estimators can tolerate more than 50% contamination if the mean vector is initialized by a vector that belongs to the good part of the dataset and the variance-covariance matrix by the identity matrix (**I**). More discussion about high breakdown points for the minimum $\beta$-divergence estimators can be found in [18].

###### 2.3.3. $\beta$-Selection Using $T$-Fold Cross Validation (CV) for Parameter Estimation

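To select the appropriate $\beta$ by CV, we fix the tuning parameter of the evaluation criterion to a reference value $\beta_0$. The computation steps for selecting the appropriate $\beta$ by *T*-fold cross validation are given below.

*Step 1.* The dataset $D$ is split into $T$ subsets $D(1), D(2), \ldots, D(T)$.

*Step 2.* Let $D^{c}(t)$ denote the dataset $D$ with the subset $D(t)$ removed, for $t = 1, 2, \ldots, T$.

*Step 3.* Estimate $\hat{\mu}_{\beta}$ and $\hat{\Sigma}_{\beta}$ iteratively by (10) and (11) based on the dataset $D^{c}(t)$.

*Step 4.* Compute CV(*t*) using the held-out dataset $D(t)$ for each candidate value of $\beta$.

*Step 5.* End.

The suitable $\beta$ is then the candidate value that minimizes the CV criterion averaged over the $T$ folds (14).

If the sample size is small, then $T$ equal to the sample size (leave-one-out CV) can be used to select the appropriate $\beta$. More discussion about $\beta$ selection can also be found in [16–18].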

###### 2.3.4. Outlier Identification Using $\beta$-Weight Function

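The performance of the NBC for classification of an unlabeled data vector $x$ using (1) depends not only on the robust estimation of the parameters but also on whether $x$ itself is contaminated. The data vector $x$ is said to be contaminated if at least one of its components is contaminated by an outlier. To derive a criterion for whether the unlabeled data vector $x$ is contaminated, we consider the $\beta$-weight function (12) and rewrite it as follows:

$$\phi_\beta(x \mid \hat{\theta}_{k,\beta}) = \exp\left\{-\frac{\beta}{2}(x - \hat{\mu}_{k,\beta})^T \hat{\Sigma}_{k,\beta}^{-1} (x - \hat{\mu}_{k,\beta})\right\}. \tag{15}$$

The values of this weight function lie between 0 and 1. It produces a larger weight (but less than 1) if $x \in C_k$ and a smaller weight (but greater than 0) if $x \notin C_k$ or $x$ is contaminated by outliers. Therefore, the $\beta$-weight function (15) can be characterized as

$$\phi_\beta(x \mid \hat{\theta}_{k,\beta}) \begin{cases} > \delta_k, & \text{if } x \in C_k \text{ and } x \text{ is not contaminated}, \\ \le \delta_k, & \text{otherwise}. \end{cases} \tag{16}$$

The threshold value $\delta_k$ can be determined based on the empirical distribution of the $\beta$-weight function, as discussed in [31], by the quantile values of $\phi_\beta(x_{jk} \mid \hat{\theta}_{k,\beta})$ for $j = 1, 2, \ldots, N_k$ with probability

$$\Pr\left\{\phi_\beta(x_{jk} \mid \hat{\theta}_{k,\beta}) \le \delta_k\right\} \le \alpha, \tag{17}$$

where $\alpha$ is the probability used for selecting the cut-off value $\delta_k$, and the value of $\alpha$ should lie between 0.00 and 0.05. In this paper, $\alpha$ is chosen heuristically to fix the cut-off value for detection of an outlying data vector using (18). This idea was first introduced in [31].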

Then the criterion for deciding whether the unlabeled data vector $x$ is contaminated is defined by comparing its $\beta$-weight (15) with the threshold value (18).

However, in this paper, we choose the threshold value directly from the $\beta$-weights of the training dataset together with the unclassified data vector $x$, using a heuristically chosen constant (19). The same rule was also used in the previous works [16, 18] to choose the threshold value for outlier detection.
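The quantile-based variant of this idea can be sketched as follows. This is an illustration under explicit assumptions rather than the rule used in the paper: the threshold for each class is taken as the empirical $\alpha$-quantile of the training $\beta$-weights, a test vector is flagged as contaminated when its weight is at or below the threshold for every class, and $\alpha = 0.05$, $\beta = 0.2$, the function names, and the use of true parameters in place of robust estimates are all hypothetical.

```python
import numpy as np

def beta_weight(X, mu, var, beta):
    """beta-weight of each row of X under a Gaussian with diagonal covariance."""
    X = np.atleast_2d(X)
    return np.exp(-0.5 * beta * np.sum((X - mu) ** 2 / var, axis=1))

def quantile_threshold(X_train_k, mu, var, beta, alpha=0.05):
    """Empirical alpha-quantile of the training weights of one class."""
    return np.quantile(beta_weight(X_train_k, mu, var, beta), alpha)

def is_contaminated(x, class_params, thresholds, beta):
    """Assumed rule: flag x if its weight is at or below the threshold in every class."""
    return all(
        beta_weight(x, mu, var, beta)[0] <= thresholds[k]
        for k, (mu, var) in class_params.items()
    )

# Hypothetical setting: two classes in four dimensions with unit variances.
rng = np.random.default_rng(2)
beta = 0.2
class_params = {0: (np.zeros(4), np.ones(4)), 1: (np.full(4, 3.0), np.ones(4))}
train = {k: rng.normal(mu, 1.0, (50, 4)) for k, (mu, _) in class_params.items()}
thresholds = {
    k: quantile_threshold(train[k], mu, var, beta)
    for k, (mu, var) in class_params.items()
}

x_clean = np.zeros(4)                       # sits at the class 0 mean
x_outlying = np.array([0.0, 0.0, 0.0, 25.0])  # one grossly outlying component
flag_clean = is_contaminated(x_clean, class_params, thresholds, beta)
flag_outlying = is_contaminated(x_outlying, class_params, thresholds, beta)
```

A single extreme component is enough to drive the weight toward zero for every class, which is why the weight works as a detector of contaminated vectors.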

###### 2.3.5. Classification by the Proposed $\beta$-NBC

When the unlabeled data vector $x$ is usual, its label/class can be determined using the minimum $\beta$-divergence estimators $\hat{\theta}_{k,\beta} = \{\hat{\mu}_{k,\beta}, \hat{\Sigma}_{k,\beta}\}$ in the predicting equation (1). If the unlabeled data vector $x$ is unusual/contaminated by outliers, then we propose the following classification rule. We compute the absolute difference between the outlying vector and each of the mean vectors as

$$d_k = \left|x - \hat{\mu}_{k,\beta}\right| = (d_{1k}, d_{2k}, \ldots, d_{pk})^T, \quad k = 1, 2, \ldots, m. \tag{20}$$

Compute the sum of the smallest *r* components of $d_k$ as $S_k$, where $r \le p$ is a rounded integer. Then the unlabeled test data vector $x$ can be classified as

$$x \in C_k \quad \text{if} \quad S_k = \min_{l \in \{1, \ldots, m\}} S_l. \tag{21}$$

If the outlying test vector $x$ is classified into class $C_k$, then each of its components is checked for outlyingness against the *k*th population. Then we update $x$ by replacing its outlying components with the corresponding components of the mean vector of the *k*th population. Let $x^{*}$ be the updated vector of $x$. Then we use $x^{*}$ instead of $x$ to confirm the label/class of $x$ using (1).
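The rule for a contaminated test vector can be sketched as below. Two details are assumptions introduced only for illustration: `r` is taken as half of the number of variables (rounded), and a component is treated as outlying when it lies more than three standard deviations from the class mean. The function name `classify_contaminated` and the toy values are likewise hypothetical.

```python
import numpy as np

def classify_contaminated(x, class_params, r_fraction=0.5, z_cut=3.0):
    """Trimmed absolute-deviation rule followed by component replacement.

    r_fraction and z_cut are illustrative assumptions, not values from the paper.
    """
    r = max(1, int(round(r_fraction * x.size)))
    # Sum of the r smallest absolute deviations from each class mean.
    trimmed_sums = {
        k: np.sort(np.abs(x - mu))[:r].sum()
        for k, (mu, var) in class_params.items()
    }
    k_hat = min(trimmed_sums, key=trimmed_sums.get)
    mu, var = class_params[k_hat]
    outlying = np.abs(x - mu) > z_cut * np.sqrt(var)
    x_star = np.where(outlying, mu, x)  # replace outlying components by the mean
    return k_hat, x_star, outlying

# Hypothetical robust estimates for two classes in six dimensions.
class_params = {0: (np.zeros(6), np.ones(6)), 1: (np.full(6, 3.0), np.ones(6))}
x = np.array([0.1, -0.2, 0.3, 0.0, 50.0, -40.0])  # last two components corrupted

k_hat, x_star, outlying = classify_contaminated(x, class_params)
```

Because only the smallest deviations enter the sum, the two corrupted components do not affect the class assignment, and the cleaned vector `x_star` can then be passed to the usual MAP rule.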

#### 3. Simulation Study

##### 3.1. Simulated Dataset 1

To investigate the performance of our proposed $\beta$-NBC classifier in comparison with four popular classifiers (KNN, NBC, SVM, and AdaBoost), we generated both training and test datasets from multivariate normal distributions with different mean vectors ($\mu_1$, $\mu_2$) of length $p$ but a common covariance matrix. In this simulation study, we generated $N_1$ samples from the first population and $N_2$ samples from the second population for both training and test datasets. We computed the training and test error rates for all five classifiers using both original and contaminated datasets with different pairs of mean vectors, where the other parameters remain the same for each dataset. For convenience of presentation, the second mean vector is generated by adding *t* to each component of the first mean vector.
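A data-generating scheme of this kind can be sketched as follows. The sample sizes, dimension, shift *t*, identity covariance, and in particular the contamination mechanism (adding a large constant to one randomly chosen component of a random subset of samples) are assumptions for illustration; the function names `simulate_two_class` and `contaminate` are hypothetical.

```python
import numpy as np

def simulate_two_class(n1, n2, p, t, rng):
    """Two Gaussian classes with a common covariance; mu2 = mu1 + t componentwise."""
    mu1 = np.zeros(p)
    mu2 = mu1 + t
    cov = np.eye(p)  # assumed common covariance matrix
    X = np.vstack([
        rng.multivariate_normal(mu1, cov, n1),
        rng.multivariate_normal(mu2, cov, n2),
    ])
    y = np.array([0] * n1 + [1] * n2)
    return X, y

def contaminate(X, fraction, shift, rng):
    """Assumed contamination: shift one random component in a fraction of the rows."""
    Xc = X.copy()
    n_out = int(round(fraction * X.shape[0]))
    rows = rng.choice(X.shape[0], size=n_out, replace=False)
    cols = rng.integers(0, X.shape[1], size=n_out)
    Xc[rows, cols] += shift
    return Xc, rows

rng = np.random.default_rng(3)
X_train, y_train = simulate_two_class(n1=40, n2=40, p=10, t=1.0, rng=rng)
X_test, y_test = simulate_two_class(n1=40, n2=40, p=10, t=1.0, rng=rng)
X_test_contaminated, outlier_rows = contaminate(X_test, fraction=0.1, shift=15.0, rng=rng)
```

Error rates on `X_test` and `X_test_contaminated` can then be compared for each classifier while *t* is varied.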

##### 3.2. Simulated Dataset 2

To investigate the performance of the proposed classifier ($\beta$-NBC) in comparison with the classical NBC for the classification of objects into two groups, we consider a model for generating gene expression datasets, displayed in Table 1, which was also used by Nowak and Tibshirani [32]. In Table 1, the first column represents the gene expressions of normal individuals and the second column represents the gene expressions of patients. The first row represents the genes from the first gene group and the second row represents the genes from the second gene group. To randomize the gene expressions, Gaussian noise is added. First we generate a training gene-set using the data generating model (Table 1), with a given number of genes in each of the two gene groups and given numbers of normal individuals and patients (e.g., cancer or any other disease). Then we generate a test gene-set using the same model with the same parameters as before.