BioMed Research International

Volume 2017, Article ID 2437608, 11 pages

https://doi.org/10.1155/2017/2437608

## Metabolomic Biomarker Identification in Presence of Outliers and Missing Values

^{1}Bioinformatics Lab, Department of Statistics, Rajshahi University, Rajshahi, Bangladesh^{2}Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh^{3}Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh^{4}Institute of Biological Sciences, Rajshahi University, Rajshahi, Bangladesh

Correspondence should be addressed to Nishith Kumar; moc.liamg@90urb.kn

Received 24 October 2016; Revised 15 December 2016; Accepted 18 January 2017; Published 14 February 2017

Academic Editor: Peter J. Oefner

Copyright © 2017 Nishith Kumar et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Metabolomics is the sophisticated and high-throughput technology based on the entire set of metabolites which is known as the connector between genotypes and phenotypes. For any phenotypic changes, potential metabolite (biomarker) identification is very important because it provides diagnostic as well as prognostic markers and can help to develop new biomolecular therapy. Biomarker identification from metabolomics data analysis is hampered by the use of high-throughput technology that provides high dimensional data matrix which contains missing values as well as outliers. However, missing value imputation and outliers handling techniques play important role in identifying biomarker correctly. Although several missing value imputation techniques are available, outliers deteriorate the accuracy of imputation as well as the accuracy of biomarker identification. Therefore, in this paper we have proposed a new biomarker identification technique combining the groupwise robust singular value decomposition, -test, and fold-change approach that can identify biomarkers more correctly from metabolomics dataset. We have also compared the performance of the proposed technique with those of other traditional techniques for biomarker identification using both simulated and real data analysis in absence and presence of outliers. Using our proposed method in hepatocellular carcinoma (HCC) dataset, we have also identified the four upregulated and two downregulated metabolites as potential metabolomic biomarkers for HCC disease.

#### 1. Introduction

Biomarker discovery is comparatively new and one of the most dominating fields of biological research. Different disciplines are involved with the study of biomarkers: clinical and environmental chemistry, molecular biology, toxicology, food research, plant and animal biology, and so on. In general, biomarkers identification research is needed whenever a whole set/pool set of features are differentiated into two or more groups of samples. In the area of metabolomics, biomarkers are small molecules (metabolites) that can be used for early warning indicators of disease, observing disease progression, and predicting receptivity to treatment. Thus, biomarkers may be categorized into three most important groups: diagnostic, prognostic, and predictive markers [1]. Diagnostic markers are required for early and/or accurate diagnosis of disease to allow best possible treatment choices. On the other hand, prognostic markers give information about future route of a disease that would influence the treatment choices. Predictive markers provide sagacity on the potential responses of an individual to the different treatment options. Therefore, biomarker discovery is very important for the development of predictive, preventive, and personalized medicine [2].

For every disease or phenotypic changes, some metabolites are upregulated and/or some are downregulated from standard within a cell. The metabolites which are upregulated or downregulated between disease and control groups are known as differentially expressed (DE) metabolites. The classic approach to identify differentially expressed metabolites is Student’s -test (for two classes of samples). If the pooled variance of two groups is small, then Student’s -test often increases the false discovery (FDR) for metabolic biomarker identification. However, biological fold-change (FC) approach is used to control the FDR [3]. Recently, volcano plot [4] is using to identify differentially expressed metabolites based on both values from -test and fold-change (FC) values [5]. In most of the cases, we get a large number of DE metabolites; therefore, we need to identify the most influential feature/metabolites from the set of DE metabolites. The conventional approaches for ranking the influential metabolites are value and fold-change approach. However, Gevrey et al. in 2003 [6] computed the contribution of features using artificial neural network model and also Rakotomamonjy (2003) [7] and Ishak and Ghattas (2005) [8] suggested a new method for variable/feature selection on ranking score derived from support vector machine (SVM). Thus, in this paper, we have used the state of the art supervised learning technique SVM for ranking the most influential metabolites (biomarkers) according to the importance. However, the performance of true biomarker discovery is very much influenced by missing value imputation techniques [9] as well as outliers handling [10]. It is well known that mass spectrometry (MS) based metabolomics dataset frequently contains missing values [11–13] and often contains outliers [14] due to several reasons, like experimental, analytical, computational, and biological hazard. Although several missing imputation techniques are available for metabolomics data analysis like Zero imputation [11], mean imputation [11], median imputation [11], kNN imputation [15], RF imputation [16], missing value replaced by half of the minimum value found in each metabolite [17], replacing missing values by probabilistic principal component analysis (PPCA) [18], Bayesian PCA (BPCA) [9, 19], multiple imputation with expectation maximization (EM) algorithm, and Monte Carlo Markov chain (MCMC) method [20], and so on, all the above methods can only solve the missing imputation problem. However, these missing imputation techniques are sensitive to outliers and cannot handle the outliers problem simultaneously because these methods did not implement any robust function or outliers detection and replacement algorithms directly. Moreover, the published outliers handling method does not deal with missing values.

Therefore, in this paper we have developed a new robust technique for biomarker identification using the groupwise RSVD [21], -test, FC analysis, and SVM based feature selection approaches that can identify biomarker correctly by imputing missing values and solving outliers problem simultaneously.

To compare the performance of the proposed biomarker identification technique with those of the conventional techniques, we considered three well known traditional and recently used biomarker identification techniques (missing imputation techniques with -test and FC values): Zero imputation with -test and FC (), kNN imputation with with -test and FC (), and RF imputation with -test and FC ().

#### 2. Materials and Methods

In this paper, we have proposed a robust technique to identify metabolomic biomarker that can handle missing values as well as outliers problem at the same time. To investigate the performance of the proposed method, we considered existing three popular missing value imputation techniques, Zero imputation, kNN imputation, and RF imputation, and also used -test and fold-change approach for biomarker identification. In Zero imputation, all the missing values of a dataset were replaced by Zero. In kNN imputation [15], missing value was computed by averaging of nonmissing values of its first number of nearest neighbours. In R platform kNN imputation method can be found from the library “*impute.*” Missing imputation algorithm through random forest [16] is a tree based regression and classification technique which is suitable for both parametric and nonparametric dataset [22]. In R-language this method can be found from the library “*missForest.*” The short description of the proposed biomarker identification technique is given below.

##### 2.1. Biomarker Identification Technique in Presence of Outliers and Missing Values (Proposed)

Let us consider a metabolomics data matrix , and , with metabolites and samples that contains missing values and outliers, where the rows of represent the different metabolites and the columns of represent the different samples. In metabolomics data analysis, we can talk about biomarker identification whenever a metabolite or a set of metabolites are differentially expressed between two groups (disease and control) of samples in a metabolomics dataset. In that case, the metabolomics dataset can be partitioned according to the groups of samples aswhere is the number of subjects of group-1 and is the number of subjects of group-2.

In this paper, we have proposed a new RSVD based metabolomic biomarker identification technique in presence of missing values and outliers. The RSVD of a data matrix can be written as , where is a column orthonormal matrix, is a row orthogonal matrix, and is a diagonal matrix that contains the singular values. and contain the eigen vectors of and , respectively. If we want to approximate by using rank then we can write as . The number of is selected in such a way that first number of ’s can explain at least ()100% of total variation of data, where the value of depends on user interest. The steps of the metabolomic biomarker identification algorithm are given below.

*Step 1. *Partition the matrix as according to disease and control group of samples, where

*Step 2. *Apply the RSVD for each partitioned matrix and approximate those as by rank , where is selected in such a way that first* r* number of ’s can explain at least ()100% of total variation of data, where the value of depends on user interest; in our case we choose .

*Step 3. *Reconstruct each partitioned data matrix as by replacing the missing values and outlying cells of using the corresponding values of the approximated data matrices , respectively, where outlier is detected by using outlier detection rules like interquartile range (IQR) rule [23].

*Step 4. *Reconstruct the metabolomics data matrix as .

*Step 5. *Compute the differentially expressed metabolite from the reconstructed metabolomics data matrix using value and FC value. The value can be calculated using -test. To declare the metabolite as differentially expressed, choose the threshold value using Bonferroni correction and absolute fold-change (FC) cut-offs of >2.

*Step 6. *Rank the differentially expressed metabolites according to their influence or importance using support vector machine (SVM) [24].

*Step 7. *List the first few top upregulated and downregulated metabolites from Step 6 that classify the samples with higher accuracy and declare those metabolites as biomarker. Upregulated and downregulated metabolites can be identified by the sign of fold-change values.

The R-code of the proposed method can be found in the following URL: http://www.statru.org/wp-content/uploads/2010/12/MetabBio.zip

##### 2.2. Dataset Description

Both simulated and real metabolomics datasets have been used to demonstrate the performance of the proposed method in a comparison of the other methods.

###### 2.2.1. Simulated Dataset

We have simulated metabolomics datasets using the one-way ANOVA model, defined by , where is the intensity of th metabolite, th group, and th sample; denote the average intensity of metabolite ; is the th group effect for th metabolite and is the random error term. In this linear model, we have taken *uniform *(4, 8) and . We have also taken two types of metabolites: (i) differentially expressed (DE) metabolites and (ii) equally expressed (EE) metabolites. DE metabolites are divided into two groups: upregulated metabolites and downregulated metabolites. Disease and control group effects for upregulated metabolites are and ; for downregulated metabolites disease and control group effects are and and in case of equally expressed metabolites the group effect for both cases is . To make a simulated metabolomics dataset, we have taken 500 metabolites (30 DE metabolites and 470 EE metabolites) and 80 samples (45 control and 35 disease). From 500 metabolites, the first 15 metabolites have been included as upregulated metabolites and 121 to 135 metabolites have been incorporated as downregulated metabolites for disease samples. We have also included different rates of missing values (10%, 15%, and 20%) in this dataset, where 50% missing were given at random and the rest of the missing were given for lower values. To measure the performance of the proposed method in different condition, we have incorporated different rates (3%, 5%, 7%, 10%, and 15%) of outliers in simulated dataset. The outliers in the th metabolite were taken from normal distribution with mean = and variance = , where and are the mean and variance of the th metabolite. We have distributed the outliers randomly by different rates (3%, 5%, 7%, 10%, and 15%) in the data matrix cell; therefore, outliers may fall anywhere in the data matrix. We simulated 100 datasets using the above conditions and also measured the performance of the proposed method using these simulated dataset.

###### 2.2.2. Real Metabolomics Dataset

In this paper, we have used a publicly available real metabolomics dataset on the metabolomic effect of hepatocellular carcinoma (HCC) that contains the abundance level measurements of metabolites from different subjects. This dataset was originally produced by Patterson et al. [25]. To extract the metabolomic profile from the sample, ultra-performance liquid chromatography coupled with electrospray ionization/quadrupole-time-of-flight mass spectrometry (UPLC-ESI-Q-TOF-MS) was used and this dataset was normalized by Pareto scaling. This data matrix contains 1388 rows and 57 columns. Each row is a metabolite detected by the retention time (rt) and mass to charge () ratio that were included in first column and the second column, respectively. The remaining 55 columns were different subjects that came from two groups. 20 subjects were from the hepatocellular carcinoma (HCC) group and 35 subjects were from the control group. There were 26.52% missing values/cells in this dataset. More details about the data can be found in the article of Patterson et al. [25]. To measure the performance of the proposed method, we calculated the error between original and reconstructed data that we can understand how well the proposed method reconstructs the data. We also modified the real dataset by replacing 5% existing values as missing and changing 5%, 10%, and 15% real values by , where and are the mean and variance of the th metabolite in the HCC data matrix.

##### 2.3. Performance Measurement Criteria

To investigate the performance of the proposed method, we calculated the root mean square error (RMSE) between the original dataset and reconstructed dataset that we can easily identify better method to reconstruct the data. The formula of RMSE is , where and are the original and reconstructed value of the th row and th column of the data matrix, respectively.

For simulated dataset, we also assess the performance of the proposed biomarker identification technique compared to the other techniques where we consider three performance indices: (i) ROC curve, (ii) area under the ROC curve (AUC), and (iii) misclassification error rate (MER). To calculate the above measures, first we identify the DE metabolites using four different techniques including the proposed technique. Using the above results, we draw ROC curve using true positive rate (TPR) and false positive rate (FPR). In a binary classification system if a positive instance is correctly classified as positive, it is called true positive (TP); however, if it is wrongly classified as negative then it is called false negative (FN). If a negative instance is correctly classified as negative, it is called true negative (TN); otherwise, it is called false positive (FP). The TPR and FPR are calculated byROC curve plots contain the TPR in -axis and FPR in -axis.

In real metabolomics dataset, we do not know the class of metabolites; therefore, we measure the performance of the proposed biomarker identification method in a comparison of the other considering methods by sample classification using SVM classifier on the basis of the computed DE metabolites. Using the sample classification results, we calculate the performance indices: ROC curve, accuracy, sensitivity, specificity, positive predicted value, negative predicted value, and balanced accuracy.

A better method would produce smaller values for FPR and MER and larger values for TPR and area under the ROC curve (AUC) by any DE metabolite detector.

#### 3. Results and Discussion

We have evaluated the performance of the proposed method using both simulated and real data (hepatocellular carcinoma) analysis results. The detailed description of the simulated and real data analysis results is discussed in Sections 3.1 and 3.2.

##### 3.1. Simulated Data Analysis Results

To investigate the performance of the proposed method for imputing missing values, we calculated the average RMSE between original and reconstructed data matrix using 100 simulated datasets with different rates (10%, 15%, and 20%) of missing values in both absence and presence of outliers. The average RMSE for different methods in both different rates (10%, 15%, and 20%) of missing values and presence of different rates (0%, 3%, 5%, 7%, 10%, and 15%) of outliers have been given in Table 1. Table 1 showed that in absence of outliers all the methods except Zero imputation produce similar as well as smaller average RMSE between original and reconstructed data matrices; however, in presence of outliers, the proposed method gave the smaller average RMSE between original and reconstructed data matrices compared to other (Zero, kNN, and Zero imputation) techniques. Therefore, we could say that the proposed method reconstructs the simulated data very closely in presence of both different rates (0%, 3%, 7%, 10%, and 15%) of outliers and different rates (10%, 15%, and 20%) of missing values.