BioMed Research International

Volume 2015, Article ID 472197, 7 pages

http://dx.doi.org/10.1155/2015/472197

## Near-Infrared Spectroscopy as a Diagnostic Tool for Distinguishing between Normal and Malignant Colorectal Tissues

^{1}Yibin University Hospital, Yibin, Sichuan 644000, China

^{2}Key Lab of Process Analysis and Control of Sichuan Universities, Yibin University, Yibin, Sichuan 644000, China

^{3}The First Affiliated Hospital, Chongqing Medical University, Chongqing 400016, China

^{4}The Affiliated Hospital, North Sichuan Medical College, Nanchong, Sichuan 637000, China

Received 4 September 2014; Accepted 26 December 2014

Academic Editor: Dominic Fan

Copyright © 2015 Hui Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Cancer diagnosis is one of the most important tasks of biomedical research and has become a main objective of medical investigations. The present paper proposes an analytical strategy for distinguishing between normal and malignant colorectal tissues by combining near-infrared (NIR) spectroscopy with chemometrics. The successive projections algorithm-linear discriminant analysis (SPA-LDA) was used to seek a reduced subset of variables/wavenumbers and build an LDA diagnostic model. For comparison, partial least squares-discriminant analysis (PLS-DA) based on full-spectrum classification was used as the reference. Principal component analysis (PCA) was used for a preliminary analysis. A total of 186 spectra from 20 patients with partial colorectal resection were collected and divided into three subsets for training, optimizing, and testing the model. The results showed that, compared to PLS-DA, SPA-LDA provided a more parsimonious model using only three wavenumbers/variables (4065, 4173, and 5758 cm^{−1}) to achieve a sensitivity of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset. This indicates that the combination of NIR spectroscopy and the SPA-LDA algorithm can serve as a potential tool for distinguishing between normal and malignant colorectal tissues.

#### 1. Introduction

Nowadays, cancer has become one of the leading causes of death [1, 2]. Great effort has been devoted to cancer-related research, and cancer diagnosis has become a central topic within it. The conventional methods for cancer diagnosis are mainly based on the morphological appearance of the tumor tissue. Their limitations are the strong subjective bias of the pathology expert in discriminating the tumor and the difficulty of differentiating between cancer subtypes [3].

Colorectal cancer is a disease of genes that control the proliferation, differentiation, and death of colon cells [4]. It has become the fourth most common cancer and continues to be the third leading cause of cancer-related deaths in both men and women, accounting for about 10% of all cancer deaths annually [5]. If colorectal cancer is found during its early stages, the 5-year relative survival rate is 90%. However, only about one-third of colorectal cancers are detected at early stages [6]. Although there are some available methods for diagnosing colorectal cancer [7], for example, serum markers, flexible sigmoidoscopy, and colonoscopy, the final result still relies on the gold standard of histopathologic diagnosis, which is time-consuming and strongly dependent on the pathologist’s subjective judgment and experience. Hence, there is an urgent need to develop simple and fast diagnostic methods.

Recent research has demonstrated the applicability of optical spectroscopic techniques for fast, noninvasive, and in situ diagnosis of various diseases including cancer. Infrared (IR) and near-infrared (NIR) spectroscopy in particular have proved to be useful tools for disease diagnosis because of their potential to probe changes in tissues and cells at the molecular level [8]. It is known that the generation and progression of any cancer manifest themselves at the molecular level before morphologic changes emerge, and such early changes cannot be detected by traditional methods or even pathologic examinations [9]. NIR spectroscopy, as a powerful tool with practical advantages, can rapidly capture information on the chemical bonds in functional groups and is therefore sensitive to changes in molecular composition and structure [10–12]. Cancer tissues differ from normal ones in composition, and any alteration in tissue composition can be probed and used for diagnostic purposes. The NIR technique has been used in studies of several cancers, such as lung [13], gastric [14], esophageal [15], endometrial [16], and pancreatic [17] cancer.

However, the NIR spectrum mainly corresponds to overtones and combinations of fundamental vibrational transitions that occur in the IR region and is therefore overlapping, broad, and weak, without distinct signatures of individual components [18]. An NIR-based diagnostic application requires a suitable diagnostic model that can best discriminate the measured spectra from an unknown tissue. Over the years, a variety of modeling algorithms have been developed or used for optical diagnosis of cancer. Both traditional algorithms such as soft independent modeling of class analogy (SIMCA) [19] and novel algorithms such as the support vector machine (SVM) [20] have been used for this purpose. The NIR spectrum comprises measurements over a large number of channels for each sample. In many cases, the responses of different instrument channels exhibit strong correlation, and some channels carry no relevant information. Thus, it is beneficial to use only a subset of channels rather than the entire set of measurements [21]. Such a step also facilitates the interpretation of the model and can guide the design of less costly instruments. Recent efforts are directed towards using variable selection to identify the best diagnostic features for obtaining a simple and easily interpreted model. In this context, Araújo et al. [22] developed the successive projections algorithm (SPA) for selecting variables in multiple linear regression (MLR). In a subsequent work, Pontes et al. [23] extended the basic SPA to handle classification problems by merging it with linear discriminant analysis (LDA), which results in the so-called SPA-LDA method. SPA-LDA has been successfully applied in various classification tasks such as coffee and soil classification [24, 25].

The present paper proposes an analytical strategy for distinguishing between normal and malignant colorectal tissues by combining NIR spectroscopy with variable selection. For this purpose, SPA-LDA was used to seek a reduced subset of variables/wavenumbers and build an LDA diagnostic model. For comparison, partial least squares-discriminant analysis (PLS-DA) based on full-spectrum classification was used as the reference. Principal component analysis (PCA) was used for a preliminary analysis. A total of 186 spectra from 20 patients with partial colorectal resection were collected and divided into three subsets for training, optimizing, and testing the model. The results showed that, compared to PLS-DA, SPA-LDA provided a simpler and better model, which used only three wavenumbers/variables (4065, 4173, and 5758 cm^{−1}) to achieve a sensitivity of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset. This indicates that the combination of NIR spectroscopy and the SPA-LDA algorithm can serve as a potential tool for distinguishing between normal and malignant colorectal tissues.

#### 2. Theory and Methods

##### 2.1. Partial Least Squares-Discriminant Analysis (PLS-DA)

Partial least squares (PLS) regression is a classic latent variable-based multivariate calibration method. Partial least squares-discriminant analysis (PLS-DA) is a classification algorithm that combines the properties of PLS regression with discriminant analysis [26]. The outstanding advantage of PLS-DA is that the main sources of variability in the dataset are modeled by the so-called latent variables (LVs) and, therefore, by the associated scores and loadings, which makes the data structure and the relations in the dataset easy to visualize and understand. PLS-DA is in fact a special form of PLS modeling that focuses on finding the variables and directions in multivariate space that discriminate the known classes in the training set. If there are only two classes to separate, the PLS model uses one dummy variable, which codes for class membership as follows: 1 for samples belonging to a given class of interest and 2 for samples belonging to the other class. A discriminant model is developed by regressing the assigned dummy variable on the independent matrix (spectral data).

The model constructed on the experimental dataset can be used to assign an unknown sample to a previously defined class based on its measured features, such as its spectrum. Classification of a new sample is derived from the output value of the PLS model. The output value is a real number, not an integer, and should ideally be close to one of the values used to code the classes (either 1 or 2). A threshold between 1 and 2 is set so that a sample is assigned to class 1 if the predicted value is below the threshold and to class 2 if the predicted value is above it. PLS-DA uses an appropriate number of LVs, that is, linear combinations of the original variables, to maximize the discrimination among the classes. The number of LVs can be optimized by the criterion of lowest prediction error in cross validation.
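The dummy-variable coding and thresholding described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the single-response NIPALS routine `pls1_fit`, the helper `plsda_predict`, the synthetic-data generator `make_toy_spectra`, and the threshold of 1.5 (the midpoint between the two class codes) are all assumptions introduced here for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """Fit a single-response PLS model by NIPALS; return (b, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean           # centered residual matrices
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr                         # weight: covariance with the dummy variable
        w /= np.linalg.norm(w)
        t = Xr @ w                            # scores of the current latent variable
        tt = t @ t
        p = Xr.T @ t / tt                     # X loadings
        c = yr @ t / tt                       # y loading
        Xr = Xr - np.outer(t, p)              # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)       # regression vector in the original variables
    return b, x_mean, y_mean

def plsda_predict(X, model, threshold=1.5):
    """Return class labels (1 or 2) and the real-valued PLS outputs."""
    b, x_mean, y_mean = model
    y_hat = (X - x_mean) @ b + y_mean
    return np.where(y_hat < threshold, 1, 2), y_hat

def make_toy_spectra(rng, n_per_class=20, n_channels=50):
    """Hypothetical two-class 'spectra': class 2 has a raised band at channels 10-15."""
    base = np.linspace(0.0, 1.0, n_channels)
    X1 = base + 0.05 * rng.standard_normal((n_per_class, n_channels))
    X2 = base + 0.05 * rng.standard_normal((n_per_class, n_channels))
    X2[:, 10:16] += 1.0
    y = np.concatenate([np.ones(n_per_class), 2.0 * np.ones(n_per_class)])
    return np.vstack([X1, X2]), y
```

A typical use would fit the model on a training set with `pls1_fit(X_train, y_train, n_lv)` and classify new spectra with `plsda_predict`. In practice, the number of latent variables `n_lv` would be chosen by cross validation rather than fixed in advance.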

##### 2.2. Successive Projections Algorithm-Linear Discriminant Analysis (SPA-LDA)

The successive projections algorithm (SPA) is a forward variable selection method aimed at minimizing variable collinearity in modeling. It was originally developed by Araújo et al. in the context of multivariate calibration [22]. In SPA, the selection of variables is formulated as a constrained combinatorial optimization problem. The search is restricted to certain subsets of variables, which result from a sequence of projection operations on the matrix of instrumental responses. Therefore, the number of cost-function evaluations is considerably reduced compared to an exhaustive search. In multivariate calibration, SPA is aimed at screening variables for building multiple linear regression (MLR) models. Subsequently, SPA has been extended to improve the performance of linear discriminant analysis (LDA) classification models, which easily suffer from multicollinearity among the input variables.

The combination of SPA and LDA is denoted SPA-LDA. Figure 1 gives the flowchart of the SPA-LDA algorithm. SPA-LDA aims at selecting a subset of variables with minimal collinearity and appropriate discriminating ability for use in classification problems. For this purpose, it is assumed that a training set consisting of $N$ samples with known class labels is available for guiding the process of variable selection. In the case of a spectroscopic dataset, each sample consists of a spectrum with $K$ points (wavenumbers/wavelengths). The SPA-LDA scheme comprises two main phases [27].

In Phase 1, the training samples/spectra are first centered on the mean of each class and stacked in the form of a matrix $\mathbf{X}$ of size $N \times K$. Each column of $\mathbf{X}$ corresponds to a variable. Projection operations on the columns of $\mathbf{X}$ are then carried out to create $K$ chains with $M$ variables each. Because degrees of freedom are lost in calculating the class means, the chain length $M$ is limited by $N - C$, where $C$ is the number of classes involved in the problem. Each chain is initialized with one of the $K$ available variables. Subsequent variables are added to the chain so as to display the least collinearity with the previous ones. The collinearity is evaluated by the correlation between the respective column vectors of $\mathbf{X}$.

In Phase 2, different variable subsets are extracted and evaluated. For each of the $K$ chains formed in Phase 1, a total of $M$ subsets of variables can be extracted by using from one up to $M$ variables in the order in which they were selected. Thus, a total of $K \times M$ subsets of variables can be generated. These candidate subsets are assessed in terms of a cost function involving the average risk of misclassification over the validation set.
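The chain construction of Phase 1 and the subset extraction of Phase 2 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names `center_by_class`, `spa_chain`, and `candidate_subsets` are hypothetical, and the least-collinear variable is taken to be the column with the largest norm after projection onto the subspace orthogonal to the variables already selected, which is the usual SPA formulation.

```python
import numpy as np

def center_by_class(X, labels):
    """Subtract from each spectrum the mean spectrum of its own class."""
    Xc = X.astype(float)
    for c in np.unique(labels):
        mask = labels == c
        Xc[mask] -= Xc[mask].mean(axis=0)
    return Xc

def spa_chain(Xc, start, length):
    """Build one chain of `length` column indices, starting at column `start`."""
    Z = Xc.astype(float)
    chain = [start]
    for _ in range(length - 1):
        z = Z[:, chain[-1]]
        # Project all columns onto the subspace orthogonal to the last selected column.
        Z = Z - np.outer(z, z @ Z) / (z @ z)
        norms = np.linalg.norm(Z, axis=0)
        norms[chain] = -1.0                   # never select a column twice
        chain.append(int(np.argmax(norms)))   # largest residual norm = least collinear
    return chain

def candidate_subsets(chain):
    """Nested subsets using the first 1, 2, ..., len(chain) variables of a chain."""
    return [chain[:m] for m in range(1, len(chain) + 1)]
```

Running `spa_chain` once for every possible starting column and passing each chain to `candidate_subsets` enumerates the candidate subsets that Phase 2 then scores on the validation set.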
The cost function is defined as

$$G = \frac{1}{K_V} \sum_{k=1}^{K_V} g_k, \tag{1}$$

where $K_V$ is the number of validation samples and

$$g_k = \frac{r^2(\mathbf{x}_k, \boldsymbol{\mu}_{I_k})}{\min_{I_j \neq I_k} r^2(\mathbf{x}_k, \boldsymbol{\mu}_{I_j})}. \tag{2}$$

In (2), the numerator is the squared Mahalanobis distance between the $k$th validation sample $\mathbf{x}_k$ (a row vector) and the center $\boldsymbol{\mu}_{I_k}$ of its true class $I_k$, calculated over the training set by the formula

$$r^2(\mathbf{x}_k, \boldsymbol{\mu}_{I_k}) = (\mathbf{x}_k - \boldsymbol{\mu}_{I_k})\, \mathbf{S}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}_{I_k})^{T}, \tag{3}$$

where $\mathbf{S}$ is a pooled covariance matrix calculated on the training set, instead of a separate estimate for each class. To have a well-posed problem, the number of training samples should be larger than the number of variables included in an LDA model; otherwise, the estimated $\mathbf{S}$ will be singular, which makes it impossible to calculate the matrix inverse. The denominator in (2) corresponds to the squared Mahalanobis distance between $\mathbf{x}_k$ and the mean of the nearest wrong class. A small value of $g_k$ indicates that $\mathbf{x}_k$ is close to the center of its true class and distant from the centers of all other classes. The cost function $G$ is the average of $g_k$ over all samples in the validation set, so minimizing $G$ leads to better separation of samples of different classes.
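The cost function described above, the average over the validation set of the ratio between the squared Mahalanobis distance to the true class center and that to the nearest wrong class center, can be sketched in NumPy as follows. The function names `pooled_covariance` and `spa_lda_cost` are hypothetical, and the divisor used for the pooled covariance (number of training samples minus number of classes) is an assumption; the ratio is unaffected by this scaling.

```python
import numpy as np

def pooled_covariance(X, labels):
    """Pooled within-class covariance estimated on the training set."""
    classes = np.unique(labels)
    Xc = X.astype(float)
    for c in classes:
        mask = labels == c
        Xc[mask] -= Xc[mask].mean(axis=0)
    # Assumed divisor: training samples minus number of classes.
    return Xc.T @ Xc / (len(X) - len(classes))

def spa_lda_cost(X_train, y_train, X_val, y_val):
    """Average misclassification-risk ratio over the validation samples."""
    classes = np.unique(y_train)
    means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    S_inv = np.linalg.inv(pooled_covariance(X_train, y_train))
    ratios = []
    for x, label in zip(X_val, y_val):
        d = x - means                                  # differences to every class center
        r2 = np.einsum("ij,jk,ik->i", d, S_inv, d)     # squared Mahalanobis distances
        own = classes == label
        ratios.append(r2[own][0] / r2[~own].min())     # true class vs nearest wrong class
    return float(np.mean(ratios))
```

In a variable-selection loop, `X_train` and `X_val` would be restricted to the columns of one candidate subset, for example `X_train[:, subset]`, and the subset giving the smallest cost would be retained.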