Journal of Chemistry

Volume 2015 (2015), Article ID 619685, 8 pages

http://dx.doi.org/10.1155/2015/619685

## Cancer Discrimination Using Fourier Transform Near-Infrared Spectroscopy with Chemometric Models

^{1}Key Lab of Process Analysis and Control of Sichuan Universities, Yibin University, Yibin, Sichuan 644000, China

^{2}Yibin University Hospital, Yibin, Sichuan 644000, China

^{3}The First Affiliated Hospital, Chongqing Medical University, Chongqing 400016, China

Received 9 March 2015; Revised 25 May 2015; Accepted 27 May 2015

Academic Editor: Iciar Astiasaran

Copyright © 2015 Hui Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Near-infrared (NIR) spectroscopy offers many potential advantages as a tool for biomedical analysis since it enables the subtle biochemical signatures related to pathology to be detected and extracted. In conjunction with advanced chemometrics, NIR spectroscopy opens the possibility of its use in cancer diagnosis. This study focuses on the application of NIR spectroscopy and classification models for discriminating colorectal cancer. A total of 107 surgical specimens and a corresponding NIR diffuse reflection spectral dataset were prepared. Three preprocessing methods were attempted and least-squares support vector machine (LS-SVM) was used to build a classification model. The hybrid preprocessing of first derivative and principal component analysis (PCA) resulted in the best LS-SVM model, with sensitivity and specificity of 0.96 and 0.96 for the training set and 0.94 and 0.96 for the test set, respectively. The similar performance on both subsets indicated that overfitting did not occur, supporting the robustness and reliability of the developed LS-SVM model. The area under the receiver operating characteristic (ROC) curve was 0.99, demonstrating once again the high predictive power of the model. The results confirm the applicability of the combination of NIR spectroscopy, LS-SVM, PCA, and first derivative preprocessing for cancer diagnosis.

#### 1. Introduction

Cancer is a group of genetic diseases in which cells grow and divide uncontrollably [1]. In recent years, cancer has become one of the leading causes of death [2]. It is estimated that about 5.2 million patients die from cancer per year in the world and more than half a million patients die in developing countries [3]. One of the focal tasks of cancer-related research is to develop inexpensive, accurate, fast, and convenient diagnostic methods to improve the survival probability. As far as cancer diagnosis is concerned, histopathological evaluation is still the gold standard [4]. It involves the sampling of tissues and examination by pathologists. Such a procedure includes tissue staining and morphological pattern recognition. In the process of tissue transformation, substantial modifications at the molecular level are expected to occur before visible morphological changes become apparent. Moreover, histopathological evaluation mainly depends on the expert’s judgment and experience, and its result is therefore subjective to some extent [5]. There is an urgent need to develop novel and objective methods for cancer diagnosis. In recent years, great emphasis has been placed on seeking diagnostic methods that may enhance the ability to differentiate between normal and cancerous tissues.

Various molecular spectroscopic techniques, as powerful tools for investigating chemical changes at the molecular level, have been applied to detect cancer [6–8]. For example, infrared (IR) spectroscopy techniques are widely applied to analyze biological tissues; the resulting spectra are composed of characteristic bands originating from all vibration modes of the biomolecular components present in the tissue, including proteins, lipids, and nucleic acids. Each biomolecule gives a characteristic IR spectrum, which contains infrared information on the various functional groups in the molecule. The frequency and intensity of spectral bands are related to the vibration frequencies and polar properties as well as the relative concentration. The total spectrum of a tissue can convey information on the characteristic changes in molecular composition and structure that accompany the transformation from the normal to the cancerous state.

In recent years, near-infrared (NIR) spectroscopy has attracted attention for the biomedical study of several diseases including cancers [9–12]. In particular, the advance of fiber optic probes provides the possibility of applying the NIR technique to noninvasive and minimally invasive ways of diagnosing disease. The NIR region can be partitioned into short-wave and long-wave intervals [13]. The short-wave NIR region mainly provides information concerning tissue blood flow and oxygen saturation and consumption, since the heme proteins and cytochromes dominate the spectrum. The long-wave NIR region corresponds to the combination and overtone vibrations of C-H, N-H, and O-H groups and thus provides information on the chemical composition of tissues. The NIR spectrum of a tissue actually reflects the structures and concentrations of various components. Cancerous tissues differ from normal ones in composition, physiology, and biochemistry. Any alteration in the composition of the tissues can thus be detected and used for diagnostic purposes. Some researchers have applied NIR spectroscopy to studies on gastric [14], breast [15], prostate [16], pancreas [17], and colorectal [18] tissues. Unlike the IR spectrum, the NIR spectrum corresponds to the overtone and combination bands of various molecules; a tissue sample often leads to a complex NIR spectrum, which consists of many overlapping, broad, weak, and nonspecific bands. Thus, when applying NIR to cancer diagnosis, it is necessary to resort to appropriate algorithms to extract information and to build a discrimination model. Additionally, for an NIR-based application, the performance of a model, such as its accuracy and robustness, is decisive for its practical applicability. No single algorithm performs best for all cases.
Various algorithms including the k-nearest neighbor classifier [19], linear discriminant analysis (LDA) [20], soft independent modeling of class analogy (SIMCA) [21], cluster analysis (CA), and support vector machine (SVM) [22] have been used for these purposes. In particular, the SVM algorithm has become a powerful tool for nonlinear classification, function estimation, or nonlinear regression and has led to many successful applications. Compared to other methods, the main advantage of SVM is that there are very few parameters to tune or select a priori. The least-squares SVM (LS-SVM) algorithm is a simplified and improved version of traditional SVM [23]. It shares the advantages of SVM, and its additional advantage is that it only needs to solve a set of linear equations, which is much easier and computationally simpler than the quadratic programming problem solved by traditional SVM.

Based on the fact that NIR spectroscopy is a powerful tool for detecting changes at the molecular level, the present work investigates the potential of NIR spectroscopy and classification models for discriminating cancerous colorectal tissues from normal ones so as to build an objective procedure for cancer diagnosis. Three spectral preprocessing methods and their combinations were attempted, and LS-SVM was used to construct classification models. The sensitivity and specificity were used as performance indices. The receiver operating characteristic (ROC) curve was also used for evaluation purposes.
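The performance indices named above can be illustrated with a minimal Python sketch. The function names, the toy labels, and the toy scores are hypothetical and are not taken from the study; the area under the ROC curve is computed here by the rank (Mann-Whitney) formulation, which is one common way of obtaining it and is an assumption about the procedure, not the authors' stated method.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    neg = ~pos
    tp = np.sum(pos & (y_pred == positive))   # positives correctly detected
    tn = np.sum(neg & (y_pred != positive))   # negatives correctly rejected
    return float(tp / pos.sum()), float(tn / neg.sum())

def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise ranking of scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == positive]
    neg = scores[y_true != positive]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 1/2.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Toy data (illustrative only): 1 = positive class, 2 = negative class.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 2, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Sensitivity is the fraction of positive samples predicted as positive, and specificity is the fraction of negative samples predicted as negative; the AUC summarizes ranking quality over all decision thresholds.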

#### 2. Theory

##### 2.1. Support Vector Machine (SVM)

Support vector machine (SVM) is a state-of-the-art classification algorithm [24] with a sound theoretical foundation in statistical learning theory. The SVM optimizes the classification decision function based on structural risk minimization instead of traditional empirical risk minimization and can therefore reduce the risk of overfitting. Due to its advantages, SVM has gained extensive application in both classification and regression tasks. For a binary classification problem, SVM focuses on finding the maximal margin hyperplane between the classes in terms of a subset of the input data, that is, the support vectors. When the input data are not linearly separable, it first maps them into a high-dimensional feature space and then classifies them by the maximal margin hyperplane. Least-squares SVM (LS-SVM) is a modified version of SVM. LS-SVM results in a set of linear equations instead of a quadratic programming problem, which significantly broadens the applications of SVM. A number of excellent introductions to both SVM and LS-SVM can be found in the literature. Here, only a brief description of the main idea of LS-SVM is provided.

##### 2.2. Least-Squares Support Vector Machine (LS-SVM)

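Given a binary classification problem with samples $\{x_i, y_i\}$, $i = 1, \ldots, N$, where $x_i \in \mathbb{R}^{n}$ is the vector of input pattern for the $i$th sample and $y_i \in \{-1, +1\}$ is the corresponding class label, the pattern represented by the subset $y_i = +1$ belongs to class 1 and the pattern represented by the subset $y_i = -1$ belongs to class 2. A final LS-SVM classifier corresponds to the optimal separating hyperplane, which can be found by solving the following minimization problem [23, 25]:

$$\min_{w, b, e} J(w, b, e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2} \tag{1}$$

subject to the equality constraints for the training set

$$y_i \left[ w^{T} \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N,$$

where $\varphi(\cdot)$ is a special map from the input space to a high-dimensional feature space in which samples become linearly separable by a hyperplane defined by the pair ($w$, $b$), $\gamma$ is the relative weight of the error term, which determines the tradeoff between minimizing the training error and minimizing model complexity, and $e_i$ are the error variables, which take noise into account and help avoid poor generalization. LS-SVM thus considers equality constraints instead of the inequalities used in the classic SVM approach. This modification greatly simplifies the problem, so that the LS-SVM solution follows directly from solving a set of linear equations rather than from a convex quadratic program.

It can be noted that the constraints of (1) are equalities, whereas they are inequalities in SVM. The corresponding Lagrangian for (1) is

$$L(w, b, e; \alpha) = J(w, b, e) - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[ w^{T} \varphi(x_i) + b \right] - 1 + e_i \right\},$$

where $\alpha_i$ are the Lagrange multipliers. According to [23], the optimality conditions lead to the following linear system:

$$\begin{bmatrix} 0 & y^{T} \\ y & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix},$$

where $y = [y_1, \ldots, y_N]^{T}$, $1_v = [1, \ldots, 1]^{T}$, and $\alpha = [\alpha_1, \ldots, \alpha_N]^{T}$.

Mercer’s condition is applied within the matrix $\Omega$:

$$\Omega_{ij} = y_i y_j \varphi(x_i)^{T} \varphi(x_j) = y_i y_j K(x_i, x_j), \quad i, j = 1, \ldots, N.$$

Thus, only the kernel function $K(\cdot, \cdot)$ is needed in the training algorithm; the map $\varphi(\cdot)$ never needs to be known explicitly. The LS-SVM classifier can be constructed as follows:

$$y(x) = \operatorname{sign}\left[ \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right].$$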
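To make the training step concrete, the following is a minimal NumPy sketch of an LS-SVM classifier in the standard formulation: it builds the bordered linear system in the bias $b$ and the multipliers $\alpha$, solves it, and classifies by the sign of the kernel expansion. It is not the LS-SVMlab implementation used in the study. The Gaussian (RBF) kernel, its parameterization $\exp(-\lVert a - b \rVert^{2} / (2\sigma^{2}))$, the function names, the toy data, and the values of `gamma` and `sigma2` are all assumptions made for illustration; labels are coded as +1 and −1.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Gaussian kernel K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma2)) (assumed form)."""
    sq_a = np.sum(A**2, axis=1)[:, None]
    sq_b = np.sum(B**2, axis=1)[None, :]
    d2 = np.maximum(sq_a + sq_b - 2.0 * A @ B.T, 0.0)  # squared distances
    return np.exp(-d2 / (2.0 * sigma2))

def lssvm_train(X, y, gamma, sigma2):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1] for b and alpha."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma2)  # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def lssvm_decision(X_train, y_train, b, alpha, X_new, sigma2):
    """Decision values sum_i alpha_i y_i K(x, x_i) + b; their sign is the predicted class."""
    K = rbf_kernel(X_new, X_train, sigma2)
    return K @ (alpha * y_train) + b

# Toy two-class data (illustrative only, not spectral data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

gamma, sigma2 = 10.0, 1.0  # hypothetical hyperparameter values
b, alpha = lssvm_train(X, y, gamma, sigma2)
pred = np.sign(lssvm_decision(X, y, b, alpha, X, sigma2))
```

Because the constraints are equalities, every training sample generally receives a nonzero multiplier, so the solution is not sparse as in the classic SVM; in exchange, training reduces to a single call to a linear solver.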

#### 3. Experimental

##### 3.1. Sample Preparation

All original tissues from patients who suffered from malignancies of the colon were supplied by the Affiliated Hospital of North Sichuan Medical College and the First People’s Hospital of Yibin City. All the colorectal tissues were fixed in 4% formaldehyde solution and then dehydrated in a series of ethanol solutions with a concentration gradient. The samples were put into xylene, embedded in paraffin wax, and then sliced into sections with a thickness of 4 *μ*m. The tissue samples from patients with colorectal cancer were carefully selected to include both normal and malignant areas. By this means, a total of 107 tissue samples of patients, thirty-nine from females and sixty-eight from males, were selected. Because it was difficult to collect spectra from very small samples, such samples were excluded; as a result, only 107 cancer tissue specimens and 55 normal specimens were used in the experiment. The mean age of the patients in this study was 63.4 years. The oldest patient was 89 years old and the youngest was 25 years old. Diagnosis was confirmed by a pathologist through histopathological examination.

##### 3.2. Instrument and Measurement

The NIR measurements were performed on a Fourier transform near-infrared (FT-NIR) spectrometer (Thermo Fisher, USA) equipped with an InGaAs detector. The paraffin sections were placed on the integrating sphere of the NIR spectrometer. Assisted by a special mirror, the NIR diffuse reflection spectra were collected in the range from 10,000 to 4,000 cm^{−1}. Each spectrum was obtained by co-adding 64 scans, and the resolution was set to 2 cm^{−1}. As a result, each spectrum consisted of 3112 variables. The data format was set as . The collection procedure was controlled by the Result software of Thermo Fisher. The background spectrum was recorded every 1 h with air as the reference. To simplify the calculation, the number of variables was reduced to 1137 by taking the average of every three adjacent points.
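The variable reduction by averaging adjacent points can be sketched as follows. This is an illustrative NumPy example on a toy vector, assuming non-overlapping windows and assuming that trailing points which do not fill a complete window are dropped; the source does not state how the window boundaries were handled, and the function name is hypothetical.

```python
import numpy as np

def block_average(spectrum, width=3):
    """Average every `width` adjacent points using non-overlapping windows.

    Trailing points that do not fill a complete window are dropped (assumption).
    """
    spectrum = np.asarray(spectrum, dtype=float)
    n_blocks = spectrum.size // width
    return spectrum[:n_blocks * width].reshape(n_blocks, width).mean(axis=1)

# Toy intensity vector (illustrative only); the last two points are dropped.
toy = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0, 20.0])
reduced = block_average(toy)
```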

##### 3.3. Dataset Partitioning

Each tissue slide corresponds to a spectrum and was assigned a label (1 for cancer and 2 for normal). A spectrum and its label constitute a so-called sample. In this study, a total of 107 and 55 samples belong to the cancer category and the normal category, respectively. In order to construct and validate the discrimination models, the dataset was divided into two independent subsets, that is, the training and test sets, by the Kennard-Stone (KS) algorithm [26] coupled with alternate sampling. First, all cancer spectra/samples were sorted by the KS algorithm, which first picks out the two spectra that are most distant from each other as starting points. Subsequently, at each step, it selects the spectrum that exhibits the largest minimum distance to the samples already selected. By this means, a sequence is generated. Then, the sequence is alternately sampled. As a result, the training set and the test set have the same number of samples. Similarly, all normal spectra were assigned to the training and test sets. All calculation and modeling were performed in MATLAB 7.0 for Windows using the LS-SVMlab1.5 toolbox.
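The partitioning procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes Euclidean distance between spectra, and the choice of which alternate positions go to the training set (first, third, fifth, and so on) is an assumption. The function names and the toy data are hypothetical. In the study, the procedure would be applied to each class separately.

```python
import numpy as np

def kennard_stone_order(X):
    """Return sample indices in Kennard-Stone selection order (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances computed from the Gram matrix.
    sq = np.sum(X**2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Start with the two samples that are most distant from each other.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    order = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in order]
    # min_dist[k]: distance from sample k to its nearest already-selected sample.
    min_dist = np.minimum(dist[i], dist[j])
    while remaining:
        # Pick the sample with the largest minimum distance to the selected set.
        nxt = max(remaining, key=lambda k: min_dist[k])
        order.append(nxt)
        remaining.remove(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return order

def alternate_split(order):
    """Sample the KS sequence alternately: positions 1, 3, 5, ... train; 2, 4, 6, ... test."""
    return order[0::2], order[1::2]

# Toy one-variable "spectra" (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [5.0], [10.0]])
order = kennard_stone_order(X)
train_idx, test_idx = alternate_split(order)
```

Sampling the sequence alternately spreads both extreme and central spectra over the two subsets, so the training and test sets cover a similar region of the spectral space.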

#### 4. Results and Discussions

##### 4.1. Analysis of Spectral Profile

For diagnostic applications, data preprocessing prior to pattern classification is essential for enhancing the performance of the classification model. There are a number of mathematical techniques for preprocessing spectral data. However, due to the lack of general guidelines, choosing the appropriate preprocessing technique is still a matter of trial and error. In this study, standard normal variate (SNV), first derivative, and PCA were considered. Figure 1 shows the first derivative-based and SNV-based preprocessed spectra and the mean spectra of cancerous and normal samples. First derivative preprocessing yields a derivative spectrum, that is, the differentiation of the original spectrum. The main advantage of the derivative spectrum is that it highlights and resolves overlapped spectral bands. SNV, a mathematical transformation of the original spectra, is capable of removing the slope variation and correcting scatter effects. It is noticeable that the visual spectral difference between cancerous and normal samples was very small. SNV preprocessing can remove unwanted background variance to some extent. However, as shown in Figure 1(d), the subtle difference between the mean spectra may not be enough to construct a satisfactory classification model. In Figure 1(c), the difference between cancerous and normal samples is more easily perceived in the region of 5500–7000 cm^{−1}; the differences include peak shape and intensity. The peaks in this region can be ascribed to the first overtone of C-H stretching (5500–6000 cm^{−1}), the first overtones of O-H and N-H bonds, and C-H combinations. Such results are reasonable because the main differences in composition between cancerous and normal tissues come from DNA, water, protein, and lipids. In fact, most of the bands in the NIR region originate from the vibration modes of various functional groups in the molecules of cellular constituents in tissues and cells.
An NIR spectrum is actually a mixture of the signatures of many components such as water, proteins, lipids, and carbohydrates. The spectrum of a cancerous tissue differs from that of a normal one as a result of characteristic structural alterations at the molecular level. These alterations include an increase of the ratio of nuclei to cytoplasm, an increase of the relative DNA amount, an enhancement of phosphorylation in proteins, a decrease of the relative RNA amount, and a loss of hydrogen bonding of C-OH in the amino acid residues of proteins; therefore, for a given biological sample, an NIR instrument can generate a characteristic spectrum, which is expected to contain a vast amount of potentially useful diagnostic information.
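The three preprocessing techniques considered in this section can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' procedure: the first derivative is approximated by simple finite differences between adjacent points (the source does not specify the differentiation method), PCA is computed by singular value decomposition of the mean-centered data, and the number of components, the function names, and the random toy spectra are hypothetical.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def first_derivative(spectra):
    """First derivative along the spectral axis by finite differences (assumption)."""
    return np.diff(np.asarray(spectra, dtype=float), axis=1)

def pca_scores(X, n_components):
    """Project mean-centered data onto its leading principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy "spectra": 6 samples x 50 variables (random, illustrative only).
rng = np.random.default_rng(0)
spectra = rng.normal(size=(6, 50)).cumsum(axis=1)

snv_spectra = snv(spectra)
# Hybrid preprocessing: first derivative followed by PCA compression.
scores = pca_scores(first_derivative(spectra), n_components=3)
```

In the hybrid scheme, the derivative step sharpens overlapped bands, and PCA then compresses the many correlated variables into a few uncorrelated scores that can serve as classifier inputs.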