Journal of Analytical Methods in Chemistry

Volume 2016, Article ID 5416506, 8 pages

http://dx.doi.org/10.1155/2016/5416506

## A New Local Modelling Approach Based on Predicted Errors for Near-Infrared Spectral Analysis

^{1}School of Instrumentation Science & Opto-Electronics Engineering, Beihang University, Beijing 100191, China

^{2}Beijing Key Laboratory for Optoelectronic Measurement Technology, Beijing Information Science & Technology University, Beijing 100192, China

Received 10 March 2016; Accepted 7 June 2016

Academic Editor: Karoly Heberger

Copyright © 2016 Haitao Chang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Over the last decade, near-infrared spectroscopy, together with chemometric models, has been widely employed as an analytical tool in several industries. However, most chemical processes or analytes are multivariate and nonlinear in nature. To address this problem, this paper presents a local errors regression method for building an accurate calibration model, in which a calibration subset is selected by a new similarity criterion that takes into account the full information of spectra, chemical property, and predicted errors. After the calibration subset is selected, partial least squares regression is applied to build the calibration model. The performance of the proposed method is demonstrated on a near-infrared spectroscopy dataset of pharmaceutical tablets. Compared with other local strategies based on different similarity criteria, the proposed local errors regression yields a significant improvement in both prediction ability and calculation speed.

#### 1. Introduction

Near-infrared (NIR) spectroscopy plays an important role in the analysis of complex samples and chemical processes owing to its simplicity, rapidity, and nondestructive measurements [1–4]. As a result, multivariate calibration methods that relate property ($Y$) and spectra ($X$) have been extensively used in the quantitative analysis of NIR spectroscopy. Many authors have stated that the choice of an appropriate calibration method is one of the key factors that influence the prediction of the property of query samples [5].

Multivariate calibration methods such as multiple linear regression (MLR), principal component regression (PCR) [6], and partial least squares (PLS) [7] regression were the first to be adopted [2, 8]. However, these methods rely on linear statistical models whose assumptions are not always met in real-life situations, and they therefore cannot efficiently model a nonlinear relationship between $X$ and $Y$. To solve this problem, different authors have demonstrated that nonlinear algorithms such as Artificial Neural Network (ANN) [9] and Least Squares Support Vector Machines (LS-SVM) [10] can produce better results than traditional linear methods, especially when used together with large NIR spectral libraries [11].

The approach presented in this paper is a local method. Local methods have been widely used in near-infrared spectroscopy analysis because of the simplicity of the models they construct and their ability to cope with nonlinearities [12–14]. The essential idea of a local method is to develop, for each query sample whose properties are to be predicted, a specific calibration subset of spectrally similar samples and to build a calibration model on those selected relevant samples.

The key issue in local learning is how the similarity criterion should be constructed. In general, the most commonly used similarity measures in NIR spectroscopy are the Euclidean distance (ED) [15], the Mahalanobis distance (MD) [16], and the spectral angle mapper (SAM) [17] distance, computed in the spectral space or the principal component space [18]. Moreover, computation of the principal components-Mahalanobis (PC-M) [19] distance has become the standard procedure for NIR distance measurements. However, the samples are usually multivariate and influenced by several compositional attributes, which are expressed as highly overlapped and nonspecific NIR absorption or reflectance [20]. For this reason, samples that are very close in $X$ space are frequently not close (or similar) in $Y$ space. A traditional similarity criterion, which uses only $X$ information to select relevant samples, may therefore waste information and select samples inaccurately [21]. To overcome this problem, some supervised or semisupervised methods have been developed that take into account information from both $X$ and $Y$. Recently, another reliable similarity estimation called supervised locality preserving projection (SLPP) [22] has been successfully used in the local approach to enhance the accuracy of similarity measurement [21, 23].

As a simple linear approximation technique, a local method can address nonlinearity through locally linear models. The selection step of a local method is therefore aimed at finding samples whose $X$ and $Y$ are linearly related. Unfortunately, the similarity criteria mentioned above can only represent closeness, not the true linear relationship between samples in the spectra and property spaces.

The objective of the work described in this paper is to develop a high-performance local method for modelling NIR spectral data. The proposed local errors regression uses SLPP and a similar-errors strategy to find the optimized calibration subset for each query sample. The main contribution of this paper is to take into account the prediction errors of the global method during the selection step of the local method. This step ensures that the relationships between $X$ and $Y$ are highly linear in both the calibration subset and the query sample. After such a selection, PLS without cross-validation is adopted for local modelling, which accelerates prediction and reduces computational complexity. The effectiveness and accuracy of the proposed method are investigated and compared using NIR spectra datasets of pharmaceutical tablets.

#### 2. Methodology

The local errors regression consists of two steps: the first is the selection of relevant samples, and the second is the building of a calibration model for each query sample. This section describes the details of the proposed method and of the other algorithms used for comparison.

##### 2.1. Calibration Subset Selection

The main goal of this step is to discover which samples in the calibration set “resemble” the query samples to be predicted. In the proposed method, the search for the calibration subset is carried out using the similarity of prediction errors, which indicates how linear the relationship between spectra and property is relative to the samples for prediction. In this context, the error-range parameter needs to be determined so that enough samples are selected to obtain reliable models. It should be noted that there might not be enough samples that are linear with the query sample to give an acceptable prediction. In this case, a definition of similarity that considers information from both $X$ and $Y$ is adopted to construct the calibration subset. Although the proposed similarity criterion should perform better than the traditional method, it cannot be used directly to select the relevant samples for query samples, because query samples contain only spectral information. In this paper, an SLPP technique is therefore employed to find the nearest sample, whose properties are approximately equal to, and interchangeable with, those of the query sample.

Since the spectra and property have different dimensions, it is not appropriate to find the nearest sample by a linear combination of the EDs in spectral space and property space. Thus, SLPP is used in this study, which not only selects the most similar sample by considering both spectra and property information but also reduces the computational load.

The rationale behind this approach is the assumption that samples that are closest in the property space are also very similar in the spectra space. Given a set of $n$-dimensional calibration spectral data $x_1, \ldots, x_{m_c}$ in $\mathbb{R}^n$ with corresponding property set $Y_c$, and $n$-dimensional prediction spectral data with corresponding property set $Y_p$. Here $X_c$ is the $n \times m_c$ matrix containing the spectral responses of $m_c$ samples; $X_p$ is the $n \times m_p$ prediction matrix containing the spectral responses of $m_p$ samples; $Y_c$ is an $m_c \times q$ matrix; and the $m_p \times q$ matrix $\hat{Y}_p$ is the predicted properties of the prediction set with global PLS regression, $q$ being the number of components. Let the spectra set be $X = [X_c, X_p]$, containing $m = m_c + m_p$ samples. The algorithmic procedure for the calibration subset selection is stated as follows:

(1) *Constructing a Neighborhood Graph (K Nearest Neighbors)*. The $i$th and the $j$th samples are connected by an edge if the $i$th sample is among the $k$ nearest neighbors of the $j$th sample or the $j$th sample is among the $k$ nearest neighbors of the $i$th sample. Here the distance between the samples is calculated in the property space.

(2) *Computing the Weights*. $W$ is a sparse symmetric $m \times m$ matrix with $W_{ij}$ holding the weight of the edge joining the $i$th and the $j$th samples: $W_{ij} = 0$ if there is no such edge, and $W_{ij} = 1$ if and only if the $i$th and the $j$th samples are connected by an edge.

(3) *Finding the Basis Vectors of the Subspace*. This step aims at finding a transformation matrix $A$ that projects the spectra set $X$ to a low-dimensional set $Z$ in $\mathbb{R}^d$ ($d \ll n$). Suppose that there exists a linear transformation $z_i = a^{T} x_i$, where $a$ is the basis vector. The basis vector is computed by solving the following minimization problem:

$$\min_{a} \sum_{i,j} \left(a^{T} x_i - a^{T} x_j\right)^{2} W_{ij}. \tag{1}$$

The minimization problem can be solved through the following generalized eigenvalue problem:

$$X L X^{T} a = \lambda X D X^{T} a, \tag{2}$$

where $D$ is a diagonal matrix whose entries are the column sums of $W$, $D_{ii} = \sum_{j} W_{ji}$, and $L = D - W$ is the Laplacian matrix. Let the column vectors $a_0, \ldots, a_{d-1}$ be the solutions to (2), ordered according to their eigenvalues, $\lambda_0 < \cdots < \lambda_{d-1}$. Thus, the transformation matrix $A$ and the low-dimensional matrix $Z$ can be calculated as follows: $A = [a_0, a_1, \ldots, a_{d-1}]$, $z_i = A^{T} x_i$, and $Z = A^{T} X$.

Finally, for each query sample, the most similar sample is selected according to the ED in the low-dimensional space $Z$.
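The graph construction in steps (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `knn_graph` and `laplacian` are hypothetical, the edge weights are assumed to be simple 0/1 values, and the eigenvalue problem of step (3) is left out because it would need a numerical linear algebra library.

```python
import math


def knn_graph(props, k):
    """Symmetric 0/1 weight matrix of a k-nearest-neighbour graph.

    props: list of property vectors (tuples of floats), one per sample.
    Two samples are joined if either one is among the k nearest
    neighbours of the other, with distances taken in the property space.
    """
    m = len(props)
    W = [[0] * m for _ in range(m)]
    for i in range(m):
        # Distances from sample i to every other sample, nearest first.
        ranked = sorted(
            (math.dist(props[i], props[j]), j) for j in range(m) if j != i
        )
        for _, j in ranked[:k]:
            W[i][j] = 1
            W[j][i] = 1  # the "or" rule makes the graph symmetric
    return W


def laplacian(W):
    """Graph Laplacian L = D - W, with D the diagonal matrix of column sums."""
    m = len(W)
    degree = [sum(W[i][j] for i in range(m)) for j in range(m)]
    return [
        [(degree[i] if i == j else 0) - W[i][j] for j in range(m)]
        for i in range(m)
    ]


# Toy example (made-up values): four samples with a single property each.
props = [(0.0,), (1.0,), (1.1,), (5.0,)]
W = knn_graph(props, k=1)
L = laplacian(W)
```

Because the weights are symmetric, every row of `L` sums to zero, which is a quick sanity check on the construction.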

##### 2.2. Partial Least Squares Regression

PLS regression has been extensively employed to obtain quantitative models for the prediction of analytes from spectral data. In this paper, PLS is used both in the local selection procedure and in establishing the regression model. The critical step is determining the number of factors that achieves the best prediction. Generally, the optimum number of PLS factors is determined by minimizing the prediction error over cross-validation groups. However, cross-validation is time-consuming and does not consider information from the query sample, so it is not a desirable choice for a local method. In this paper, the calibration subset is selected on the basis of the prediction error of global PLS, so the samples in the subset are highly linear with one another. Consequently, the number of PLS factors has little effect on the performance of the calibration models. Here the number of PLS factors is fixed at a constant value chosen by minimizing the root mean squared error of prediction (RMSEP) on the validation dataset; the experimental verification is presented in Section 4.1.
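To make the idea of a PLS model with a fixed number of factors concrete, the following sketch fits a single-property PLS model without any cross-validation. It assumes the classical NIPALS PLS1 algorithm, which the text does not name, and the function names `pls1_fit` and `pls1_predict` as well as the toy data are hypothetical.

```python
import math


def pls1_fit(X, y, n_factors):
    """Fit PLS1 (one property) with a fixed number of factors via NIPALS."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred spectra
    f = [v - y_mean for v in y]                                # centred property
    factors = []
    for _ in range(n_factors):
        # Weight vector: direction in X most covariant with the residual y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next factor.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        factors.append((w, load, q))
    return x_mean, y_mean, factors


def pls1_predict(model, x):
    """Predict the property of one spectrum by replaying the deflation."""
    x_mean, y_mean, factors = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in factors:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * lj for ej, lj in zip(e, load)]
    return y_hat


# Toy data (made up): y = 2*x1 + 3*x2 + 1, so two factors fit it exactly.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 7.0]]
y = [9.0, 8.0, 19.0, 18.0, 32.0]
model = pls1_fit(X, y, n_factors=2)
prediction = pls1_predict(model, [6.0, 5.0])
```

In a local method, `pls1_fit` would be called once per query sample on its selected calibration subset, which is why avoiding a cross-validation loop for each query can save time.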

The performance of the final calibration models was evaluated in terms of the RMSEP, the Residual Predictive Deviation (RPD), and the correlation coefficient ($R$) in the prediction set. RPD is the ratio between the standard deviation of the reference data and the RMSEP.
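The three figures of merit can be computed as in the sketch below. The function names are hypothetical, the sample standard deviation (divisor $n-1$) is an assumption because the text does not say which form is used, and the numbers are made up for illustration.

```python
import math
import statistics


def rmsep(y_ref, y_pred):
    """Root mean squared error of prediction."""
    n = len(y_ref)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / n)


def rpd(y_ref, y_pred):
    """Residual Predictive Deviation: SD of the reference values over RMSEP.

    Assumption: the sample standard deviation (n - 1 divisor) is used.
    """
    return statistics.stdev(y_ref) / rmsep(y_ref, y_pred)


def correlation(y_ref, y_pred):
    """Pearson correlation coefficient between reference and predicted values."""
    mr, mp = statistics.fmean(y_ref), statistics.fmean(y_pred)
    cov = sum((r - mr) * (p - mp) for r, p in zip(y_ref, y_pred))
    var_r = sum((r - mr) ** 2 for r in y_ref)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_r * var_p)


# Made-up reference and predicted values for four samples.
y_ref = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
scores = (rmsep(y_ref, y_pred), rpd(y_ref, y_pred), correlation(y_ref, y_pred))
```

A lower RMSEP and a higher RPD or correlation coefficient indicate a better calibration model.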

##### 2.3. Other Algorithms

To compare the predictive performance of our local errors method with that of other local approaches, the following similarity criteria were used: ED, angle distance, PC-M distance, and supervised or semisupervised methods. A brief description of these criteria is given as follows.

###### 2.3.1. Euclidean Distance

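The ED between query sample $x_q$ and calibration sample $x_i$ is given by

$$d_{\mathrm{ED}}(x_q, x_i) = \sqrt{(x_q - x_i)^{T} (x_q - x_i)},$$

where $T$ is the transpose operation, $x_q \in X_p$, and $x_i \in X_c$.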

###### 2.3.2. Angle Distance

In this method, the cosine is used to evaluate the similarity between query data $x_q$ and $x_i$ in the spectral space, and the angle distance is defined as

$$\cos\theta_i = \frac{x_q^{T} x_i}{\lVert x_q \rVert_2 \, \lVert x_i \rVert_2}.$$
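As a small illustration of the two spectral-space criteria described in Sections 2.3.1 and 2.3.2, the sketch below computes the Euclidean distance and the cosine of the angle between two spectra. The function names and the toy vectors are hypothetical; real spectra would simply be longer lists.

```python
import math


def euclidean_distance(x_q, x_i):
    """Euclidean distance between a query spectrum and a calibration spectrum."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_q, x_i)))


def cosine_similarity(x_q, x_i):
    """Cosine of the angle between two spectra (1 means the same direction)."""
    dot = sum(a * b for a, b in zip(x_q, x_i))
    norm_q = math.sqrt(sum(a * a for a in x_q))
    norm_i = math.sqrt(sum(b * b for b in x_i))
    return dot / (norm_q * norm_i)


# Two made-up "spectra" that differ only by a scale factor.
x_q = [1.0, 2.0, 3.0]
x_i = [2.0, 4.0, 6.0]
d = euclidean_distance(x_q, x_i)
c = cosine_similarity(x_q, x_i)
```

The example shows why the two criteria can disagree: spectra that differ only by a multiplicative factor have a nonzero Euclidean distance but a cosine of 1.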

###### 2.3.3. Mahalanobis Distance in the Principal Component

The PC-M distance is obtained by computing the Mahalanobis distance (MD) between samples in the spectral principal component space. In this case, the appropriate number of principal components is determined by minimizing the root mean square of the compositional differences between the property and the predicted value of samples in the calibration set. The Mahalanobis distance between query sample $t_q$ and calibration sample $t_i$ is defined as

$$d_{\mathrm{MD}}(t_q, t_i) = \sqrt{(t_q - t_i)^{T} S^{-1} (t_q - t_i)},$$

where $S$ is the covariance matrix in the spectral principal component space of the calibration set, and $t_q$ and $t_i$ are the spectra of the query sample and the calibration sample in the spectral principal component space. For details, see [20].
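The PC-M distance can be sketched as follows, under one stated assumption: the scores come from a principal component analysis of the mean-centred calibration spectra, so the score covariance matrix is diagonal and the Mahalanobis distance reduces to a variance-scaled Euclidean distance. The function name `pc_mahalanobis` and the toy scores are hypothetical.

```python
import math
import statistics


def pc_mahalanobis(t_q, t_i, cal_scores):
    """Mahalanobis distance between two samples in principal component space.

    cal_scores: list of score vectors of the calibration set, used to
    estimate the per-component variance. Assumes uncorrelated components,
    which holds for PCA scores of the calibration set itself.
    """
    variances = [statistics.variance(column) for column in zip(*cal_scores)]
    return math.sqrt(
        sum((q - c) ** 2 / v for q, c, v in zip(t_q, t_i, variances))
    )


# Made-up, uncorrelated two-component scores of four calibration samples.
cal_scores = [(-2.0, 1.0), (2.0, 1.0), (-2.0, -1.0), (2.0, -1.0)]
d = pc_mahalanobis((2.0, 1.0), (-2.0, 1.0), cal_scores)
```

Scaling each component by its variance keeps the first few high-variance components from dominating the distance, which is the main practical difference from the plain ED.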

###### 2.3.4. Supervised or Semisupervised Methods

*(1) Euclidean Distance Based on Both Spectra and Property*. In this method, both spectra and property are used to evaluate the similarity between query data and in the calibration dataset, and the similarity value is defined:where is a weight parameter to balance the importance of spectra and property , , and is the Euclidean distance between and .

*(2) Euclidean Distance in the Low-Dimensional Space*. Here, the similarity measurement considers both $X$ and $Y$ information and uses the SLPP technique to select relevant samples. For details, see [21].

In summary, the detailed implementation of local errors regression is described as follows.

*Step 1. *Compute the predicted properties of prediction set and predicted properties of calibration set with global PLS regression.

*Step 2. *For each query sample, select the most relevant sample from calibration set with SLPP.

*Step 3. *Predefine parameters such as PLS factors and error ranges by minimizing the RMSEP of the validation dataset. Here PLS factors are set to 2, 3, 4, or 5 and the error range is specified as 1.5. A more detailed description is given in Section 4.1.

*Step 4. *For each query sample , calibration subset is selected based on predicted error similarity criteria, and the th sample in calibration set is selected if , where is the property of the th sample in calibration set.

*Step 5. *Build PLS calibration model on the selected relevant samples and predict each query sample.

It should be noted that the number of nearest neighbors and the dimension of the transformation matrix had little effect on the selection of the most similar sample for each query sample. Therefore, in Step 2, the number of nearest neighbors $k$ and the dimension $d$ are set to 50 and 20, respectively. This choice is based on the authors’ experience and is not discussed further.
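The selection in Step 4 can be illustrated with the sketch below. The exact inequality is not recoverable from the text, so the criterion here is an assumption: a calibration sample is kept when its global-PLS prediction error differs from the approximate error of the query sample by no more than the error range, where the query error is approximated using the property of the most similar sample found in Step 2. All names and numbers are hypothetical.

```python
def select_by_error(y_cal, y_cal_pred, y_query_pred, y_nearest, error_range):
    """Indices of calibration samples whose global-PLS error resembles the query's.

    y_cal:        reference properties of the calibration samples
    y_cal_pred:   global-PLS predictions for the calibration samples
    y_query_pred: global-PLS prediction for the query sample
    y_nearest:    property of the most similar calibration sample, used as a
                  stand-in for the unknown query property (assumption)
    error_range:  allowed difference between the two errors
    """
    query_error = y_query_pred - y_nearest
    selected = []
    for i, (y_ref, y_hat) in enumerate(zip(y_cal, y_cal_pred)):
        cal_error = y_hat - y_ref
        if abs(cal_error - query_error) <= error_range:
            selected.append(i)
    return selected


# Made-up numbers: the last calibration sample has a very different error.
y_cal = [10.0, 12.0, 15.0, 20.0]
y_cal_pred = [10.5, 14.0, 14.8, 17.0]
subset = select_by_error(y_cal, y_cal_pred, y_query_pred=11.0,
                         y_nearest=10.4, error_range=1.5)
```

With these values the first three samples are kept and the fourth is dropped, so the local PLS model of Step 5 would be built on three samples. A larger error range keeps more samples at the cost of a less homogeneous subset.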

#### 3. Materials and Experiment

The dataset was obtained from http://software.eigenvector.com/Data/tablets/index.html. The spectra were recorded by Instrument I of a Foss NIR Systems 6500 Spectrometer. The dataset contains 655 transmittance NIR spectra of pharmaceutical tablets, measured between 600 and 1898 nm with a 2 nm sampling step. The initial dataset was split into a calibration set of 460 samples, a prediction set of 155 samples, and a validation set of 40 samples. Table 1 shows the statistical results for the pharmaceutical tablets. All sets have similar averages and standard deviations, which indicates that each of them represents the main variability of the pharmaceutical tablets. A microcomputer (Lenovo) with an Intel Core 2 processor was used for all the calculations. All the algorithms were implemented in MATLAB 2012b. The raw NIR spectra of the pharmaceutical tablets are presented in Figure 1.