#### Abstract

To investigate the feasibility of rapid identification and quality evaluation of Chinese medicinal centipedes using NIR spectroscopy, the qualitative and quantitative analysis models were explored. A PCA-SVC model was optimized to differentiate five species of the genus *Scolopendra*. When the model was validated with the calibration and prediction sets, the prediction accuracy was 100% and 81.82%, respectively; it can meet the requirement for rapid and preliminary identification. Based on nitrogen content detected by the chemical method, and the dimensionality of spectral data reduced with PLS, the quantitative analysis models were successfully built by PLSR and SVR algorithms. After spectra were pretreated and parameters were optimized, the performance, rationality, and prediction ability of the models were validated and evaluated with RMSECV, RMSEP, RMSEE, *R*^{2}, and RPD. Compared with the features and advantages of these two models, the PLS-SVR model had better performance and stronger prediction capacity, and it was finally regarded as the optimal quantitative analysis model to predict nitrogen content. The relative deviation between the predictive value and the reference was 2.69%, and the average recovery was 99.02%, which indicated it has potential for rapid prediction and evaluation of the quality of medicinal centipedes. This research suggested that NIR spectroscopy can be used as a rapid detection method to identify species and evaluate the quality of medicinal centipedes in China.

#### 1. Introduction

Animals of the genus *Scolopendra* are widely distributed in the world, especially in tropical and subtropical areas [1]. In China, there are 14 species which are mainly distributed in the southern region [2]. Recently, the medicinal value of centipedes had become a research hotspot; the venom was reported to be used for relieving pain and anticoagulation [3, 4]. As an important source of Chinese medicinal materials, five species of *Scolopendra* are commonly used in China [2], which were reported to possess analgesic [3], anti-inflammatory [5], and antitumor [6, 7] activities and to improve blood rheology [8]. However, *Scolopendra mutilans* is the only species recorded in Chinese Pharmacopoeia 2015 (ChP 2015), and the other four species are just used in local regions; for instance, *S. multidens* is used in Guangxi and *S. mojiangica* in Yunnan [2]. As the animals of *Scolopendra* are poisonous, the venom and toxic ingredients of some species will bring high risk to humans [9], and the activity and quality of species are different. Therefore, in order to ensure safety and clinical efficacy, a simple, rapid, and accurate method is needed to identify the species and control the quality of medicinal centipedes.

Previously, medicinal centipedes were mostly identified using morphological description, but some similar characteristics were probably shown among closely related species. If samples were damaged or powdered, they were difficult to be identified, and confusion and misuse would be unavoidable. Presently, molecular methods are gradually applied to identify *Scolopendra* species [10]. However, the complexity of operation and high technical requirements make it difficult to obtain rapid and accurate results, especially in mixed samples. Proteins or amino acids are recognized as the main active ingredient of medicinal centipedes [11–13], and their content is usually used for quality evaluation. Because of the diversity of ingredients, and the complexity of chemical determination methods, this measurement is usually cumbersome [14].

Near-infrared (NIR) spectroscopy combined with chemometrics is a fast, nondestructive, and environmentally friendly analysis technique that can realize multicomponent analysis. Nowadays, it is widely used in agriculture and medicine [15–17]. NIR spectroscopy mainly reflects the absorption of overtone and combination peaks containing hydrogen bonds of C-H, O-H, and N-H [16]. Lipids and proteins are considered to be important medicinal components, which are rich in centipedes and show characteristic absorption in the near-infrared region. However, the application of NIR spectroscopy on medicinal centipedes has not yet been reported. In this study, the NIR spectroscopy analysis methods were investigated, a PCA-SVC model was explored to identify the species of *Scolopendra*, and in light of nitrogen content determined by the chemical method, a quantitative model was established for quality prediction using regression algorithms.

#### 2. Materials and Methods

##### 2.1. Instruments and Software

The nitrogen content of samples was determined with the DK 20 Heating Digester (VELP, Italy) and UDK 149 Automatic Distillation Unit (VELP, Italy). Spectra were collected with an MPA FT-NIR spectrometer (Bruker Optics Co., Ltd., Germany) and analyzed using the OPUS 7.5 spectrum analysis software (Bruker), MATLAB R2014a data analysis software (MathWorks, Inc., USA), and Unscrambler 9.7 data analysis software (CAMO Software AS, Norway).

##### 2.2. Samples and Identification

A total of 64 samples from 28 batches have been collected from field surveys or market commodity in China since 2015. All samples were identified into five nominal species according to characteristics recorded by Siriwut et al. [1], Kang et al. [2], Song et al. [18], and Zhang and Wang [19]. The sample information is summarized in Table 1. All samples were kept below −20°C and housed in Hubei University of Chinese Medicine, Wuhan, China.

##### 2.3. Content Determination

After being scanned with a near-infrared spectrometer, the nitrogen content of 50 mg powder of each sample was determined with the semimicro quantitative nitrogen determination method referring to the guideline of ChP 2015. The samples were digested using the DK 20 Heating Digester with a program as follows: 200°C for 5 min, then up to 260°C sustaining for 5 min, 340°C for 5 min, and 420°C for 40 min, and at last cooled down to 200°C. The sample solution was measured using the UDK 149 Automatic Distillation Unit with a program as follows: 50 ml H_{2}O and 20 ml 40% NaOH were added to the digested solution, 20 ml 2% H_{3}PO_{4} was used for receiving, the steam quantity was 50%, the distillation time was 4 min, and then titration was done with 0.025 mol/L H_{2}SO_{4} standard solution (Metrological Testing Technology Research Institute of Shanghai; Batch number 150901).

##### 2.4. Spectra Acquisition

After samples were smashed and dried at 55°C for 24 h, the powder of 2 g of individuals was scanned using the MPA FT-NIR spectrometer with a diffuse reflection integral sphere. The spectra were obtained in a range of 12000∼4000 cm^{−1} by the coaddition of 32 scans at a resolution of 8 cm^{−1}. Each sample was scanned three times, and the average of three spectra was used for analysis. The spectra diagram is shown in Figure 1.

##### 2.5. Spectral Pretreatment Method

Usually, the raw spectrum includes a lot of irrelevant information or noise, which would lead to baseline drift and instability. Therefore, spectrum pretreatment is a critical step in spectral analysis. There are many pretreatment methods, and each has advantages to improve model performance. For instance, vector normalization (VN) can be used to eliminate influences of the optical path change on the spectrum. The derivative methods including the first derivative (FD) and second derivative (SD) are always employed to eliminate spectral difference from baseline [20], while multiple scattering correction (MSC) is commonly performed to process diffuse reflection spectra [21]. In this study, methods such as VN, FD, SD, and MSC or combined pretreatments were employed by OPUS to optimize model performance.

##### 2.6. Spectral Data Compression Method

###### 2.6.1. Principal Component Analysis (PCA) Method

PCA is a commonly used method for data compression. It performs dimensionality reduction of a high-dimensional dataset, while retaining its variation as much as possible. This method can transform a number of possibly correlated variables (the original data matrix) into one or a few important variables (principal components (PCs)) to reveal the internal structure. Each PC is a linear combination of the original data. The new variables are not related to each other, which can eliminate the overlapped part of information. Moreover, these new variables include the most informative dimensions of the original variables without losing too much information. Commonly, the number of PCs is determined by the contribution rate to original variables. When the cumulative contribution rate is more than 85%, the main components can represent most of the information provided by the original variable [22]. In our identification research, the PCA was used to reduce the dimension of original spectral data.

###### 2.6.2. Partial Least-Squares (PLS) Method

The PLS is a new multivariate statistical analysis method. It attempts to recombine the original variables (mainly continuous variables) into a group of new independent comprehensive variables and extracts a few comprehensive variables to reflect the information on the original variables as much as possible. The extracted new variables have good interpretation ability for the dependent variables. During modeling, it not only considers factors of the independent variable matrix (spectral matrix) but also takes the “response” matrix (content matrix) into account. The principal component scores extracted by dimension reduction are used as input variables to avoid multicollinearity, improve stability, and simplify the model. Therefore, the PLS has the ability to simplify the model and characteristics of quick calculation and strong prediction ability, and as one of the most classical data processing tools in multiple correlation regression, it is widely applied in NIR spectroscopy quantitative analysis [23–25].

##### 2.7. Support Vector Machine (SVM) Algorithm

SVM is a powerful supervised learning algorithm that was first proposed by Vapnik [26] and successfully extended by researchers in recent years. It is based on the principle of minimization of structural risk in constructing an optimally separating hyperplane that separates different classes of data. In the process, input vectors are mapped to a newly constructed high-dimensional space, and then parallel hyperplanes are constructed to maximize the interplane distance which separates the data. The SVM can solve nonlinear problems in a higher-dimensional space based on radical basis function (RBF) to construct a linear function. Usually, the SVM algorithm includes both support vector machine for classification (SVC) and support vector machine for regression (SVR); the former is used to solve problems of classification, while the latter is used for regression analysis.

RBF is a commonly used kernel function in the SVM algorithm. It has a strong ability to deal with nonlinear problems. It can be expressed as follows:

RBF has two important parameters in the SVM algorithm, i.e., penalty factor “*C*” and kernel function parameter “,” which have a great influence on model prediction ability, and the values should be determined during the model optimization process. Commonly, the optimization methods include the grid search (GS), particle swarm optimization (PSO), and genetic algorithm (GA). The PSO algorithm simulates the flight foraging behaviors of bird clusters through collaboration among birds to achieve the best objective. The GA is an operation based on biological natural selection and genetic mechanism to realize the optimal result, while the GS iterates through every intersection in the grid to find the best combination of parameters (*C*, ) and makes cross-validation most accurate. The RMSECV was used to guide the optimization of internal parameters.

During modeling of the SVM algorithm, the input data need to be mapped to a higher-dimensional space to realize dimension reduction and regression fitting. So, the data should be firstly pretreated and compressed.

##### 2.8. Model Validation and Evaluation

###### 2.8.1. Qualitative Model

In the PCA-SVC qualitative model, the model performance was evaluated by 3-fold cross-validation (3-CV) of the calibration set. The internal parameters *C* and were optimized with the GS method. When the accuracy of 3-CV reached maximum values, the optimal *C* and were determined. After models were established, the calibration set and prediction set were input to the model, and the prediction accuracies were used as indexes to evaluate the prediction ability.

###### 2.8.2. Quantitative Model

During the process of modeling, the calibration set was used for internal cross-validation to validate model performance, the internal cross-validation adopted 6-fold cross-validation, and the root mean square error of internal cross-validation (RMSECV), coefficient of determination (*R*^{2}), and ratio of performance to deviation (RPD) were taken to guide the model optimization process. The predication set was used for external validation to evaluate the model, with the root mean square error of prediction (RMSEP), *R*^{2}, and RPD taken as indexes to evaluate prediction ability. Generally, the smaller the RMSECV and the larger the *R*^{2} are, the better the model performance would be; the smaller the RMSEP and the greater the *R*^{2} are, the stronger the model prediction ability is. Moreover, when RPD >2, it indicates that the model has excellent reliability. After the model was established, the calibration set used for full cross-validation was input to the model again, and the root mean square error of evaluation (RMSEE) was used as an index to further evaluate the reasonability of the model. Theoretically, the RMSECV value was higher than the RMSEE value, which indicated that the modeling process was reasonable and feasible.

#### 3. Results and Discussion

##### 3.1. Determination of Nitrogen Content

The content of 64 samples was measured. Samples used in the analysis are as follows: 25 specimens of *S. mutilans*, 13 of *S. multidens*, 9 of *S. mojiangica*, 8 of *S. negrocapitis*, and 9 of *S. dehaani*. The nitrogen content of species was between 8.19% and 12.27%, the total mean was 10.0%, and the mean value of each species was 10.25%, 11.04%, 8.19%, 10.58%, and 12.27%, respectively. The results are shown in Table 1.

##### 3.2. Analysis of NIR Spectra

The NIR spectra of samples were scanned in the range of 12000–4000 cm^{−1}; the spectra diagram is shown in Figure 1. It indicated that the characteristic wavenumber was mainly in the range of 9000 to 4000 cm^{−1}, while the spectral characteristics showed high similarity among samples that it was difficult to distinguish species from peak data. Hence, the chemometric method was needed for spectral pretreatment and characteristic information extraction in qualitative and quantitative analysis.

##### 3.3. Qualitative Model Based on PCA-SVC Algorithm

###### 3.3.1. Sample Classification

The spectra of 64 samples were randomly classified into calibration and prediction sets in a proportion of approximately 2 : 1. Finally, 42 samples of the calibration set were used for model establishment, and 22 samples of the prediction set were used for model evaluation. The species were represented with category label numbers 1 to 5. The classification information is shown in Table 2.

###### 3.3.2. Optimization of Pretreatment

In this qualitative analysis, the three methods VN, FD, and SD were used to pretreat the raw spectra. The PCA method was used to reduce dimensions of raw and three pretreated spectra. The accumulative contribution rates of PCs were calculated. The result showed that the contribution rates of the first two PCs (PC1 and PC2) were more than 85%, which can represent most of the spectrum information [22]. Hence, the PC1-PC2 correlation diagram was attempted for a preliminary investigation to differentiate the samples. However, most species overlapped together in space and cannot be discriminated, except *S. mojiangica* in the FD and SD.

To further investigate the influence of different pretreatments, a group of PCA-SVC models was established using the scores of the first 2 PCs as input variables and category labels as output variables. The model performance was evaluated by 3-fold cross-validation (3-CV) of the calibration set. The internal parameters of the SVC algorithm were optimized with the GS method. The values of best *C* and were obtained based on the initial optimization in a range of and and in steps of 5 and then the second fine optimization in an adjusted narrow scope. When the 3-CV accuracy reached the maximum, the optimal values of *C* and were determined. After models were established, the calibration set and prediction set were input to models and predicted, and the prediction accuracies were used to evaluate the prediction ability. As shown in Table 3, it can be seen that “overfitting” and mismatching existed in the raw spectrum model for high accuracy (90.48%) in the calibration set and low accuracy (59.09%) in the prediction set. In contrast, the other three models VN, FD, and SD had nearer accuracies between calibration and prediction sets to possess a relatively rational structure, although the values were not even high. Also, the model of the SD had the highest accuracy among all pretreatments, whether in the calibration or prediction set. Therefore, the SD was regarded as the best pretreatment for its better prediction ability.

###### 3.3.3. Optimization of the Number of Principal Components (NPC)

Although the SD was determined as the optimal pretreatment in a preliminary investigation, the accuracy in the model with scores of the first 2 PCs as input variables was just about 70%, which did not meet the requirement of discrimination. Hence, the best NPC still needs to be optimized. In light of the modeling and SD pretreatment method mentioned above, 10 PCA-SVC models (SVC-5 to SVC-14) were established using the scores of the first 1, 2, 3, …, 10 PCs of the calibration set as input variables. As shown in Table 4, the accuracy improved with the increase of the NPC. When the NPC was 8, the accuracy in the calibration set was 100% and in the prediction set was 81.82%. When the NPC was higher than 8, the accuracy in the calibration set was 100%, while the accuracy in the prediction set did not increase or even decreased. Therefore, number 8 was considered the best NPC, and model SVC-12 was the optimal qualitative model.

###### 3.3.4. Validation and Evaluation of PCA-SVC Model

According to the research above, SVC-12 was determined as the best qualitative analysis model. After the full spectrum was pretreated with the SD and the dimension was reduced with PCA, the model was established using the scores of the first 8 PCs as input variables and category labels as output variables. The internal parameters of best *C* and were optimized with the GS. The initial search was in a range of and and in steps of 5, and then the fine search was in a range of [20, 23] and [12, 16] and in steps of 0.5; when *C* = 5.93164 10^{6} and = 5792.62, the accuracy of 3-CV was 83.33%. When the model was predicted with the calibration set and prediction set, the accuracy was 100% (42/42) and 81.82% (18/22), respectively, which might be accepted for rapid identification. The optimizations of internal parameters are shown in Figure 2, and the predictive results are shown in Figure 3.

**(a)**

**(b)**

**(a)**

**(b)**

##### 3.4. Quantitative Model Based on PLSR and PLS-SVR Algorithms

###### 3.4.1. Partition of Sample Set

In this quantitative analysis, the Kennard–Stone (K-S) algorithm was used to divide 64-sample spectra into the calibration set and prediction set in a proportion of 2 : 1 in the MATLAB R2014a software; 42 samples of the calibration set were used for validation, while 22 samples of the prediction set were used for prediction.

###### 3.4.2. PLSR Model

The partial least-squares regression (PLSR) model is one of the multiple linear regression (MLR) models; it can easily realize the ideal linear relationship between input variables (spectral information) and output variables (ingredient contents) after high dimensions are compressed by PLS. PLSR has the desirable property to analyze data that are strongly collinear (correlated), noisy, and independent variables and also simultaneously model several response variables; now, it has been developed as a standard tool in chemometrics [27].

The full spectral data (12000∼4000 cm^{−1}) were used for modeling. To eliminate noise and other factors, they need to be firstly pretreated. The pretreatments including Raw, VN, FD, FD + VN, MSC, and FD + MSC were applied. After the dimensions were reduced with PLS, the treated spectral data were used as input variables and nitrogen content was the output reference, and a series of PLSR models were established with the Unscrambler 9.7 software.

During the process, the model was validated and evaluated. As shown in Table 5, the RPDs were all over 2 to possess model reliability. All the values of RMSECV were higher than those of RMSEE, which indicated the feasibility of the models. Models PLSR-1, PLSR-4, and PLSR-6 had lower RMSECV and higher *R*^{2} to present good performance, while PLSR-1 had great RMSEP, minimum *R*^{2}, in external validation, and the largest NPC, and its structure is unreasonable. There was no significant difference in performance and prediction ability between PLSR-4 and PLSR-6, but considering the rationality of pretreatment, RMSECV, and RMSEE, the PLSR-6 model was considered the best model.

The optimization of the NPC is an important step during modeling. It can be obtained from the RMSECV-NPC diagram. For instance, in the PLSR-6 model, with the change of the NPC, the RMSECV had different values; when the NPC was 5, the RMSECV had a minimum value, and the model had the best performance. Therefore, the optimal NPC was determined as 5. The optimization is shown in Figure 4. The regression equation of the PLSR-6 model between the principal component scores (SPL_{i}, *i* = 1, 2, …, 5) and the nitrogen content is expressed as follows:

As described above, the best PLSR model was finally determined, the optimized pretreatment was determined as FD + MSC, and the NPC was 5. During modeling, 6-fold cross-validation was used as internal validation to validate the performance, and the predictive ability was evaluated with external validation using the prediction set. The predictive results are shown in Figure 5. The average relative deviation between the predictive value and the reference was 2.71%, and the average recovery was 98.77%.

**(a)**

**(b)**

###### 3.4.3. PLS-SVR Model

Besides ingredient information, the NIR spectroscopy also contains much other information, such as physical and chemical information, which often causes spectral bands seriously overlapped. Actually, in most cases, it shows nonlinear relationship between sample spectra and content. With the development of application of chemometrics, modern intelligent algorithms have attached more attention to NIR spectroscopy analysis for its strong nonlinear fitting ability and obtained preliminary exploration and application. The SVM algorithm is based on statistics to allow obtain a good fitting effect and stable structure. As a result, it becomes a commonly used nonlinear regression algorithm. Compared with the ANN algorithm which is suitable for solving problems of complex mapping and large sample size [28], the SVR model has undergone much application to become a relatively mature model, and it is suitable for small sample size.

In this study, an SVR algorithm combined with dimensions reduced by the PLS was used to establish a nonlinear regression model. When the parameters determined in the PLSR model (the pretreatment was FD + MSC, and dimensions reduced with PLS and NPC were 5) were introduced into the SVM algorithm, the SVR models were performed in the MATLAB R2014a software. The GS and GA were adopted to optimize the internal parameters (*C*, ). The model performance was validated with 6-fold cross-validation using the calibration set, and the prediction ability was evaluated with external validation using the prediction set. As shown in Table 6, the RMSECV in PLS-SVR-2 was 0.34, which is less than 0.4 in PLS-SVR-1, and the *R*^{2} in PLS-SVR-2 was 93.29, which is larger than 91.54 in PLS-SVR-1. Therefore, the PLS-SVR-2 model with internal parameters (*C*, ) optimized with the GS had relatively excellent performance, and it was regarded as the suitable SVR model. The optimization of internal parameters (*C*, ) and predictive results are shown in Figure 6.

**(a)**

**(b)**

###### 3.4.4. Analysis and Evaluation of Quantitative Models

In this study, the linear regression model of PLSR and nonlinear regression model of PLS-SVR were successfully established. As shown in Tables 5 and 6, the prediction ability had no significant difference between the two models. The relative deviations between the predictive value and the reference were 2.71% and 2.69%, respectively, and the average recoveries were 98.77% and 99.02%. Totally, the two optimized models had a reasonable structure and good prediction ability. Both of them could meet the requirements of accuracy and precision of quantitative analysis and could be used for nitrogen content analysis and quality evaluation of medicinal centipedes.

However, the PLSR model was built based on a linear regression algorithm to have characteristics of fast fitting and simple calculation, when the analysis requirements were not too high, and it would be widely used. In contrast, the SVR model was established based on the nonlinear regression algorithm, and it had the strong nonlinear fitting ability. It was shown from Tables 5 and 6 that the values of RMSECV and RMSEP of SVR models were generally less than those of PLSR models, and the values of *R*^{2} of SVR models were commonly higher than those of PLSR models. It was indicated that the PLS-SVR model had perfect performance and better prediction effect than the PLSR model. For this reason, the PLS-SVR model now becomes the most widely used regression model in NIR spectroscopy analysis. Therefore, the PLS-SVR model with internal parameters (*C*, ) optimized with the GS was considered the most suitable NIR spectroscopy quantitative model for nitrogen content analysis of medicinal centipedes, and the PLSR model can act as supplement and verification for the analysis.

#### 4. Conclusions

This study was carried out to explore the feasibility of using the NIR spectroscopy method to rapidly differentiate species and evaluate the quality of Chinese medicinal centipedes. In the qualitative analysis, after spectra were pretreated with the SD, dimensions were reduced with PCA, and internal parameters were optimized with the GS algorithm, a PCA-SVC model was set up using the scores of the first 8 PCs as input variables and category labels as output variables. The optimal model (SVC-12) was validated and evaluated, which could identify five species of medicinal centipedes with an accuracy of 100% (42/42) in the calibration set and 81.82% (18/22) in the prediction set. It could be accepted as an objective, rapid, and auxiliary method for identifying the species of medicinal centipedes. Through the spectra pretreated with FD + MSC, data dimension reduced with PLS, and NPC determined as 5, two best quantitative models of PLSR and PLS-SVR were also successfully determined. During the process of modeling, the RMSECV, *R*^{2}, and RPD of 6-fold internal cross-validation in the calibration set indicated the better performance and stronger modeling capacity. The RMSEP, *R*^{2}, and RPD of external validation in the prediction set proved stronger prediction ability. In addition, to investigate the reasonability of the model, the calibration set used for full cross-validation was input to the models again, and the RMSEE was used as an index. Comparing the characteristics and advantages of two different regression algorithms, the PLS-SVR-2 model had excellent performance and strong prediction capacity, and it was finally considered the most suitable quantitative model of NIR spectroscopy for nitrogen content analysis of medicinal centipedes.

Meanwhile, the pretreatment methods were also optimized in this paper; although the SD was determined in the qualitative model, MSC or its combined methods were applied to pretreat the spectra in quantitative models. The MSC had advantages of weakening or eliminating interference caused by the uneven grain size of solid powder in the diffuse reflection spectrum [29]. In this research, all samples were smashed into powder, and the NIR spectra were obtained with a diffuse reflection spectrum, so the application of FD + MSC was proved to be reasonable.

This study indicated that NIR spectroscopy combined with chemometric algorithms could be successfully used to differentiate species and evaluate the quality of medicinal centipedes in China, which was characterized with rapid, nondestructive, and environmentally friendly properties. However, this study just represented preliminary exploratory research; although 28 batch samples and 64 individuals were conducted, the sample size was still limited. In the future, more samples will be used to improve the prediction ability, and other algorithms will also be considered to simplify the model and improve performance. This study also provided a reference for rapid identification and quality analysis of other animal medicinal materials using NIR spectroscopy.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by a grant from the major drug discovery projects of the National Ministry of Science and Technology of China (no. 2014ZX09304307001).