Rapid Identification of Nine Easily Confused Mineral Traditional Chinese Medicines Using Raman Spectroscopy Based on Support Vector Machine
Mineral traditional Chinese medicines (TCMs) are natural minerals, mineral processing products, and some fossils of animals or animal bones that can be used as medicines. Mineral TCMs are a characteristic part of TCMs and play a unique role in the development of TCMs. Mineral TCMs are usually identified according to their morphological properties such as shape, color, or smell, but it is difficult to separate TCMs that are similar in appearance or smell. In this study, the feasibility of using Raman spectroscopy combined with support vector machine (SVM) for rapid identification of nine easily confused mineral TCMs, i.e., borax, gypsum fibrosum, natrii sulfas exsiccatus, natrii sulfas, alumen, sal ammoniac, quartz, calcite, and yellow croaker otolith, was investigated. Initially, two methods, characteristic intensity data extraction and principal component analysis (PCA), were applied to reduce the dimensionality of the spectral data. The identification model was subsequently built by the SVM algorithm. The 3-fold cross validation (3-CV) accuracy of the SVM model established by extracting characteristic intensity data from spectra pretreated by first derivation was 98.61%, and the prediction accuracies of the training set and validation set were 100%. As for the PCA-SVM model, when the spectra were pretreated by vector normalization and the number of principal components (NPC) was 7, the 3-CV accuracy and prediction accuracies all reached 100%. Both models have good performance and strong prediction capacity. These results demonstrate that Raman spectroscopy combined with the SVM algorithm has great potential for providing an effective and accurate identification method for mineral TCMs.
1. Introduction
Mineral traditional Chinese medicines (TCMs) are an important part of TCMs. Nine commonly used mineral TCMs, namely, borax, gypsum fibrosum, natrii sulfas exsiccatus, natrii sulfas, alumen, sal ammoniac, quartz, calcite, and yellow croaker otolith, differ greatly in chemical composition and pharmacological action. For instance, alumen, which is widely used in China for treating hemorrhoids, eczema, and scabies, has therapeutic functions including removing dampness to relieve itching, stopping bleeding, and preventing diarrhea. Natrii sulfas is used to clear away heat, relax the bowels, and treat periappendicular abscesses. Borax is the main ingredient of the famous Chinese patent medicine, Musk hemorrhoids ointment. Sal ammoniac has some degree of toxicity and should be used with caution.
However, these nine mineral TCMs are highly similar in color, lustre, shape, texture, and cut surface character. The characteristics of these nine mineral TCMs are shown in Table 1. It can be seen that most of them are odorless, white or colorless, transparent to translucent, and irregular in shape. Furthermore, mineral TCMs are usually used in powder form, which makes them even more easily confused with each other. Traditionally, mineral TCMs are identified mainly by visual recognition and physicochemical methods. The former is subjective and may lead to unreliable results, and the latter are time-consuming. Therefore, a faster and more reliable detection technique is required.
Raman spectroscopy is a nondestructive analytical tool that has been employed for the analysis of minerals ever since the discovery of the Raman effect. It is a vibrational spectroscopic technique that measures inelastic light scattering and provides fingerprint information on the vibrational modes of a material. Such information can be used to characterize differences among materials. In recent years, the technique has been investigated extensively in the field of TCMs because it is rapid, convenient, and nondestructive and consumes little sample. Many studies have reported the identification of TCMs using Raman spectroscopy. In these studies, Raman spectroscopy was successfully applied to authentication of raw materials [6–8], characterization of adulterants [9, 10], detection of counterfeits [11, 12], and geographical origin identification of TCMs [13, 14]. These studies obtained good performance, indicating the excellent discriminative ability of Raman spectroscopy techniques for TCMs. At present, spectral technologies such as near-infrared spectroscopy and Raman spectroscopy combined with different types of chemometric algorithms such as principal component analysis (PCA), support vector machine (SVM) [16, 17], and artificial neural network (ANN) [18–20] have been successfully used for the qualitative and quantitative analysis of TCMs. Therefore, in this study, an identification model was established based on Raman spectroscopy combined with SVM to realize the rapid and accurate identification of nine easily confused mineral TCMs.
2. Materials and Methods
2.1. Sample Collection
A total of 108 batches of samples were collected from the Chinese medicine markets of Bozhou in Anhui Province, Yuzhou in Henan, Yulin in Guangxi, Xi'an in Shaanxi, and Zhangshu in Jiangxi, as well as from Jointown Pharmaceutical Group Co., Ltd and Mayinglong Pharmaceutical Group Co., Ltd. Among the samples, there were 14 batches of gypsum fibrosum, 10 batches of alumen, 14 batches of sal ammoniac, 10 batches of quartz, 10 batches of natrii sulfas, 12 batches of natrii sulfas exsiccatus, 14 batches of calcite, 12 batches of yellow croaker otolith, and 12 batches of borax. All samples were identified as authentic by X-ray diffraction (XRD) and chemical methods according to the Chinese Pharmacopoeia (ChP 2015 I).
2.2. Instruments and Software
The instruments used in this study were an XPertPro X-ray diffraction instrument (PANalytical Company) and a Portable i-Raman 475–785H (B&W Tek) spectrometer. The portable Raman spectrometer was connected to a BAC150B (B&W Tek) Raman sampling accessory by a 150 cm optical fiber probe BAC102-785E (B&W Tek). The instrument was equipped with a CleanLaze (B&W Tek) laser emitting at 785 nm with continuously adjustable power from 0 to 420 mW and a thermoelectrically cooled charge-coupled device (CCD) detector, covering a spectral range of 65–2700 cm−1 at a resolution of approximately 3.5 cm−1.
The software used in this study included BWSpec4™ spectral data-acquisition software, OPUS 7.5 spectrum analysis software (Bruker), IBM SPSS Statistics 19 (SPSS Statistics/IBM Corp), MATLAB R2014b (The MathWorks Inc.), and libsvm toolbox (libsvm-3.1, Faruto Ultimate).
2.3. Raman Spectra Collection
Each sample was crushed into powder and then passed through a 100-mesh sieve; 2 g of each sample powder was put into a specimen cup on the stage of the sampling accessory. The spectra were recorded in the range of 70–2695 cm−1. The laser output power was set to 100%. The integration time of each spectrum was adjusted to provide a better Raman signal. Each sample was measured three times, and the average of the three spectra was used as the sample spectrum for analysis.
2.4. Spectra Pretreatment Method
During spectrum collection, the original Raman spectra are often affected by factors that are unrelated to the properties of the test sample, such as sample autofluorescence and variation in the measurement conditions. These factors lead to baseline drift and instability. Therefore, suitable spectrum pretreatment is necessary. In this study, the separate pretreatment methods vector normalization (VN), first derivation (FD), and second derivation (SD) and the combined pretreatment methods VN + FD and VN + SD were applied in OPUS in an effort to optimize model performance.
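As an illustration of the pretreatments named above, the following minimal sketch applies vector normalization and numerical derivatives to a synthetic spectrum. It is not the OPUS implementation: the function names, the synthetic peak, and the use of plain central differences (rather than a smoothing derivative filter) are assumptions made for this example.

```python
import numpy as np

def vector_normalize(spectrum):
    """Mean-center a spectrum, then scale it to unit Euclidean length."""
    centered = spectrum - spectrum.mean()
    norm = np.linalg.norm(centered)
    return centered / norm if norm > 0 else centered

def first_derivative(spectrum, shifts):
    """Central-difference first derivative with respect to Raman shift."""
    return np.gradient(spectrum, shifts)

def second_derivative(spectrum, shifts):
    """Second derivative, obtained by differentiating twice."""
    return np.gradient(np.gradient(spectrum, shifts), shifts)

# Synthetic stand-in: one Gaussian peak at 1008 cm^-1 on a sloping baseline
# (values are illustrative, not measured data).
shifts = np.linspace(70.0, 2695.0, 2048)
peak = np.exp(-0.5 * ((shifts - 1008.0) / 6.0) ** 2)
raw = 500.0 * peak + 0.05 * shifts + 20.0

vn = vector_normalize(raw)             # VN
vn_fd = first_derivative(vn, shifts)   # VN + FD
vn_sd = second_derivative(vn, shifts)  # VN + SD
```

A constant baseline offset is removed by the first derivative and a linear slope by the second, which is why derivative pretreatments help against baseline drift.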
2.5. Spectral Data Compression Method
The high dimensionality of the Raman spectral space results in computational complexity and inefficiency in the optimization and implementation of SVM algorithms. It is therefore necessary to extract spectral features and compress the spectral data in order to create an effective and robust SVM model.
Principal component analysis (PCA) is a common dimension reduction method for spectra, which transforms a number of possibly correlated variables (the spectrum matrix) into a smaller number of variables called principal components (PCs). The new variables are uncorrelated with each other, which eliminates overlapping information. Moreover, these new variables capture the most informative dimensions of the original variables and retain as much useful information as possible. In this research, PCA was conducted in MATLAB 2014b to reduce the dimension of the original spectra and the pretreated spectra, respectively.
The other method used to compress spectral data in this research is extracting the intensity data of characteristic peaks from the complete spectra. The detailed method is as follows. The Raman shifts of the strong Raman peaks of each mineral TCM were selected, and then the intensity data at these Raman shifts in every sample spectrum were extracted as input variables for the SVM model instead of the complete spectra. As shown in Figure 1, one Raman peak was transformed into two peaks after being pretreated with FD, which are equal in intensity and opposite in direction. One Raman peak was transformed into three peaks after being pretreated with SD: two of them toward the positive y-axis direction and one toward the negative y-axis direction. To reduce the amount of data, when selecting characteristic peaks, we selected only the peaks toward the positive y-axis direction for the spectra pretreated with FD and only the peaks toward the negative y-axis direction for the spectra pretreated with SD.
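The PCA compression described above can be sketched with a singular value decomposition. This is a minimal illustration on synthetic data, not the MATLAB code used in the study; `pca_scores` and the random matrix standing in for the spectra are assumptions.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return PC scores and the accumulative contribution rate (ACR) of the PCs."""
    Xc = X - X.mean(axis=0)                      # center each spectral variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    contribution = s ** 2 / np.sum(s ** 2)       # variance share of each PC
    acr = np.cumsum(contribution)
    scores = U[:, :n_components] * s[:n_components]
    return scores, acr

rng = np.random.default_rng(0)
# Synthetic stand-in: 72 "spectra" of 500 points driven by 3 latent factors.
latent = rng.normal(size=(72, 3))
loadings = rng.normal(size=(3, 500))
X = latent @ loadings + 0.01 * rng.normal(size=(72, 500))

scores, acr = pca_scores(X, n_components=6)
# Smallest number of PCs whose ACR reaches 90%.
n_for_90 = int(np.searchsorted(acr, 0.90) + 1)
```

The score columns are mutually uncorrelated by construction, which is the property the text refers to when it says the new variables are not related to each other.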
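The peak-intensity extraction can be sketched as a simple column lookup. The helper name and the three Raman shifts below are illustrative placeholders, not the shifts selected in the study, and the spectra are synthetic.

```python
import numpy as np

def extract_peak_intensities(spectra, shifts, peak_shifts):
    """For every spectrum, take the intensity at the grid point nearest each selected shift."""
    shifts = np.asarray(shifts, dtype=float)
    columns = [int(np.argmin(np.abs(shifts - p))) for p in peak_shifts]
    return np.asarray(spectra)[:, columns]

shifts = np.arange(70.0, 2696.0, 1.0)       # 70-2695 cm^-1 on a 1 cm^-1 grid (assumed)
peak_shifts = [464.0, 1008.0, 1086.0]       # hypothetical characteristic shifts

def gaussian_peak(center, height, width=5.0):
    """Synthetic Raman band used only to build example spectra."""
    return height * np.exp(-0.5 * ((shifts - center) / width) ** 2)

spectra = np.vstack([
    gaussian_peak(464.0, 100.0),   # sample with a single band at 464 cm^-1
    gaussian_peak(1086.0, 50.0),   # sample with a single band at 1086 cm^-1
])
features = extract_peak_intensities(spectra, shifts, peak_shifts)
```

Each spectrum of more than 2000 points is reduced to one value per selected shift, so the feature matrix has as many columns as there are selected peaks.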
2.6. Support Vector Machine
Support vector machine is a powerful supervised learning algorithm that was first proposed by Vapnik and has been successfully extended by a number of other researchers in recent years. It is based on the principle of structural risk minimization and constructs an optimal hyperplane that separates different classes of data. In the process, input vectors are mapped to a high-dimensional feature space, and parallel hyperplanes are then constructed so that the distance between them, which separates the data, is maximized. Details about the SVM classifier can be found in [23, 24]. In this research, SVM was performed in MATLAB 2014b to build qualitative models.
Radial basis function (RBF) was adopted as the kernel function of SVM because of its strong ability to handle nonlinear problems and because it is the standard kernel in the majority of SVM applications. An SVM with the RBF kernel has two important parameters: the soft margin parameter C and the kernel function parameter γ. The values of both parameters should be determined during the model optimization process. Common optimization methods include the grid-search method (GS), the particle swarm optimization algorithm (PSO), and the genetic algorithm (GA). PSO was derived from simulating the foraging behavior of birds. In the PSO system, each alternative solution is regarded as a “particle,” multiple particles coexist and collaborate with each other (much like foraging birds), and each particle “flies” to a better position in the problem space according to its own “experience” or the best “experience” of adjacent particles, so that the optimal solution can be found. GA is a search procedure based on biological natural selection and genetic mechanisms, in which selection, crossover, and mutation serve as the operators. With continuous genetic iterations, variables with good target values are retained, and thus, the optimal result can eventually be reached.
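For reference, the roles of the two parameters can be written out in the standard soft-margin formulation. This is the usual textbook form for a two-class problem, given here as a sketch; the symbols w, b, ξ, and φ are introduced only for this illustration, and how the nine classes were combined from binary classifiers is not stated in the text.

```latex
\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{subject to}\quad
y_{i}\bigl(w^{\top}\phi(x_{i}) + b\bigr) \ge 1 - \xi_{i},\qquad \xi_{i}\ge 0,
\]
\[
K(x_{i}, x_{j}) = \phi(x_{i})^{\top}\phi(x_{j})
= \exp\!\bigl(-\gamma\,\lVert x_{i}-x_{j}\rVert^{2}\bigr),\qquad \gamma > 0 .
\]
```

A larger C penalizes margin violations more heavily, and a larger γ makes the kernel more local, so both parameters control the trade-off between fitting the training set and generalizing.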
3. Results and Discussion
3.1. Sample Sets Partitioning
Each spectrum was assigned a label from 1 to 9 according to its name, and each spectrum together with its label forms a sample for modeling. Four samples were selected randomly from each label as validation set samples (a total of 36 samples), and the remaining 72 samples were used as training set samples (shown in Table 3). The training set is used to build the identification model, while the validation set is used for validation.
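The partition can be sketched as follows. The batch counts come from Section 2.1; the mapping of minerals to the labels 1 to 9, the helper name, and the random seed are assumptions for illustration.

```python
import random

def split_per_class(labels, n_validation_per_class, seed=0):
    """Randomly draw a fixed number of validation samples from every class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    validation = []
    for label in sorted(by_class):
        validation.extend(rng.sample(by_class[label], n_validation_per_class))
    validation_set = set(validation)
    training = [i for i in range(len(labels)) if i not in validation_set]
    return training, sorted(validation)

# Batches per mineral as listed in Section 2.1 (label order is assumed).
batch_counts = [14, 10, 14, 10, 10, 12, 14, 12, 12]
labels = [label for label, n in enumerate(batch_counts, start=1) for _ in range(n)]

train_idx, val_idx = split_per_class(labels, n_validation_per_class=4)
```

With 108 samples and four validation samples per class, this yields 36 validation samples and 72 training samples, matching the counts given above.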
3.2. Spectral Data Compression
Because Raman peaks are distinct and sharp, the intensity data of characteristic peaks can be regarded as characteristic input variables for the SVM model instead of the complete spectra. In order to investigate the feasibility of this method, cluster analysis of the extracted intensity data of the training set samples was performed in SPSS using the between-groups linkage method and squared Euclidean distances. The correct rates of cluster analysis based on the extracted intensity data of the original spectra and the pretreated spectra were compared in order to determine the optimal pretreatment method.
The correct rates of cluster analysis based on complete spectra, the intensity data of original spectra, and the intensity data of pretreated spectra are shown in Table 4. The correct rate based on complete spectra was just 61.11%. The value improved significantly when the intensity data were used. The correct rate based on intensity data extracted from spectra pretreated by FD was the highest (90.28%), and the corresponding sensitivity and specificity values of each mineral TCM are shown in Table 5. It can be seen that the results of cluster analysis based on intensity data extracted from spectra pretreated by FD were good. Figure 3 shows the selected strong peaks of spectra pretreated by FD and the corresponding dendrogram. Seven samples were clustered incorrectly, mainly alumen and natrii sulfas. Both of them are sulfate mineral TCMs, so their Raman spectra are highly similar.
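The clustering step can be illustrated with a naive agglomerative procedure. This sketch is not the SPSS implementation: it assumes that between-groups linkage means average linkage over all cross-cluster pairs, it scores the result by majority label per cluster (the text does not define how its rate was computed), and it runs on synthetic data.

```python
import numpy as np

def average_linkage(X, n_clusters):
    """Agglomerative clustering with average linkage on squared Euclidean distances."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sum(diff ** 2, axis=2)                 # pairwise squared distances
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    assignment = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        assignment[members] = k
    return assignment

def correct_rate(assignment, labels):
    """Share of samples that carry the majority label of their cluster."""
    correct = 0
    for k in np.unique(assignment):
        correct += np.bincount(labels[assignment == k]).max()
    return correct / len(labels)

rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
labels = np.repeat([0, 1, 2], 8)
X = centers[labels] + rng.normal(scale=0.5, size=(24, 5))   # synthetic features

assignment = average_linkage(X, n_clusters=3)
rate = correct_rate(assignment, labels)
```

On real intensity data the groups overlap, so chemically similar classes can end up merged and the rate falls below 100%.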
The result of cluster analysis shows the feasibility of the spectral compression method based on extracting intensity data. However, some samples whose Raman spectra are of high similarity, such as alumen and natrii sulfas, cannot be differentiated by cluster analysis. Therefore, the SVM algorithm was introduced to establish an identification model on the basis of extracted intensity data.
3.3. Establishment of SVM Identification Model
Because the clustering correct rate based on intensity data extracted from spectra pretreated by FD was the highest, SVM models were established using the intensity data extracted from spectra pretreated by FD as input variables and the labels as output variables. All the input variables were first normalized. The kernel parameters were optimized by GS, GA, and PSO, respectively. The model performance was evaluated by 3-fold cross validation (3-CV) of the training set. In the process of optimization, the optimal values of C and γ were those at which the 3-CV accuracy reached its maximum. After the SVM models were established, the training set and validation set were predicted. The prediction ability was evaluated by the prediction accuracies of the training set and validation set.
The modeling results show that the best combination of C and γ (C = 1, γ = 0.5744) was determined by GS; the 3-CV accuracy was 98.61%, and the prediction accuracies of both the training set and the validation set were 100%. The optimization process is shown in Figure 4.
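The grid-search variant of this procedure can be sketched with scikit-learn, whose `SVC` is built on libsvm. This is an illustrative stand-in for the MATLAB workflow, not a reproduction of it: the synthetic features, the min-max scaling, the power-of-two grid, and the fold shuffling are all assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 9, 8, 20    # 72 synthetic training samples
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
X_train = np.vstack([
    c + rng.normal(scale=0.5, size=(n_per_class, n_features)) for c in centers
])
y_train = np.repeat(np.arange(1, n_classes + 1), n_per_class)

# Normalize the inputs, then fit an RBF-kernel SVM.
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [2.0 ** k for k in range(-5, 6)],
    "svc__gamma": [2.0 ** k for k in range(-5, 6)],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)   # 3-fold CV
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

best_C = search.best_params_["svc__C"]
best_gamma = search.best_params_["svc__gamma"]
cv_accuracy = search.best_score_                   # mean 3-CV accuracy at the optimum
train_accuracy = search.score(X_train, y_train)    # refit model on the training set
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the held-out fold does not influence the normalization.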
3.4. Spectral Compression Based on PCA
The spectral data points were compressed from more than 2000 to fewer than 50 by the spectral compression method based on extracting intensity data. The other spectral data compression method investigated in this research is PCA. PCA was conducted in MATLAB 2014b to reduce the dimension of the original spectra and the pretreated spectra. The accumulative contribution rates (ACR) of the PCs are shown in Figure 5.
As shown in Figure 5, for both the original spectra and the pretreated spectra, the ACR of the first 6 PCs reached 90%, which means that the first 6 PCs represent most of the information provided by the complete spectrum. Therefore, PCA-SVM models were established using the first 6 PCs of the original spectra and pretreated spectra to investigate the effect of different spectra pretreatment methods on model performance.
3.5. Establishment of PCA-SVM Identification Model
PCA-SVM models were established using the scores of the first 6 PCs as input variables and the labels as output variables. Table 6 shows the results of PCA-SVM models based on original spectra and pretreated spectra. It can be found that the 3-CV accuracy and the prediction accuracies of the PCA-SVM model based on VN were the highest, so VN is considered as the most suitable pretreatment method for the PCA-SVM model.
However, under the pretreatment of VN, the first 6 PCs may not be optimal for modeling. If the NPC is too small, the established model will be unable to reflect the relation between the sample characteristics and the spectral information, and “under-fitting” will occur. Conversely, if the NPC is too large, the prediction accuracy and generalization ability of the model will be affected, and “over-fitting” will occur. Therefore, model performance should be investigated under different NPCs so as to determine the optimal NPC. In this study, PCA-SVM models were established using the scores of the first 1, 2, 3, …, 8 PCs, respectively, and the results are shown in Table 7.
As shown in Table 7, the best performance and prediction ability of the PCA-SVM model were obtained when the NPC was 7 and the parameters were optimized by GA. The 3-CV accuracy and the prediction accuracies of the training set and validation set all reached 100%. Prediction accuracy declined when fewer PCs were used, and the prediction ability of the models remained unchanged when more PCs were used. The GA parameter optimization process is shown in Figure 6.
The models established with the two spectra compression methods both have good accuracies. Compared with PCA, the spectra compression method based on extracting intensity data has some advantages; for instance, it is easy to learn and to popularize because it avoids complicated calculation. What is more, the extracted Raman shifts and intensity data of Raman peaks could be used to establish a Raman spectra database. However, its disadvantage is that errors in peak position and intensity may reduce the prediction accuracy.
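The NPC sweep can be sketched as a loop over PCA-SVM pipelines. This is an illustration under stated assumptions, not the procedure behind Table 7: the data are synthetic, the SVM parameters are fixed (`C=1.0`, `gamma="scale"`) instead of being optimized by GS, GA, or PSO, and the tie-breaking rule is a choice made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_classes, n_per_class, n_points = 9, 8, 300
class_means = rng.normal(scale=2.0, size=(n_classes, n_points))
X = np.vstack([
    m + rng.normal(scale=0.5, size=(n_per_class, n_points)) for m in class_means
])
y = np.repeat(np.arange(1, n_classes + 1), n_per_class)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_accuracy = {}
for npc in range(1, 9):                      # first 1, 2, ..., 8 PCs
    model = make_pipeline(
        PCA(n_components=npc),
        SVC(kernel="rbf", C=1.0, gamma="scale"),
    )
    cv_accuracy[npc] = cross_val_score(model, X, y, cv=cv).mean()

# Prefer the smallest NPC among those reaching the best 3-CV accuracy.
top = max(cv_accuracy.values())
best_npc = min(n for n, acc in cv_accuracy.items() if acc == top)
```

Choosing the smallest NPC that reaches the best accuracy keeps the model as simple as possible, which is one way to guard against the over-fitting risk described above.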
The overall results indicated that Raman spectroscopy with the SVM method could efficiently identify the mineral TCMs. Compared with other characterization techniques, Raman spectroscopy shows significant advantages. It is more objective and effective than the morphological description method in identifying TCMs that are similar in appearance. It has higher chemical specificity and higher accuracy than the chemical method. For an unknown mineral TCM sample, the chemical method requires at least two experiments, whereas Raman spectroscopy needs only one measurement to obtain the identification result. Moreover, since Raman spectra reflect the structural characteristics of crystals, Raman spectroscopy can distinguish mineral TCMs with the same chemical composition but different crystal structures, such as calcite and aragonite, whose main components are both CaCO3. Furthermore, compared with infrared spectroscopy (IR) or near-infrared spectroscopy (NIR), Raman spectroscopy has the advantages of sharp peaks, high resolution, and high specificity and is not susceptible to water or other interference. X-ray diffraction (XRD) is one of the most effective modern technologies for identification of mineral TCMs. It has high accuracy and good specificity, but its sample preparation requirements are high, and the spectra analysis is complicated. Because Raman spectroscopy is nondestructive and requires no sample pretreatment, it is more convenient and efficient than XRD.
4. Conclusions
The models established using Raman spectroscopy based on the SVM method have excellent performance and strong prediction capacity for identifying these nine easily confused mineral TCMs. It is rapid and convenient to collect spectra of mineral TCMs without sample pretreatment using a portable Raman spectrometer, and the identification result can be obtained immediately once the spectra of samples are loaded into the SVM model. This means that the SVM model can not only meet the need for rapid inspection in the Chinese medicine markets but also be used in the production process to control the quality of raw materials and ensure the safe use of the medication.
The results show the feasibility of the SVM algorithm for rapid identification and quality analysis by Raman spectroscopy. Raman spectroscopy in combination with chemometric algorithms such as PCA, SVM, PSO, and GA has the potential to contribute significantly to the analysis of other mineral TCMs. In future work, more samples should be used to improve the reliability and applicability of the model. A Raman spectra database of mineral TCMs will be created as the number of samples increases. Other algorithms will also be investigated to improve the model performance and expand the application of Raman spectroscopy in the field of TCMs.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This study was supported by the Wuhan Special Fund of Biotechnology and New Medicine from Development Action Plan of High-Tech Industry in 2012 (201260523193).
References
National Pharmacopoeia Committee, China Pharmacopoeia of 2015 Edition, vol. 1, China Medical Science and Technology Press, Beijing, China, 2015.
J. X. Yu, T. L. Lu, C. Q. Mao, and Q. Chen, “Research on anti-inflammation and acute toxicity test in sal ammoniac and halite violaceous,” Journal of Nanjing University of Chinese Medicine, vol. 28, pp. 77–79, 2012.
Editorial Committee of Zhong Yao Da Ci Dian, Zhong Yao Da Ci Dian, Shanghai Scientific and Technical Publishers, Shanghai, China, 2nd edition, 2006.
Editorial Committee of Zhong Hua Ben Cao, State Administration of Traditional Chinese Medicine of China, Zhong Hua Ben Cao, Shanghai Scientific and Technical Publishers, Shanghai, China, 1999.
Y.-H. Liao, C.-H. Wang, C.-Y. Tseng, H.-L. Chen, L.-L. Lin, and W. Chen, “Compositional and conformational analysis of yam proteins by near infrared fourier transform Raman spectroscopy,” Journal of Agricultural and Food Chemistry, vol. 52, no. 26, pp. 8190–8196, 2004.
X. Qi, C. C. Zeng, H. P. Liu, and S. H. Liu, “Rapid quantitative analysis of puerarin using Raman spectroscopy,” Spectroscopy, vol. 28, pp. 34–39, 2013.
Z. J. Zhang, Q. Zhou, J. Z. Wei et al., “Validation of the crystal structure of medicinal realgar in China,” Guang Pu Xue Yu Guang Pu Fen Xi, vol. 31, pp. 291–296, 2011.
Q. E. Wan, H. P. Liu, H. M. Zhang, and S. H. Liu, “Identification of ginseng and its counterfeit by laser Raman spectroscopy,” Guang Pu Xue Yu Guang Pu Fen Xi, vol. 32, pp. 989–992, 2012.
W. N. Wang, D. L. Chen, M. F. Zhu, and H. M. Zhang, “The analysis and identification of Fritillaria cirrhosa by Raman spectra,” Guang Pu Xue Yu Guang Pu Fen Xi, vol. 33, pp. 2109–2111, 2013.
H. Huang, J. Li, R. Chen et al., “Discrimination of Huangqi (Radix Astragali Seu Hedysari) from different producing areas using Raman spectroscopy and statistical analysis,” Journal Fuzhou University, vol. 42, pp. 646–652, 2014.
J. Ming, L. Chen, K. L. Chen, and B. S. Huang, “Qualitative and quantitative analysis of amber adulteration using near infrared spectroscopy based on support vector machine,” Chinese Medicinal Materials, vol. 40, pp. 32–37, 2017.
S. Kokot, Y. Lai, and Y. Ni, “Classification of raw and roasted Semen Cassiae samples with the use of Fourier transform infrared fingerprints and least squares support vector machines,” Applied Spectroscopy, vol. 64, no. 6, pp. 649–656, 2010.
L. Chen, J. Ming, M. Y. Yuan, B. S. Huang, and K. L. Chen, “Quantitative model of Raman spectra for CaF2 in fluoritum based on BP-ANN algorithm,” Chinese Journal of Experimental Traditional Medical Formulae, vol. 22, pp. 77–82, 2016.
Y. Sun, L. Chen, B. Huang, and K. Chen, “A rapid identification method for calamine using near-infrared spectroscopy based on multi-reference correlation coefficient method and back propagation artificial neural network,” Applied Spectroscopy, vol. 71, no. 7, pp. 1447–1456, 2017.
X. D. Zhang, L. Chen, Y. B. Sun, Y. Bai, B. S. Huang, and K. L. Chen, “Determination of zinc oxide content of mineral medicine calamine using near-infrared spectroscopy based on MIV and BP-ANN algorithm,” Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, vol. 193, pp. 133–140, 2017.
K. R. Beebe, R. J. Pell, and M. B. Seasholtz, Chemometrics: A Practical Guide, John Wiley & Sons, London, UK, 1988.
V. Vapnik, The Nature of Statistical Learning Theory, Springer Berlin Heidelberg, New York, NY, USA, 1995.
S. R. Amendolia, G. Cossu, M. L. Ganadu, B. Golosio, G. L. Masala, and G. M. Mura, “A comparative study of K-nearest neighbour, support vector machine and multi-layer perceptron for thalassemia screening,” Chemometrics and Intelligent Laboratory Systems, vol. 69, no. 1-2, pp. 13–20, 2003.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of the IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948, Honolulu, Hawaii, May 2002.
M. Jalali-Heravi and A. Kyani, “Application of genetic algorithm-kernel partial least square as a novel nonlinear feature selection method: activity of carbonic anhydrase II inhibitors,” European Journal of Medicinal Chemistry, vol. 42, no. 5, pp. 649–659, 2007.
J. Ming, L. Chen, K. L. Chen, Y. Cao, and B. S. Huang, “Identification of six white crystal mineral medicines,” Chinese Journal of Experimental Traditional Medical Formulae, vol. 22, pp. 33–38, 2016.
E. Zu, M. Li, and P. Zhang, “Study on jades of SiO2 by Raman spectroscopy,” Journal of Kunming University of Science and Technology, vol. 25, pp. 77-78, 2000.