Robust and Automated Internal Quality Grading of a Chinese Green Tea (Longjing) by Near-Infrared Spectroscopy and Chemometrics
Near-infrared (NIR) spectroscopy and chemometric methods were applied to internal quality control of a Chinese green tea, Longjing, with Protected Geographical Indication (PGI). A total of 2745 authentic Longjing tea samples of three different grades were analyzed by NIR spectroscopy. To remove the influence of abnormal samples, the Stahel-Donoho estimate (SDE) of outlyingness was used for outlier analysis. Partial least squares discriminant analysis (PLSDA) was then used to classify the grades of tea based on NIR spectra. Different data preprocessing methods, including smoothing, taking second-order derivative (D2) spectra, and standard normal variate (SNV) transformation, were performed to reduce unwanted spectral variations in samples of the same grade before classification models were developed. The results demonstrate that smoothing, taking D2 spectra, and SNV can improve the performance of PLSDA models. With SNV spectra, the model sensitivity was 1.000, 0.955, and 0.924, and the model specificity was 0.979, 0.952, and 0.996 for samples of the three grades, respectively. FT-NIR spectrometry and chemometrics can provide a robust and effective tool for rapid internal quality control of Longjing green tea.
Tea is one of the most popular beverages around the world and is favored for its various health benefits [1, 2]. According to the degree of fermentation, teas can be generally classified into three types: unfermented, partially fermented, and fully fermented. In China, although all three types of tea are produced and consumed, green tea is the most favored for its special flavor and taste.
Longjing tea, a green tea produced in Hangzhou and its neighboring areas, has traditionally been recognized as a top-grade green tea for its quality as well as its cultural background [4, 5]. Longjing tea leaves are roasted soon after picking to halt the natural oxidation process. When steeped, the flat and straight leaves produce a yellow-green color. Its flavor and taste are very gentle and sweet, although it has one of the highest concentrations of catechins among teas [4, 5], which is an important indicator of high-quality green tea.
Because Longjing tea has a very high commercial value, quality control is urgently needed to guard against various counterfeit Longjing teas. Internal grading among authentic Longjing teas is the foundation of this quality control. As a green tea with Protected Geographical Indication (PGI), the three producing areas of Longjing are explicitly defined as West Lake and its neighboring areas (I), Qiantang (II), and Yuezhou (III). It has long been recognized that the quality of Longjing tea can be ranked according to its producing area, namely, I, II, and III. Therefore, it is necessary to develop a rapid and effective method to distinguish different grades of Longjing tea.
Recently, near-infrared (NIR) spectroscopy has been extensively used in food quality control [6–8]. Compared with traditional analytical methods, NIR has some advantages, including (1) reduced sample preparation, labor, and cost of analysis; (2) the potential for nondestructive and online analysis; and (3) comprehensive characterization of multiple components. However, because NIR spectra are often characterized by low spectral resolution and serious peak overlapping, chemometric methods are required to extract useful information concerning food quality from the measured signal. Among various pattern recognition techniques, classification methods are the most frequently used. Some commonly used classification or discrimination analysis (DA) methods include support vector machines (SVMs), k-nearest neighbors (KNN), linear discriminant analysis (LDA), and partial least squares discriminant analysis (PLSDA).
This paper aims to develop a rapid analysis method for grading Longjing tea by NIR spectroscopy coupled with PLSDA. Different data preprocessing methods, including smoothing, taking second-order derivative (D2) spectra, and standard normal variate (SNV) transformation, were performed to reduce unwanted spectral variations in samples of the same grade before chemometric models were developed.
2. Experimental and Methods
2.1. Tea Samples and NIR Analysis
A total of 2745 authentic Longjing tea samples of three types were collected from the local tea plants. The detailed information concerning samples is summarized in Table 1. All of the samples were stored in a cool, dark, and dry place with integral packaging before NIR spectroscopy analysis.
Nondestructive NIR analysis of tea was performed using a TENSOR37 Fourier transform NIR spectrometer (Bruker, Ettlingen, Germany) in the wavenumber range of 4000–12000 cm⁻¹. Each sample was measured in a quartz cup without any pretreatment. For each sample, 32 scans were carried out with a resolution of 8 cm⁻¹ at 25°C using OPUS software. Increasing the number of scans did not significantly improve the signal. The average of the 32 scans was saved as a raw spectrum for chemometric analysis.
2.2. Outlier Diagnosis, Data Splitting, and PLSDA
Outliers are abnormal samples that deviate from the bulk of the samples. In classification, outliers not only bias a model but can also lead to misleading estimates of model performance. Given the multivariate nature of NIR spectra, and to avoid the masking effects of multiple outliers, robust diagnostics with dimension reduction are suitable for detecting NIR outliers. The Stahel-Donoho estimate (SDE) of outlyingness was used for outlier diagnosis within each grade of Longjing. SDE projects each high-dimensional sample onto many randomly selected directions. The SDE outlyingness of each object is then computed using a robust location estimator (the median) and a robust scatter estimator (the median absolute deviation, MAD). Objects with especially large SDE outlyingness values were detected as outliers and removed. The number of projections was 500 in this paper.
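The projection scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the usual definition of Stahel-Donoho outlyingness (the largest value of |projection − median| / MAD over all sampled directions), draws directions from a standard normal distribution, and uses the raw MAD without a consistency factor. The function name `sde_outlyingness` and its parameters are hypothetical.

```python
import random
from statistics import median


def sde_outlyingness(samples, n_projections=500, seed=0):
    """Approximate Stahel-Donoho outlyingness using random projections.

    samples: list of equal-length lists (one spectrum per row).
    Returns one outlyingness value per sample.
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    outlyingness = [0.0] * len(samples)
    for _ in range(n_projections):
        # Random direction; no normalization is needed because the
        # ratio |p - median| / MAD does not depend on the direction's length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        projected = [sum(d * x for d, x in zip(direction, row)) for row in samples]
        center = median(projected)
        mad = median(abs(p - center) for p in projected)
        if mad == 0.0:
            continue  # degenerate direction, skip it (an assumption of this sketch)
        for i, p in enumerate(projected):
            outlyingness[i] = max(outlyingness[i], abs(p - center) / mad)
    return outlyingness
```

Applied separately to the spectra of each grade, the returned values could then be compared with a cutoff to flag candidate outliers.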
With outliers removed, the Kennard and Stone (K-S) algorithm was used to divide the measured data into a training set and a prediction set. The K-S algorithm can select a prediction set of objects that are scattered uniformly in the range of training objects. Because the distributions of the three grades of Longjing were different, the K-S algorithm was performed separately on each grade of tea.
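For readers unfamiliar with the K-S selection rule, a minimal sketch follows. It assumes squared Euclidean distance on the spectra and the classic max-min rule (start from the two most distant objects, then repeatedly add the object farthest from its nearest already-selected neighbor). The function name `kennard_stone` is hypothetical, and how the selected indices are assigned to the training or prediction set is left to the caller.

```python
def kennard_stone(samples, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone max-min rule."""
    if n_select < 2 or n_select > len(samples):
        raise ValueError("n_select must be between 2 and the number of samples")

    def dist2(a, b):
        # Squared Euclidean distance between two spectra.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(samples)
    # Step 1: start with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist2(samples[pair[0]], samples[pair[1]]),
    )
    selected = [first, second]
    # min_d[i]: squared distance from sample i to its nearest selected sample.
    min_d = [min(dist2(s, samples[first]), dist2(s, samples[second])) for s in samples]
    # Step 2: repeatedly add the sample farthest from the selected set.
    while len(selected) < n_select:
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_d[i])
        selected.append(nxt)
        for i in range(n):
            min_d[i] = min(min_d[i], dist2(samples[i], samples[nxt]))
    return selected
```

The exhaustive pair search is quadratic in the number of objects, so for data sets of the size used here a vectorized distance matrix would be a more practical choice; the sketch favors readability.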
Suppose that one has an $n \times p$ matrix $X$ of the spectra at $p$ wavelengths for $n$ training objects. For multiclass classification, $n$ is the total number of samples collected from all the $c$ (in this paper, $c = 3$) different classes. A response matrix $Y$ ($n \times c$) is designed corresponding to the category of each object in $X$. All the elements in $Y$ are originally set to $-1$, and if object $i$ is from class $j$, then the element at the $i$th row and $j$th column of $Y$ is assigned a value of 1. Then, PLS models can be developed to predict each column of $Y$ using $X$. For prediction, a new object is classified into class $j$ when the $j$th element of its predicted response vector is above 0.
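The response coding and the decision rule described above can be illustrated as follows. The sketch covers only these two steps; the PLS regression that maps spectra to predicted responses is deliberately omitted. The function names are hypothetical, and classes are assumed to be indexed from 0.

```python
def encode_classes(labels, n_classes):
    """Build the response matrix: +1 in the column of the true class, -1 elsewhere."""
    return [[1 if j == label else -1 for j in range(n_classes)] for label in labels]


def assign_classes(predicted_row, threshold=0.0):
    """Return every class index whose predicted response exceeds the threshold."""
    return [j for j, value in enumerate(predicted_row) if value > threshold]


# Three objects from classes 0, 2, and 1 (grades I, III, and II).
Y = encode_classes([0, 2, 1], n_classes=3)
# A hypothetical predicted response vector for a new object.
print(assign_classes([-0.8, 0.3, -0.1]))
```

Note that, with a threshold at 0 applied to each column independently, an object may in principle be assigned to no class or to more than one class; how such cases are counted is not specified in the text.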
2.3. Model Validation and Evaluation
For PLSDA, an important parameter is the number of latent variables (LVs), or the model complexity. Too many LVs would lead to overfitting and poor generalization performance, while too few LVs would underfit the model. In this paper, Monte Carlo cross validation (MCCV) was used to select the number of LVs in the PLSDA model. The number of PLSDA components was chosen so that the mean percentage error of MCCV (MPEMCCV) was minimized:

$$\mathrm{MPEMCCV} = \frac{\sum_{i=1}^{N} m_i}{N \, n_p} \times 100\%,$$

where $N$ is the number of MCCV data splittings, $n_p$ is the number of prediction samples, and $m_i$ is the number of misclassified samples for the $i$th splitting during MCCV.
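As a rough illustration of the MCCV bookkeeping, the sketch below generates random training/prediction splits and computes the mean percentage error from the per-split misclassification counts. The defaults of 20 splits and a 50% training fraction follow the settings reported in Section 3; the function names are hypothetical, and the classifier that produces the misclassification counts is not shown.

```python
import random


def mccv_splits(n_objects, n_splits=20, train_fraction=0.5, seed=0):
    """Yield (train_indices, prediction_indices) for each random MCCV split."""
    rng = random.Random(seed)
    n_train = int(round(n_objects * train_fraction))
    for _ in range(n_splits):
        indices = list(range(n_objects))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]


def mean_percentage_error(misclassified_counts, n_prediction):
    """Percentage of misclassified prediction samples, averaged over all splits."""
    n_splits = len(misclassified_counts)
    return 100.0 * sum(misclassified_counts) / (n_splits * n_prediction)


# Hypothetical counts: 5, 3, and 4 errors among 100 prediction samples per split.
print(mean_percentage_error([5, 3, 4], n_prediction=100))  # 4.0
```

Repeating this for each candidate number of LVs and keeping the one with the smallest error gives the selection rule described above.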
To compare the performance of classification models, the sensitivity and specificity on the test set were computed for each grade as

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP},$$

where TP, FN, TN, and FP denote the numbers of true positives, false negatives, true negatives, and false positives, respectively. In this paper, objects in each grade were in turn denoted as positives, and the objects of the other two grades were denoted as negatives.
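The one-grade-versus-the-rest counting can be made concrete with a short sketch. It assumes each test object receives exactly one predicted label, which is a simplification; the function name and the example labels are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive_class):
    """Compute sensitivity and specificity, treating one grade as positive."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive_class:
            if pred == positive_class:
                tp += 1  # positive correctly recognized
            else:
                fn += 1  # positive missed
        else:
            if pred == positive_class:
                fp += 1  # negative wrongly accepted
            else:
                tn += 1  # negative correctly rejected
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for six test objects.
truth = ["I", "I", "II", "III", "III", "II"]
pred = ["I", "II", "II", "III", "I", "II"]
print(sensitivity_specificity(truth, pred, positive_class="I"))  # (0.5, 0.75)
```

Calling the function once per grade yields the three sensitivity/specificity pairs of the kind reported for each model.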
3. Results and Discussion
Some of the raw NIR spectra of Longjing tea are shown in Figure 1. As seen in Figure 1, the raw spectra of the three grades of Longjing have very similar absorbance patterns, and the signals are characterized by low absorbance and the presence of baselines. Within each grade, the spectra show considerable variation and may overlap with those of the other grades. Therefore, data preprocessing was needed to reduce the unwanted variations within each grade. Figure 2 shows the spectra preprocessed by smoothing, taking the second-order derivative (D2), and SNV. Spectral smoothing seems to improve the signal-to-noise ratio (SNR) but cannot remove the baselines in the data. Second-derivative spectra enhance local peak differences, for example, around 7200 cm⁻¹. SNV seems to be able to remove most of the within-grade variations.
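To show why SNV can suppress such within-grade variation, the sketch below applies the transformation as it is commonly defined: each spectrum is centered by its own mean and scaled by its own standard deviation. The use of the sample standard deviation (n − 1 denominator) is an assumption, and the function name `snv` and the toy spectra are hypothetical.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    center = mean(spectrum)
    scale = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    return [(a - center) / scale for a in spectrum]


# Two toy "spectra" that differ only by an offset and a multiplicative factor.
base = [1.0, 2.0, 3.0]
shifted_scaled = [2.0 * a + 10.0 for a in base]
print(snv(base))
print(snv(shifted_scaled))
```

Both calls return the same transformed values, which illustrates how a constant offset and a multiplicative scaling of a spectrum are removed by SNV.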
The SDE outlyingness diagnosis plots of the three grades of Longjing are shown in Figure 3. According to the rule, an object with an SDE value above 3 is recognized as an outlier. Accordingly, 4, 9, and 20 objects were removed from grades I, II, and III, respectively, leaving 461, 891, and 1360 objects for grades I, II, and III, respectively. To investigate the effects of data preprocessing on classification performance, all the PLSDA models were trained and tested with the same data sets. The K-S algorithm was performed on the raw data of each grade to obtain training and test objects. Finally, 1800 objects (grade I, 300; grade II, 600; grade III, 900) were used for training and 912 objects (grade I, 161; grade II, 291; grade III, 460) for prediction.
With different preprocessing methods, PLSDA models were developed, and MCCV was performed to estimate the number of latent variables. For MCCV, the original training set was randomly divided into training (50%) and prediction (50%) objects 20 times. The classification results and model parameters of PLSDA with different preprocessing methods are summarized in Table 2. As seen in Table 2, D2 and SNV spectra yielded significantly improved prediction accuracy compared with raw and smoothed spectra. The best classification models were obtained by SNV-PLSDA, with sensitivity/specificity of 1.000/0.979, 0.955/0.952, and 0.924/0.996 for Longjing of grades I, II, and III, respectively. Figure 4 presents the misclassification results for the different preprocessing methods. For most of the models, the classification sensitivity and specificity for each grade of tea were above 0.9, indicating the effectiveness of NIR for characterization and classification of Longjing. Moreover, D2 and SNV can reduce unwanted variations by removing part of the baseline and scattering effects; therefore, D2 and SNV should be preferred for spectral preprocessing.
Rapid and reliable internal quality control of Longjing green tea was performed using NIR analysis and chemometrics. Comparison of different preprocessing methods demonstrates that SNV and D2 transformations can effectively reduce unwanted spectral variations within each grade of tea. NIR analysis and pattern recognition methods show potential for nondestructive and rapid discrimination of the internal quality grades of Longjing. A practical problem is the seasonal and year-to-year variation in the chemical composition of green tea. Therefore, our future work will focus on developing quality control models for Longjing tea from different harvest seasons and years.
X.-S. Fu and L. Xu contributed equally to this study.
The authors are grateful for the financial support from the National Public Welfare Industry Projects of China (no. 201210010, 201210092, and 2012104019), the National Natural Science Foundation of China (no. 31000357), the Hangzhou Programs for Agricultural Science and Technology Development (no. 20101032B28), and the Key Scientific and Technological Innovation Team Program of Zhejiang Province (no. 2010R50028).
L. Xu, D. H. Deng, and C. B. Cai, “Predicting the age and type of Tuocha tea by Fourier transform infrared spectroscopy and chemometric data analysis,” Journal of Agricultural and Food Chemistry, vol. 59, pp. 10461–10469, 2011.
J. K. Lin, C. L. Lin, Y. C. Liang, S. Y. Lin-Shiau, and I. M. Juan, “Survey of catechins, gallic acid, and methylxanthines in green, oolong, pu-erh, and black teas,” Journal of Agricultural and Food Chemistry, vol. 46, no. 9, pp. 3635–3642, 1998.
G. Zhou, L. Zhu, T. Ren, L. Zhang, and J. Gu, “Geochemical characteristics affecting the cultivation and quality of Longjing Tea,” Journal of Geochemical Exploration, vol. 55, no. 1–3, pp. 183–191, 1995.
L. Xu, P. T. Shi, Z. H. Ye et al., “Rapid geographical origin analysis of pure West Lake lotus root powder (WL-LRP) by near-infrared spectroscopy combined with multivariate class modeling techniques,” Food Research International, vol. 49, pp. 771–777, 2012.
Q. Fu, J. Wang, G. Lin, H. Suo, and C. Zhao, “Short-wave near-infrared spectrometer for alcohol determination and temperature correction,” Journal of Analytical Methods in Chemistry, vol. 2012, Article ID 728128, 7 pages, 2012.
H. Y. Fu, S. Y. Huan, L. Xu et al., “Construction of an efficacious model for a nondestructive identification of traditional Chinese medicines Liuwei Dihuang pills from different manufacturers using near-infrared spectroscopy and moving window partial least-squares discriminant analysis,” Analytical Sciences, vol. 25, no. 9, pp. 1143–1148, 2009.
L. Xu, C. B. Cai, H. F. Cui, Z. H. Ye, and X. P. Yu, “Rapid discrimination of pork in Halal and non-Halal Chinese ham sausages by Fourier transform infrared (FTIR) spectroscopy and chemometrics,” Meat Science, vol. 92, no. 4, pp. 506–510, 2012.
A. Savitzky and M. J. E. Golay, “Smoothing and differentiation of data by simplified least squares procedures,” Analytical Chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.
R. J. Barnes, M. S. Dhanoa, and S. J. Lister, “Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra,” Applied Spectroscopy, vol. 43, no. 5, pp. 772–777, 1989.
S. Van Aelst, E. Vandervieren, and G. Willems, “A Stahel-Donoho estimator based on huberized outlyingness,” Computational Statistics & Data Analysis, vol. 56, pp. 531–542, 2012.
R. W. Kennard and L. Stone, “Computer aided design of experiments,” Technometrics, vol. 11, pp. 137–148, 1969.