Abstract

This paper aims at developing a rapid and nondestructive method for analyzing the shelf life of preserved eggs (pidan) by near infrared (NIR) spectroscopy and nonlinear multivariate calibration. A major concern with a nonlinear model is that the noncomposition-correlated spectral variations among pidan objects of different batches and production dates would unnecessarily increase model complexity and cause overfitting and degradation of prediction. To reduce the negative influence of unwanted spectral variations, stacked least squares support vector machine (LS-SVM) with an ensemble of 62 commonly used preprocessing methods is proposed to automatically optimize data preprocessing and develop the nonlinear model. The analysis results indicate that stacked LS-SVM can obtain stable calibration model, and the prediction accuracy is improved compared with models with single-preprocessing methods. Since LS-SVM is much faster than its ordinary counterparts, stacked LS-SVM with ensemble preprocessing can be performed within an acceptable computational time. When the objects and spectral variations are very complex, the proposed method can provide a useful tool for data preprocessing and nonlinear multivariate calibration.

1. Introduction

Preserved egg (pidan) is a popular traditional food in South and East Asian countries, including China, South Korea, Thailand, and Japan. Although its nutritional value is slightly decreased compared with fresh eggs, it is favored and consumed in at least 30 different countries globally for its pleasant and particular taste, flavor, texture, and especially its extended shelf life compared with unprocessed eggs [1]. It is commonly recognized that the storage of pidan should not exceed six months. For pidan out of the original pickle, besides the degradation of food quality, the risk of contamination by microbes increases with storage, which has become a major safety concern with pidan products [2]. Therefore, rapid and reliable methods for analysis of pidan shelf life are highly required to perform routine analysis of pidan products at the market.

As an important quality parameter, the shelf life of pidan has a very complex relationship with its internal chemical components, and it is still difficult to directly estimate the shelf life using the results of chemical or biological analyses [3]. Near infrared (NIR) spectroscopy combined with multivariate calibration methods has been widely used in rapid analysis of food quality parameters [4]. By developing the relationship between the measured NIR spectra and the reference values of sample parameters, future objects can be analyzed and predicted. For pidan, NIR reflectance techniques combined with chemometrics can achieve rapid and nondestructive analysis, which is very suitable because there are usually many batches of samples to be analyzed.

To model the complex relationship between the measured NIR spectra and pidan shelf life, nonlinear multivariate calibration methods like artificial neutral net (ANN) [5], kernel partial least squares (KPLS) [6, 7], and support vector machines (SVMs) [8, 9] are commonly used. Compared with linear models like principal components regression (PCR) [10] and partial least squares (PLS) [11], though nonlinear models have better model flexibility and accuracy when the data nonlinearity is significant, they are generally more complex and have a higher risk of overfitting. To develop a practical calibration model for analysis of pidan shelf life, sufficient and representative objects are collected to train the model. Therefore, proper data preprocessing is necessary to remove the noncomposition-correlated spectral variations among objects of different batches and with a wide range of shelf life, which would lead to overfitting of a nonlinear model.

Considering the lack of sufficient prior information and an incomplete knowledge of the raw NIR data, data preprocessing in multivariate calibration still depends on simple trial and error. Selecting the suitable preprocessing methods might be data dependent and require special expertise and experience of the chemometricians. An ensemble preprocessing method [12, 13] has been recently proposed by combining different single-preprocessing methods, such as smoothing and taking derivatives and standard normal variate (SNV) [14] with stacked regression [1517]. This method can automatically select and weight the most suitable preprocessing methods and has been demonstrated to be effective in linear PLS calibration of NIR data [13].

In this paper, stacked least squares support vector machine (LS-SVM) [18, 19] with ensemble preprocessing is proposed to simultaneously optimize data preprocessing and develop a nonlinear calibration model. The proposed chemometric method and the results of data analysis will be presented in detail in the following subsections.

2. Theory and Method

2.1. Pidan Samples and NIR Analysis

Three brands of pidan objects are analyzed, and for each brand the NIR spectra are measured at certain time intervals. All the pidan objects are made of fresh duck eggs with unleaded processing methods. The detailed information concerning samples is presented in Table 1. Each pidan object is washed vigorously with deionized water and is left to dry completely before NIR analysis. During storage, the samples are stored in a cool, dark, and dry place with integral packaging. The NIR spectra are measured in the diffuse reflectance modeon a Bruker-TENSOR37 FTIR spectrometer (Bruker Optics, Ettlingen, Germany). A PbS detector is used with an internal gold background as the reference. The equatorial region of an egg directly contacts the NIR probe. For each spectrum, 64 scans are performed, and more scans cannot improve the signal significantly. Considering the possible differences in the internal composition of an egg, the average of three spectra measured around the equatorial region of each egg is used as the raw spectrum. The wavelength range is from 12,000 to 4,000?cm-1 with a resolution of 8?cm-1.

2.2. LS-SVM

Data analysis is performed on MATLAB 7.10.0 (MathWorks, Sherborn, MA). The toolbox LS-SVMlab v1.8 [19] is used for basic LS-SVM algorithms. Here a brief introduction of LS-SVM is given, and, for a detailed description of LS-SVM algorithms, one can refer to [18, 19]. Consider a training set of objects with input features and response values , where and are the -dimensional and one-dimensional vector spaces, respectively. Nonlinear regression by LS-SVM can be written as where is the regression coefficient vector, is the model error, is the constant term, and the function maps the original data into a nonlinear space. LS-SVM solution can be obtained by solving the following optimization problem: where is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the learning errors.

In this paper, the frequently used Gaussian radial basis function (RBF) is used as nonlinear transformation: where is the kernel width parameter that controls the nonlinear nature of the regression. For LS-SVM, the two model parameters, and s, should be optimized simultaneously.

2.3. Stacked LS-SVM with Ensemble Preprocessing

Automatic selection and weighting of different preprocessing methods can be achieved by combining submodels with differently preprocessed NIR spectra. The ensemble model is the linear combination of submodels: where is the final prediction of response values by ensemble model, is the prediction of a submodel, and is the weighting vector. In this paper, 62 different preprocessing methods as listed in Table 2 (no. 1–62), including smoothing and taking first-order derivative (D1) and second-order derivative (D2); standard normal variate (SNV) and their combinations are used to preprocess the raw data, and 62 submodels are developed. Smoothing and taking D1 and D2 spectra are performed using the method proposed by Savitzky and Golay [20]. These methods are preferred because they are simple to compute and frequently used.

Monte Carlo cross-validation (MCCV) [21] stacked regression is used to derive the weighting vector by nonnegative least squares regression (NNLSR): where is the stacking of all the reference response values of the leave-out objects in MCCV and () contains the corresponding predicted values by submodels. This method can obtain the lowest root mean squared error of MCCV (RMSEMCCV) among all possible linear combinations under nonnegative constraints. Moreover, by multiple predictions of MCCV, the length of is very large, so tends to be sparse or most of its elements will be zeros, which means that just a few proper preprocessing methods are included and weighted in the final ensemble model.

The above-stacked LS-SVM with ensemble preprocessing can be developed in two steps. Firstly, a set of submodels are developed based on different preprocessing methods as in Table 2 (no. 1–62). In this paper, simplex method is performed to search for and s that minimize the error of cross-validation for each submodel. Secondly, MCCV is performed for each submodel with the selected parameters, and the submodel weights are deduced as in (5).

3. Results and Discussion

The raw spectra, first-order derivative (D1), second-order derivative (D2), and SNV spectra of the pidan objects are plotted in Figure 1. As seen from Figure 1, the raw NIR spectra of pidan samples with different shelf life values are very similar in the wavelength range of 4,000–12,000?cm-1. Some of the peaks can be assigned as 6,000–7,000?cm-1 (overlapping of the first overtone of O–H stretching and N–H stretching), 5,160?cm-1 (the combination of the baseband of O–H stretching and the baseband of H–O–H deformation), 4,870?cm-1 (combination of N–H stretching and deformation of peptide groups), and 4,600?cm-1 (combination of C=O stretching and deformation of peptide groups). Although it is difficult to attribute each peak to certain chemical components, NIR can characterize the changes of some components like proteins and water involved in storage. Taking derivative spectra can significantly improve the peak resolution and highlight some detailed frequencies in the signals. SNV can remove some of the spectral variations, but the preprocessed spectra still have some backgrounds retained. Therefore, smoothing and taking derivatives coupled with SNV are also included as options of data preprocessing.

The Stahel-Donoho estimate (SDE) of outlyingness [22, 23] is performed for outlier diagnosis by projecting each high-dimensional object onto random directions for many times and estimating the robust location and scatter of projections. In this paper, the number of random projections is set to be 500. Figure 2 demonstrates the SDE plot of 884 measured pidan objects based on the raw NIR spectra. According to the rule, an object with SDE value over 3 can be detected as an outlier. Eight outliers are detected and excluded from calibration and prediction.

The DUEPLX algorithm [24] is then performed on the raw NIR spectra of the remaining 876 objects to form representative data sets. 657 objects are used for training and 219 objects for prediction. To compare the performance of different data preprocessing methods on calibration, all the models are trained and evaluated with the same training and test sets.

The parameters of LS-SVM submodels are optimized with simplex method to minimize the mean squared errors (MSE) of 20-fold cross-validation. The weights of submodels are estimated by MCCV stacked regression as in (5). Considering that the size of training data set is large (), for MCCV, each time, a comparatively high percent (40%) of the training samples are left out for prediction, and the number of resampling is 50. With different preprocessing methods, LS-SVM and stacked models are developed, and the results are demonstrated in Table 2. The computation time for stacked LS-SVM with ensemble preprocessing is about 2 hours on our computer (Intel Core 2 Duo CPU E7500 @ 2.93?GHz, 1.96?GB memory). Considering the size of the original data (876 by 2074), the computational time is acceptable.

For submodels, submodel 13 in Table 2 has an abnormally large root mean squared errors of MCCV (RMSEMCCV) and root mean squared errors of prediction (RMSEP). This might be attributed to the underfitting caused by a large number of data points (9) and a low polynomial order . It can be seen that, with single preprocessing, the best predictions with RMSEP of 7.3 days are obtained by smoothing with 15 points and third-order polynomial. All the single-preprocessing methods do not significantly improve the prediction accuracy compared with the raw data (RMSEP = 7.6 days).

The ensemble preprocessing method obtains a reduced RMSEP of 6.2 days. The results of ensemble model are demonstrated in Figure 3. As seen from Figure 3, the distribution of submodel weights indicates that the ensemble model includes all types of single-preprocessing methods, such as smoothing, D1, D2, smoothing + SNV, D1 + SNV, and . The results demonstrate that ensemble model can optimize data preprocessing by taking advantage of different preprocessing methods. Though, for all the models, RMSEMCCV is higher than RMSEP, RMSEMCCV and RMSEP show a similar tendency. Therefore, by minimizing RMSEMCCV, the ensemble model can obtain stable predictions of test objects.

4. Conclusions

Stacked LS-SVM regression with ensemble preprocessing is proposed as a data preprocessing and nonlinear calibration method. For NIR analysis of pidan shelf life, this method can automatically combine and weight a set of 62 single-preprocessing methods. The prediction accuracy was improved compared with the results using raw spectra and other single-preprocessing methods, including smoothing and taking derivatives, SNV, and their combinations. Since LS-SVM is much faster than its counterparts, the proposed stacked LS-SVM algorithm can be performed within an acceptable computational time. When the objects are very complex and it is necessary to reduce unwanted variations and the risk of overfitting, this method can provide a useful tool for data preprocessing and nonlinear multivariate calibration.

Acknowledgments

This work is financially supported by the National Public Welfare Industry Projects of China (nos. 201210010, 201210092, and 2012104019), the National Natural Science Foundation of China (no. 31000357), Hangzhou Programs for Agricultural Science and Technology Development (no. 20101032B28), and the Key Scientific and Technological Innovation Team Program of Zhejiang Province (no. 2010R50028).