Near-infrared (NIR) spectra of apple samples were subjected to principal component analysis (PCA) and the successive projections algorithm (SPA) for variable selection. Three pattern recognition methods, backpropagation neural network (BPNN), support vector machine (SVM), and extreme learning machine (ELM), were applied to establish models for distinguishing apples of different varieties and geographical origins. Experimental results show that the ELM models identified apple variety and geographical origin better than the others. In particular, the SPA-ELM model reached 98.33% identification accuracy on the calibration set and 96.67% on the prediction set. This study suggests that it is feasible to identify apple variety and cultivation region by using NIR spectroscopy.

1. Introduction

China is one of the main fruit-producing and fruit-consuming countries in the world. In recent years, both the area under fruit cultivation and the fruit yield have increased continuously. Because apple has the advantages of high nutritional value, good storability, and a short supply cycle, it has become one of the four major fruits in the world. In 2014, the apple cultivation area and production in China reached 22,724 km² and 40.93 million tons, respectively, according to the Food and Agriculture Organization (FAO) report. The overall quality of apples is determined by external attributes (such as size, colour, and texture) and internal attributes (such as soluble solid content (SSC), total acid content (TAC), and vitamins). These attributes are greatly affected by the variety and cultivation region of the apple. Different apples differ in firmness, crispness, juiciness, and taste, and apples that are firm, crisp, juicy, and tasty are more popular among consumers [1]. Moreover, apples of different varieties or geographical origins can easily be mixed during harvesting and marketing. Therefore, effective and reliable technologies that can identify apple variety and geographical origin are urgently demanded by sellers and consumers.

To determine the variety of fruits, many methodologies have been explored, such as deoxyribonucleic acid (DNA) analysis [2, 3], gas chromatography (GC) analysis [4], and amino acid composition analysis [5]. However, these methods usually require a considerable amount of time, manual work, and sample preparation [6].

NIR spectroscopy, as a rapid nondestructive detection method, has been proven effective in determining the internal quality attributes of various agricultural products and foods [7–11]. Compared with conventional chemical and physical analytical technologies, NIR spectroscopy has the advantages of easy operation, fast detection, and nondestructive measurement. NIR spectral data contain information on the relative proportions of C-H, N-H, and O-H bonds, which are the main structural constituents of organic molecules [12, 13]. Many studies have explored the application of NIR spectroscopy to SSC measurement of fruits, such as apple [14, 15], kiwifruit [16], and grape [17].

The main objective of this study is to explore the feasibility of applying NIR spectroscopy to distinguish apple variety and geographical origin. NIR diffuse reflectance spectra of 6 different kinds of apple (Fuji apple from Shandong, Fuji apple from Shaanxi, Red Star apple from Shandong, Red Star apple from Shaanxi, Gala apple from Shandong, and Gala apple from Shaanxi) were collected. Two variable selection algorithms were used, and three modelling methods were applied to establish models for distinguishing apples of different varieties and geographical origins. The capabilities of the different models to identify apple variety and cultivation region were also investigated.

2. Materials and Methods

2.1. Sample Preparation

A total of 300 apples were purchased from two main local markets (Table 1): 100 Fuji apples (50 from Shandong and 50 from Shaanxi), 100 Red Star apples (50 from Shandong and 50 from Shaanxi), and 100 Gala apples (50 from Shandong and 50 from Shaanxi). The retailers guaranteed the apples' varieties and cultivation regions, and the skins of the samples were smooth and free of visible defects. Before measurement, all samples were placed in airtight polyethylene bags and stored in a refrigerator at 4 ± 1°C for 2 days. After storage, the apples were removed from the refrigerator, washed with clean water, wiped dry, and kept at room temperature (24 ± 2°C) for about 3 hours before being measured.

In this study, 40 Fuji apples (20 from Shandong and 20 from Shaanxi), 40 Red Star apples (20 from Shandong and 20 from Shaanxi), and 40 Gala apples (20 from Shandong and 20 from Shaanxi) were selected randomly as the prediction set, and the remaining 180 apples were used as the calibration set. The calibration set was used to establish models, and the prediction set was used to evaluate the performances of the established models.
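The random, class-balanced split described above can be sketched as follows. This is only an illustration in Python (the study itself used Matlab); the function name `stratified_split`, the class label strings, and the random seed are assumptions introduced here and do not reproduce the authors' actual sample assignment.

```python
import random
from collections import defaultdict


def stratified_split(labels, n_prediction_per_class, seed=0):
    """Return (calibration_idx, prediction_idx), drawing a fixed number of
    prediction samples at random from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    calibration, prediction = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        prediction.extend(indices[:n_prediction_per_class])
        calibration.extend(indices[n_prediction_per_class:])
    return sorted(calibration), sorted(prediction)


# Six classes (3 varieties x 2 origins) with 50 apples each, as in Section 2.1.
classes = [f"{variety}-{origin}"
           for variety in ("Fuji", "RedStar", "Gala")
           for origin in ("Shandong", "Shaanxi")]
labels = [c for c in classes for _ in range(50)]

calibration_idx, prediction_idx = stratified_split(labels, n_prediction_per_class=20)
print(len(calibration_idx), len(prediction_idx))
```

Drawing 20 apples per class keeps the six classes equally represented in both sets, which gives the 180/120 partition stated in the text.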

2.2. Spectra Acquisition

The NIR spectra were collected in diffuse reflectance mode using an Ocean Optics USB2000-VIS-NIR-ES spectrometer (Ocean Optics, USA) equipped with HL-2000 tungsten halogen light sources (Ocean Optics, USA) and optical fibre reflection probes (QR600-7-VIS-NIR, Ocean Optics, USA). The NIR diffuse reflectance spectra were collected between 400 and 1021 nm with an interval of about 0.33 nm, which resulted in 1888 variables for each spectrum. The Ocean View software (Ocean Optics, USA) was used to collect and transform the spectra. All measurements were undertaken at room temperature (24 ± 2°C).

Before the spectra were measured, the spectrometer was turned on for at least 1 hour to warm up. During measurement, the NIR optical fibre probe was kept close to the surface of the apple in order to avoid surface reflectance and air interference. For each intact apple sample, diffuse reflectance spectra were obtained at 15 different points chosen randomly along the equator, and each point was scanned 10 times. Thus, a total of 150 scans were averaged to represent the spectral data of one apple, and this average was used in further data analysis.

2.3. Principal Component Analysis

PCA is an effective data mining technique and has been widely used for dealing with spectral data [18]. The principle of PCA is to reduce the data dimension by orthogonally transforming the original multidimensional data into a set of linearly uncorrelated variables, which lowers the chance of overfitting and speeds up the training procedure. These variables are called principal components (PCs). For a small data set, the PCs are computed simultaneously from a single matrix decomposition. The first PC accounts for as much of the variability in the original data as possible, and each subsequent component has the highest variance possible while being orthogonal to the preceding ones, so its variance does not exceed theirs. PCA can transform high-dimensional data into lower-dimensional data with little loss of the original information [19]. In most cases, a number of PCs much smaller than the number of original variables still describes most of the variance of the original data.
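As an illustration of computing all PCs from a single matrix decomposition, the following minimal sketch uses the singular value decomposition of the mean-centred data in Python with NumPy. The function name `pca_svd` and the random placeholder matrix are assumptions made here for illustration; they are not the authors' Matlab code or the apple spectra.

```python
import numpy as np


def pca_svd(X, n_components):
    """PCA of X (samples x variables) through one SVD of the centred data.

    Returns the scores, the loadings, the explained-variance ratios of the
    retained components, and the column means used for centring.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s**2 / np.sum(s**2)
    scores = Xc @ Vt[:n_components].T
    return scores, Vt[:n_components], ratio[:n_components], mean


# Placeholder data only (30 samples x 50 variables), not measured spectra.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 50))

scores, loadings, ratio, mean = pca_svd(X_demo, n_components=6)
print(scores.shape)
print(np.round(np.cumsum(ratio), 4))  # accumulative contribution rate
```

The score columns are mutually uncorrelated and their variances decrease from the first PC onward, which is the property the classification models rely on when only the leading PCs are used as inputs.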

2.4. Successive Projections Algorithm

SPA is a forward variable selection method whose purpose is to select wavelengths whose information is minimally redundant [20]. In SPA, the first variable of the spectral data is used as the initial variable, and a new variable is incorporated at each iteration until the preset number N is reached. The procedure is then repeated with the second variable of the spectral data as the initial variable, and so on, until every variable has been used as the initial one. SPA has been successfully applied to select variables in spectra [21, 22].
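The projection step can be sketched as follows, based on the commonly described form of SPA: at each iteration, every column is projected onto the subspace orthogonal to the last selected column, and the column with the largest remaining norm is added. The function name `spa_chain` and the tiny demonstration matrix are assumptions for illustration only; the sketch is written in Python with NumPy and is not the authors' implementation.

```python
import numpy as np


def spa_chain(X, start, n_select):
    """Build one SPA chain of n_select column indices, starting at `start`.

    X is samples x variables; n_select must not exceed the rank of X.
    """
    Xp = X.astype(float).copy()  # columns are progressively projected
    selected = [start]
    for _ in range(n_select - 1):
        ref = Xp[:, selected[-1]].copy()
        # Remove from every column its component along the last selected one.
        Xp -= np.outer(ref, ref @ Xp) / (ref @ ref)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # already chosen columns cannot be reselected
        selected.append(int(np.argmax(norms)))
    return selected


# Toy matrix: column 1 is collinear with column 0, column 2 is not.
X_toy = np.array([[1.0, 2.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 2.0, -1.0],
                  [1.0, 2.0, 0.5]])

# One chain per possible initial variable, as described in the text.
chains = [spa_chain(X_toy, start=k, n_select=2) for k in range(X_toy.shape[1])]
print(chains)
```

Starting from column 0, the collinear column 1 is projected to zero and is therefore skipped in favour of column 2, which shows how the projections suppress redundant variables. The candidate chains would then be compared with a validation criterion.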

For a classification problem, the number of extracted wavelengths can be determined by the validation cost $G$ [23]:

$$G = \frac{1}{N_v} \sum_{i=1}^{N_v} g_i,$$

where $N_v$ is the number of validation samples, and $g_i$ is defined as follows:

$$g_i = \frac{r^2\left(x_i, \mu_{I_i}\right)}{\min_{I_j \neq I_i} r^2\left(x_i, \mu_{I_j}\right)},$$

where the numerator is the squared Mahalanobis distance between $x_i$ (the vector of the $i$-th validation sample at the wavelengths selected by SPA) and $\mu_{I_i}$ (the mean of the $I_i$-th variety apples at each selected wavelength), and the denominator is the squared Mahalanobis distance between $x_i$ and the mean of the closest wrong variety. $g_i$ should be as small as possible, which means that $x_i$ is close to its true variety and distant from the other varieties.
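A minimal sketch of a validation cost of this kind is given below in Python with NumPy. Several points are assumptions made here rather than details stated in the paper: the Mahalanobis distance uses a pooled within-class covariance estimated from the calibration samples, the function name `validation_cost` is hypothetical, and the two-class synthetic data only serve to show that an informative variable yields a lower cost than a pure-noise variable.

```python
import numpy as np


def validation_cost(X_cal, y_cal, X_val, y_val):
    """Mean ratio of the squared Mahalanobis distance to the true class mean
    over the squared distance to the closest wrong class mean."""
    classes = np.unique(y_cal)
    means = np.array([X_cal[y_cal == c].mean(axis=0) for c in classes])
    # Pooled within-class covariance (assumption, see the lead-in).
    centred = X_cal - means[np.searchsorted(classes, y_cal)]
    cov = centred.T @ centred / (len(X_cal) - len(classes))
    cov_inv = np.linalg.pinv(cov)

    ratios = []
    for x, label in zip(X_val, y_val):
        diff = means - x
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        is_true = classes == label
        ratios.append(d2[is_true][0] / d2[~is_true].min())
    return float(np.mean(ratios))


# Synthetic data: variable 0 separates the two classes, variable 1 is noise.
rng = np.random.default_rng(0)


def make_set(n_per_class):
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, 2))
    X[:, 0] += 8.0 * y
    return X, y


X_cal, y_cal = make_set(30)
X_val, y_val = make_set(15)

g_informative = validation_cost(X_cal[:, [0]], y_cal, X_val[:, [0]], y_val)
g_noise = validation_cost(X_cal[:, [1]], y_cal, X_val[:, [1]], y_val)
print(g_informative, g_noise)
```

In a wavelength selection setting, the same function would be evaluated on each candidate subset of columns, and the subset with the smallest cost would be preferred.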

2.5. Backpropagation Neural Network

The BPNN model is considered the most widely used type of artificial neural network (ANN) model [24]. BPNN is characterized by forward propagation of the signals and backpropagation of the errors. The general structure of a BPNN has three layers: an input layer, a hidden layer, and an output layer (Figure 1). The weights and thresholds in the architecture are adjusted automatically according to the backpropagated errors.

2.6. Support Vector Machine

The SVM, proposed by Vapnik and his colleagues, is a statistical method based on Vapnik-Chervonenkis (VC) dimension theory and the structural risk minimization (SRM) principle [25–27]. The aim of SVM is to obtain the best compromise between the complexity and the learning ability of the model based on limited sample information. SVM has been applied as a discrimination method in a considerable number of NIR spectroscopy studies [28–30]. In this experiment, the radial basis function (RBF) was used as the kernel function for the SVM models because RBF could reduce the computational complexity of the training process. The two main parameters of an SVM with an RBF kernel are the cost parameter, which determines the trade-off between minimizing the training error and minimizing the model complexity, and the gamma parameter, which implicitly defines the nonlinear mapping from the input space to a higher-dimensional space. When establishing SVM models, the most important step is optimizing these two parameters [28].
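To make the role of the gamma parameter concrete, the standard RBF kernel, K(x, z) = exp(-gamma * ||x - z||^2), can be written as a short Python function. The function name `rbf_kernel` and the example vectors are illustrative assumptions; the cost parameter does not appear here because it belongs to the SVM optimization problem rather than to the kernel.

```python
import math


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x = [0.0, 0.0]
z = [1.0, 1.0]  # squared distance to x is 2

same_point = rbf_kernel(x, x, gamma=0.5)
small_gamma = rbf_kernel(x, z, gamma=0.5)
large_gamma = rbf_kernel(x, z, gamma=5.0)
print(same_point, small_gamma, large_gamma)
```

A larger gamma makes the similarity between two distinct samples decay faster, so each training sample influences a smaller neighbourhood. This is why gamma has to be tuned together with the cost parameter.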

2.7. Extreme Learning Machine

ELM, proposed by Huang [31], is a single-hidden-layer feedforward neural network (SLFN). ELM has the advantages of fast learning with good generalization performance, and it overcomes difficulties that are common in traditional gradient-based learning algorithms. The general structure of ELM is composed of three layers: an input layer, a hidden layer, and an output layer. Selecting the number of neurons in the hidden layer is important for establishing a reliable ELM-based model because this number greatly influences the robustness and performance of the model. ELM assigns the weights between the input layer and the hidden layer randomly according to a continuous probability density function, without iterative tuning, and determines the weights between the hidden layer and the output layer analytically.
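The training scheme can be sketched in a few lines of Python with NumPy. The sketch makes several assumptions that are not stated in the paper: input weights and biases are drawn from a uniform distribution on [-1, 1], the hidden layer uses a sigmoid activation, class labels are one-hot encoded, and the output weights are obtained with the Moore-Penrose pseudoinverse. The class name `ELMClassifier` and the synthetic two-class data are hypothetical.

```python
import numpy as np


class ELMClassifier:
    """Minimal extreme learning machine for classification (illustrative)."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the randomly weighted hidden layer.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        targets = (y[:, None] == self.classes_[None, :]).astype(float)
        # Input weights and biases are random and are never tuned.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        # Output weights follow analytically from a least-squares solution.
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]


# Synthetic, well separated two-class data (not apple spectra).
rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 30)
X_demo = rng.normal(scale=0.5, size=(60, 2)) + 4.0 * y_demo[:, None]

model = ELMClassifier(n_hidden=20).fit(X_demo, y_demo)
train_accuracy = float(np.mean(model.predict(X_demo) == y_demo))
print(train_accuracy)
```

Because only one linear least-squares problem is solved, training is fast, and the number of hidden neurons (`n_hidden`) is the main setting left to choose.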

2.8. Data Processing

In this study, all calculations were performed in MATLAB R2014b (The MathWorks Inc., USA) under Windows 7 with a 3.6 GHz CPU and 4 GB of memory. Before the multivariate analysis, the raw spectral data were converted to absorbance values. To reduce the noise, the Savitzky–Golay smoothing method was used with a segment size of 5. Then multiplicative scatter correction (MSC), which corrects additive and multiplicative effects [32], was applied to the denoised data. PCA and SPA were each implemented to extract essential information from the whole spectral region, and the resulting smaller sets of variables were used as the inputs for constructing models. Three different algorithms, BPNN, SVM, and ELM, were used to establish models to identify apple varieties and geographical origins. Figure 2 shows the flowchart of this procedure.

In this experiment, the performance of the models was evaluated through identification accuracy. The identification accuracy was defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$ indicates true positive, meaning that a positive sample was classified as positive; $TN$ indicates true negative, meaning that a negative sample was classified as negative; $FP$ indicates false positive, meaning that a negative sample was classified as positive; and $FN$ indicates false negative, meaning that a positive sample was classified as negative. The higher the identification accuracy, the better the performance of the model.
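The accuracy calculation can be illustrated with a short Python sketch. The function names are hypothetical. For a six-class problem, the fraction of correctly identified samples is used here as the multiclass counterpart of the binary definition; this is an assumption about how the definition was applied. The example labels are constructed only to show the arithmetic: 116 correct identifications out of 120 prediction samples would give 96.67%, which is consistent with the reported value but is not taken from the authors' results tables.

```python
def binary_accuracy(tp, tn, fp, fn):
    """Accuracy from the four confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)


def identification_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Constructed example: 120 samples in 6 classes with 4 misidentifications.
y_true = [c for c in range(6) for _ in range(20)]
y_pred = list(y_true)
for i in (0, 25, 50, 75):
    y_pred[i] = (y_true[i] + 1) % 6  # deliberately wrong class

accuracy_percent = round(100 * identification_accuracy(y_true, y_pred), 2)
print(accuracy_percent)
```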

3. Results and Discussion

3.1. NIR Spectra of Apple Samples

Figure 3 shows the raw diffuse reflectance spectra of the apple samples between 400 and 1021 nm. The spectra of different samples cross over and overlap in many places. The overall trends of the 300 apple spectra were similar, and the absorbance spectra showed no significant differences among varieties and geographical origins. Therefore, it was difficult to discriminate the variety and geographical origin directly from the raw diffuse reflectance spectra, and models had to be established for identifying the apple variety and geographical origin.

3.2. Characteristic Wavelength Selection

A large number of variables in the input vector may introduce irrelevant information and deteriorate the performance of the models. In this study, variable selection algorithms were therefore applied to the spectral data that had been preprocessed by Savitzky–Golay smoothing and MSC. For PCA, Table 2 lists the contribution rates and accumulative contribution rates of the first 8 PCs. The first PC provided the largest contribution (87.47%), and the second PC contributed 7.14%, so the first two PCs together represented more than 94% of the variance of the original spectral data. Because involving more variables in model construction may slow down the calculation, the first 6 PCs, whose accumulative contribution rate reached 99.27%, were selected as the inputs for developing models, based on the criterion that further PCs increase the explained variance by less than 0.25% [33].

For SPA, Figure 4 presents the change of the G value with the number of wavelengths. Overall, the G value decreased as the number of wavelengths increased. The best number of wavelengths was taken as the point at which the decrease rate of the G value fell below 5% and the decrease itself was less than 0.01 [34]. In this experiment, the best number of wavelengths was 10. The selected wavelengths were 530.36, 547.81, 564.45, 587.67, 606.53, 624.58, 652.45, 665.77, 677.99, and 996.84 nm.

3.3. Classification Models

For the BPNN models, the optimal number of neurons in the hidden layer was determined by leave-one-out cross validation in the range from 10 to 50 (larger numbers of hidden neurons brought no performance improvement in the experiment). Table 3 lists the optimal numbers of neurons in the hidden layer of the BPNN models. The results for apple variety and geographical origin identification by the BPNN models are given in Table 4. The average performance was poor: only the identification accuracy of the SPA-BPNN model on the calibration set exceeded 50%.

For the SVM models, the optimal values of the cost and gamma parameters were selected through leave-one-out cross validation. Table 5 lists the optimal values of the cost and gamma parameters of the SVM models. Table 6 gives the identification accuracy for apple variety and geographical origin obtained with the SVM models. The PCA-SVM model outperformed the SPA-SVM model, with an average accuracy of 98.89% on the calibration set and 85.83% on the prediction set.

When the variables extracted by PCA and SPA were used as the inputs for the ELM models, the best numbers of neurons in the hidden layer were as listed in Table 7. The identification results of the ELM models are shown in Table 8. All ELM models could efficiently distinguish apple variety and geographical origin, with accuracy of more than 90% for both the calibration set and the prediction set. In particular, the identification accuracy of the SPA-ELM model exceeded 95% for both the calibration set and the prediction set.

A comparison of all the established models shows that the ELM models performed better than the BPNN and SVM models, which suggests that a suitable intelligent algorithm can improve prediction performance. The PCA-SVM model identified Red Star apples from Shandong and Gala apples from Shandong with 100% accuracy, while the SPA-ELM model identified Fuji apples from Shandong, Fuji apples from Shaanxi, and Red Star apples from Shandong with 100% accuracy.

4. Conclusion

In this study, the experimental results indicated that it is possible to develop a nondestructive technique to discriminate apple variety and cultivation region using NIR spectroscopy. PCA and SPA were applied to conduct variable selection of the spectra, and BPNN, SVM, and ELM were used to establish models for distinguishing apple variety and geographical origin from NIR diffuse reflectance spectra. The results showed that it was feasible to identify apple variety and geographical origin using NIR spectroscopy in the range between 400 and 1021 nm. The ELM models gave better identification accuracies than the BPNN and SVM models, and both PCA and SPA provided good performance when combined with ELM. In particular, SPA-ELM offered 98.33% accuracy for the calibration set and 96.67% accuracy for the prediction set. Therefore, this approach would be suitable and effective for variety and geographical origin identification of apple fruits.

Data Availability

The spectral data used to support the findings of this study are included within the supplementary material.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This study was supported by the CERNET Innovation Project (Grant no. NGII20150603), the Science and Technology Innovation Project of Foshan City, China (Grant no. 2015IT100095), the Fundamental Research Funds for the Central Universities (Grant no. lzujbky-2016-br03), and the Science and Technology Planning Project of Guangdong Province, China (Grant no. 2016B010108002).

Supplementary Materials

The supplementary material contains the spectral data used to support the findings of this manuscript. In the data file, column A gives the apple variety, column B gives the sample ID, and the other columns contain the spectral data. The first row gives the wavelength values.