To eliminate outliers effectively in biological near-infrared spectra and to achieve high-accuracy predictive modeling, this paper proposes a novel outlier elimination method based on variance and leverage analysis. First, the characteristics of near-infrared spectra are summarized; then the residual sample X-variance, the leverage, and the residual sample Y-variance are combined as a divergence measurement. We further compare the proposed method with variance, Mahalanobis distance, and Hotelling statistical analysis; the experimental results demonstrate that the proposed method achieves competitive outlier elimination and better time complexity and accuracy. The proposed method can also be adopted for other outlier elimination tasks.

1. Introduction

In the past few years, food quality and safety problems have become increasingly serious in China with the rapid growth of the national economy [1–5]. The demand for rapid quality detection is rising in industrial manufacturing, agricultural production, and commerce. Near-infrared spectroscopy has attracted close attention as a new approach to rapid detection and has developed rapidly in agriculture, food, medicine, the chemical industry, and other fields because of its high analysis speed, efficiency, freedom from pollution, and ease of online detection [6–15]. However, as an indirect analysis technology, near-infrared spectroscopy needs to establish an analysis model between the spectral information and the properties of the samples, work out the correlation between the spectral information and the sample properties, and use the resulting calibration model to predict unknown samples in order to achieve rapid detection [16–20]. Therefore, the accuracy of the selected data determines whether ideal prediction results can be achieved.

However, during spectral data collection, experimental errors or uneven sample collection and classification are likely to produce samples that are not representative of the model, that is, outlier samples [21–25]. Such stray samples affect, and may even change, the distribution trend of the overall data, and thus degrade the accuracy of the calibration model. Removing outlier samples quickly and efficiently is therefore key to establishing the calibration model. At present, several commonly used methods rely on multivariate statistical analysis to judge whether a statistic exceeds a specific threshold [26–28]. These methods have some effect, but they are not reliable once the tested material changes and sometimes cannot give satisfactory results.

In this paper, after analyzing several existing methods, we put forward a 3D outlier removal method based on X-Y variance versus leverage, where X represents the spectral information and Y the chemical values of the samples. The method uses the residual sample X-variance, the leverage, and the residual sample Y-variance as three axes and judges the distribution of outlier samples from an overall 3D view for a soybean oil acid value model [29, 30] and a wheat straw biomass model [30, 31]. Compared with variance, Mahalanobis distance, and Hotelling statistics as traditional outlier removal methods, the proposed method improves the results clearly when the 3D view analysis is repeated several times. The coefficients of determination of the final calibration models for soybean oil acid value and wheat straw biomass rose from 0.8653109 and 0.843431 to 0.966022 and 0.934227, respectively. The relative standard deviations (RSD) also decreased to 5% and 8%.

2. Materials and Methods

2.1. Samples Collection

Wheat straw and soybean oil served as the experimental objects; wheat straw biomass (during fermentation) and the acid value of soybean oil were measured, respectively. A total of 123 wheat bran samples were collected from different places in Heilongjiang Province. The soybean oil samples were prepared by a manufacturing enterprise with various acid values. The Kennard-Stone algorithm, which is based on the Euclidean distance between the absorbance spectra of samples, was used to select the most representative samples as the calibration set; the sample distribution is shown in Table 1. The unit of soybean oil acid value is mg KOH/g, the mass of potassium hydroxide required to neutralize the free fatty acids in 1 g of fat. The unit of wheat bran straw biomass (fermenting microorganisms) is mg/g, which represents microbial cell density per unit volume. The sample classification is shown in Tables 2 and 3.
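The Kennard-Stone selection mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' Matlab code; the function name `kennard_stone` and the toy data are hypothetical. It starts from the two most distant spectra and then repeatedly adds the sample whose nearest already-selected neighbor is farthest away.

```python
import math


def kennard_stone(spectra, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule.

    spectra: list of equal-length absorbance vectors (one per sample).
    """
    n = len(spectra)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")

    # Pairwise Euclidean distances between spectra.
    dist = [[math.dist(a, b) for b in spectra] for a in spectra]

    # Seed the set with the two most distant samples.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]

    # Add the sample whose closest selected neighbor is farthest away.
    while len(selected) < n_select:
        best = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example with four two-point "spectra" (made-up numbers).
toy = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [5.0, 0.0]]
calibration_idx = kennard_stone(toy, 3)
```

The samples not returned would form the validation set; how many samples go into each set is a choice the sketch leaves to the caller.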

2.2. Spectral Acquisition

In the experiment, a Thermo Antaris II near-infrared spectrometer was used to scan the soybean oil and wheat bran samples over the range 830 nm to 2500 nm (12000 cm−1–4000 cm−1) at a resolution of 8 cm−1. The liquid transmission mode and the integrating sphere mode were used, respectively; an empty transparent glass cell served as the reference for transmission scanning, and air served as the reference for integrating sphere scanning. Before sample scanning, the number of background scans and the number of sample scans were both set to 32. The scan results are shown in Figure 1, in the order soybean oil spectra and then wheat straw biomass spectra.

2.3. Samples’ Chemical Calibration

(1) Oil samples are dissolved in a neutral mixed alcohol-ether solvent, and the free fatty acids are then titrated with a standard alkali solution. The acid value is calculated from the mass of the oil and the amount of alkali consumed. The reagents, instruments and appliances, and operating procedure are as follows.

① Reagent. The following are used: 0.1 mol/L KOH (or sodium hydroxide) standard solution; neutral ether-ethanol (2 : 1) mixed solvent, titrated to neutral with 0.1 mol/L alkali before use; and 1 g/100 mL phenolphthalein-ethanol indicator.

② Instruments and Appliances. The following are used: a 25 or 50 mL burette; a 250 mL Erlenmeyer flask; a balance with a sensitivity of 0.001 g; and a volumetric flask, pipette, weighing bottle, reagent bottle, graduated flask, and beakers.

③ Operational Approach. The main steps are as follows. First, a uniform specimen (3–5 g) is weighed into the conical flask. Second, neutral mixed solvent (50 mL) is added, and the flask is shaken until the specimen dissolves completely. Then 3 drops of phenolphthalein indicator are added and mixed. Lastly, the solution is titrated with 0.1 N potassium hydroxide solution until a reddish color appears and persists for 30 s, and the volume (mL) of potassium hydroxide solution consumed is recorded. Finally, the oil acid value (mg KOH/g) is calculated by the formula

\[ \mathrm{AV} = \frac{V \times c \times 56.1}{m}, \]

where V = volume of potassium hydroxide solution consumed in the titration (mL), c = the concentration of KOH solution (mol/L), 56.1 = the molar mass of KOH (g/mol), and m = sample mass (g).

The allowed error of the experimental results does not exceed 0.2 mg KOH/g, and the average is taken as the determination result. The acid values of the 52 oil samples ranged from 0.473 to 3.102 mg KOH/g.
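As a worked illustration of the acid value calculation described above (titrant volume times KOH concentration times 56.1, divided by sample mass), the sketch below computes the value for one titration and averages replicate titrations. The function names are hypothetical, the numbers are made up, and treating the 0.2 mg KOH/g allowance as the maximum spread between replicates is an assumption.

```python
KOH_MOLAR_MASS = 56.1  # g/mol, as stated in the text


def acid_value(volume_ml, koh_conc_mol_l, sample_mass_g):
    """Acid value in mg KOH/g from one titration.

    mL * mol/L gives mmol of KOH; times g/mol gives mg of KOH.
    """
    if sample_mass_g <= 0:
        raise ValueError("sample mass must be positive")
    return volume_ml * koh_conc_mol_l * KOH_MOLAR_MASS / sample_mass_g


def mean_acid_value(replicates, max_spread=0.2):
    """Average replicate titrations given as (volume, conc, mass) tuples.

    Assumption: replicates are rejected when they differ by more than
    max_spread mg KOH/g.
    """
    values = [acid_value(v, c, m) for v, c, m in replicates]
    if max(values) - min(values) > max_spread:
        raise ValueError("replicates differ by more than the allowed error")
    return sum(values) / len(values)


# Made-up example: 1.00 mL of 0.1 mol/L KOH for a 5.00 g oil specimen.
example_av = acid_value(1.00, 0.1, 5.00)
```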

(2) The biomass of the fermentation microorganisms is the research object in this experiment. Monitoring the biomass of the bacterial colony, which is measured by the glucosamine method, is vital. The biomass represents microbial density per unit volume. The preparation and determination process is as follows.

① Reagents. Concentrated sulfuric acid, acetyl acetone, sodium hydroxide, peptone, sodium nitrate, magnesium sulfate, potassium dihydrogen phosphate, and dibenzaldehyde are used.

② Instruments and Appliances. The following are used: UV3000 UV-visible spectrophotometer, FA1004 electronic balance, and SPX-250B-Z-type incubator.

③ Operation Calculation. First, dry cells (0.1 g) are weighed precisely, and solid-state fermentation substrate (0.5 g) is soaked in sulfuric acid (2 mL, 60%) for 24 hours; the acid is then diluted to 1 mol/L, and the mixture is placed in a flask (250 mL) and heated for one hour at a high pressure of 9.8 × 10^4 Pa. After cooling, the diluted solution is neutralized to pH 7.0 with sodium hydroxide (1 mol/L) and made up to a volume of 100 mL. Second, following the Elson-Morgan method, the absorbance is measured at 530 nm on five parallel samples per specimen, and the average is taken as the absorbance of the sample. Finally, distilled water (2 mL) is used as the blank when measuring the glucosamine content of the solid-state fermentation matrix.

Biomass is estimated from two quantities: the glucosamine content per unit mass of culture (%) and the glucosamine content per unit mass of bacteria (%).

3. Results and Discussion

For the soybean acid value model and the wheat straw biomass model, we compare the variance, Mahalanobis distance, and Hotelling statistics methods with the 3D leverage and variance method. We select the 4450 cm−1–5000 cm−1 band as the feature band of the soybean acid value model and the 9000 cm−1–10000 cm−1 band as the feature band of the wheat straw biomass model. The Unscrambler 10.2 software and our own Matlab programs are used for the analysis. The waveband selection is shown in Tables 4 and 5.

3.1. Variance Analysis

First, the variance of the samples in the calibration model was analyzed, where X (residual sample X-variance) refers to the sample spectra and Y (residual sample Y-variance) refers to the chemical values of the samples. Under normal circumstances, the residuals are calculated for Y: the greater the Y-variance of a sample, the weaker the ability of the calibration model to fit it and the lower its explanatory power. Figure 2 shows that (Figure 1(a) shows the spectrum of the soybean acid value samples and Figure 1(b) the spectrum of wheat straw) in the soybean acid value model two samples, numbers 44 and 46, have significantly higher variance, and in the wheat straw bran biomass model four samples, numbers 15, 18, 43, and 99, have relatively high variance; these can be regarded as outlier (abnormal) samples to be removed.

3.2. Mahalanobis Distance Analysis

In near-infrared spectroscopy, Euclidean distance and Mahalanobis distance are important methods for identifying abnormal samples. Compared with Euclidean distance, Mahalanobis distance is more widely used because it takes into account the correlations between the various features. In this experiment, the Mahalanobis distance was calculated for the two calibration models; the results are shown in Figure 3. As can be seen from the figure, in the soybean acid value model the 52 samples are strongly representative and their Mahalanobis distances differ little overall; the two samples with the largest distances, numbers 40 and 49, are removed. In the wheat bran straw biomass model, the Mahalanobis distances fluctuate more because of the scattered sample distribution (min 2.9–max 99.2); samples 47, 52, and 90 differ significantly from the rest and should be considered for rejection.
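A minimal NumPy sketch of the per-sample Mahalanobis distance is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the data are synthetic, and a pseudo-inverse is used because the covariance of raw spectra is often singular (in practice the distance is commonly computed on a few principal component scores instead).

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Pseudo-inverse keeps the sketch usable when cov is rank deficient.
    cov_inv = np.linalg.pinv(cov)
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))


# Synthetic example: 30 samples, 3 features, sample 7 shifted far away.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 3))
data[7] += 10.0
distances = mahalanobis_distances(data)
most_distant = int(np.argmax(distances))
```

Samples whose distance stands well apart from the rest, like the shifted sample here, are the candidates for rejection; the cut-off itself is left to the analyst.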

3.3. Hotelling Statistical Analysis

The Hotelling statistic is an important statistic in multivariate statistical analysis. It is a two-dimensional elliptic model based on Principal Component Analysis (PCA) and is mainly used to test the stability of multivariate data. If the principal components of all observations are stable, the statistic stays at a stable level, and abnormal situations are detected with the associated critical limit. Figure 4 shows that, with a limit of 5%, samples 48 and 49 in the soybean oil acid value model clearly deviate from the center and lie far beyond the limit, and samples 13, 14, and 17 in the wheat straw bran biomass model are also beyond the limit; these can be considered samples to be excluded.
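The Hotelling statistic on PCA scores can be sketched as below. This is a hedged illustration: the function names and the synthetic data are hypothetical, and the critical limit is passed in by the caller because deriving it from the F distribution at a given significance level needs a statistics library that the sketch does not assume.

```python
import numpy as np


def hotelling_t2(X, n_components=2):
    """Hotelling T^2 of each sample on the first PCA components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # PCA through the singular value decomposition of the centered data.
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    # Each squared score is scaled by the variance of its component.
    return np.sum(scores ** 2 / score_var, axis=1)


def beyond_limit(t2, limit):
    """Indices of samples whose T^2 exceeds a caller-supplied limit."""
    return np.flatnonzero(np.asarray(t2) > limit)


# Synthetic example: 40 samples, 6 variables, sample 5 shifted far away.
rng = np.random.default_rng(1)
data = rng.normal(size=(40, 6))
data[5] += 8.0
t2 = hotelling_t2(data, n_components=2)
```

With two components, plotting the two score columns with the ellipse defined by the limit gives the two-dimensional view described in the text.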

3.4. 3D View Analysis of Variance and Leverage Value

The leverage [32–34] and variance [35] values are calculated for the soybean acid value model and the wheat straw biomass model. The leverage value is very useful for detecting whether a sample lies far from the center of the model space. Samples with high leverage are likely to be outliers and also have a great influence on model accuracy. To judge the outlier (abnormal) samples comprehensively, a 3D view of leverage and variance is established with the residual sample X-variance as the x-axis, the leverage as the y-axis, and the residual sample Y-variance as the z-axis, as shown in Figure 5. Over the whole model fitting process, most samples are distributed uniformly near the center of the 3D view, but a small portion lies far from the center, with very large variance and leverage distances from it. Figure 5 shows that samples 44 and 46 in the soybean oil acid value model deviate significantly from the center, which is consistent with the variance analysis; in the wheat straw bran biomass model, samples 15, 23, 70, and 83 are clearly abnormal. Judging from the combined values in the three directions, these points can be excluded.
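The three coordinates of the 3D view can be sketched with a plain NIPALS PLS1 fit as below. This is an illustration under assumptions, not the Unscrambler computation: the function names are hypothetical, the residual variances omit any degree-of-freedom correction, and the ranking by scaled distance from the median is only one possible stand-in for the visual judgment described in the text.

```python
import numpy as np


def pls1_outlier_axes(X, y, n_components):
    """Residual X-variance, leverage, and residual Y-variance per sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = X.shape[0]
    E = X - X.mean(axis=0)          # X residuals, start as centered X
    f = y - y.mean()                # y residuals, start as centered y
    leverage = np.full(n, 1.0 / n)  # contribution of the centering step

    for _ in range(n_components):
        w = E.T @ f
        norm = np.linalg.norm(w)
        if norm == 0.0:             # nothing left to explain
            break
        w /= norm
        t = E @ w                   # score vector of this component
        tt = t @ t
        p = E.T @ t / tt            # X loading
        q = (f @ t) / tt            # y loading
        E = E - np.outer(t, p)      # deflate X
        f = f - q * t               # deflate y
        leverage += t ** 2 / tt

    x_res_var = np.mean(E ** 2, axis=1)
    y_res_var = f ** 2
    return x_res_var, leverage, y_res_var


def rank_by_distance(x_res_var, leverage, y_res_var):
    """Sample indices ordered from farthest to nearest to the 3D center."""
    axes = np.column_stack([x_res_var, leverage, y_res_var])
    scale = axes.std(axis=0)
    scale[scale == 0.0] = 1.0
    dist = np.linalg.norm((axes - np.median(axes, axis=0)) / scale, axis=1)
    return np.argsort(dist)[::-1]


# Synthetic example: two latent factors, sample 4 has a wrong y value.
rng = np.random.default_rng(0)
T = rng.normal(size=(30, 2))
T[4] = 0.0
X_demo = T @ rng.normal(size=(2, 8)) + 0.01 * rng.normal(size=(30, 8))
y_demo = T[:, 0] + 0.01 * rng.normal(size=30)
y_demo[4] += 20.0
xv, lev, yv = pls1_outlier_axes(X_demo, y_demo, n_components=2)
order = rank_by_distance(xv, lev, yv)
```

Plotting `xv`, `lev`, and `yv` as a 3D scatter reproduces the kind of view discussed above; samples at the head of `order` are the ones to inspect first.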

3.5. Model Demonstration

The outlier exclusion methods above are verified by cross validation. Figure 6 shows the PLSR (Partial Least Squares Regression) models established with the 52 soybean oil samples and the 123 wheat straw bran samples without any outlier rejection. The coefficient of determination is 0.8653109 for the soybean oil acid value model and 0.870438 for the wheat straw bran biomass model; neither is high. The marked area in Figure 7 shows that samples 44 and 46 in the soybean oil acid value model are separated from the calibration curve, which is consistent with our previous conclusions. In the wheat straw bran biomass model, because of some stray samples, the fitted value of sample 1 is 0, which deviates greatly from its original chemical value.

The accuracy of the models after outlier exclusion by the various methods is shown in Table 6. In the soybean oil acid value model, the outliers selected by the variance method are consistent with those from the 3D view analysis, and the best number of main factors is 5 for both. However, the accuracy of the calibration model after Mahalanobis distance analysis or Hotelling's statistics is even slightly lower. In the wheat straw bran biomass model, the best method is the 3D view analysis, with a best number of main factors of 27; the variance, Hotelling's statistics, and Mahalanobis distance methods also improve on the original spectra.

The analysis of Tables 6 and 7 shows that, for both models, the best results after outlier exclusion come from the 3D view analysis of variance and leverage. In the wheat straw biomass model, the coefficient of determination of calibration rises to 0.911626 and the root mean square error (RMSE) is 9.060135. In practice, we found that the accuracy of the model can be improved further by repeating the 3D view analysis. We therefore continued in the same way: Figure 7(a) shows the wheat straw bran biomass model with samples 18, 39, and 74 excluded; the same panel is also described with samples 18 and 21 excluded. In Figures 7(b)–7(d), samples 27, (9, 7), and 91 are excluded in turn; the coefficient of determination of the calibration model rises to 0.934227 and the RMSE falls to 8.4943037. The accuracy of the calibration model thus improves significantly after several rounds of outlier exclusion.
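The two figures of merit used in this section can be computed as in the sketch below. The function name and the numbers are hypothetical, and the definition of the coefficient of determination as one minus the ratio of residual to total sum of squares is an assumption; some software reports the squared correlation between measured and predicted values instead.

```python
import math


def r2_and_rmse(measured, predicted):
    """Coefficient of determination and RMSE for paired values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    mean = sum(measured) / n
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured values have zero variance")
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Made-up reference values and cross-validated predictions.
r2, rmse = r2_and_rmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Recomputing both values after each round of exclusion shows whether removing the flagged samples actually improved the calibration.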

4. Conclusions

The experiments and comparative analysis on the soybean oil acid value model and the wheat straw bran biomass model show that, in near-infrared spectroscopy, outlier exclusion based on the 3D view analysis of variance versus leverage is effective. By combining the residual sample X-variance, the leverage, and the residual sample Y-variance, the method judges abnormal samples more comprehensively and accurately, and the accuracy of the model improves significantly after several rounds of outlier exclusion. The method also offers a new direction for excluding outlier samples in near-infrared spectroscopy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to acknowledge the financial support from the National High-tech R&D Program of China (863 Program) (2013AA102303); Natural Science Foundation of Heilongjiang Province of China (F201402); and Key Technologies R&D Program of Harbin (2013AA6BN010).