Abstract

Quality assessment of diesel fuel is essential for society, but standard methods are costly and time-consuming. Therefore, this study aimed to develop an analytical method capable of simultaneously determining eight diesel quality parameters (density; flash point; total sulfur content; distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery; cetane index; and biodiesel content) through attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy and the multivariate regression method partial least squares (PLS). For this purpose, the quality parameters of 409 samples were determined using standard methods, and their spectra were acquired in the range of 4000–650 cm−1. The use of the multivariate filters generalized least squares weighting (GLSW) and orthogonal signal correction (OSC) was evaluated to improve the signal-to-noise ratio of the models. Likewise, four variable selection approaches were tested: manual exclusion, forward interval PLS (FiPLS), backward interval PLS (BiPLS), and genetic algorithm (GA). The multivariate filters and variable selection algorithms generated better fitted and more accurate PLS models. According to the validation, the FTIR/PLS models presented accuracy comparable to the reference methods and, therefore, the proposed method can be applied in routine diesel monitoring to significantly reduce costs and analysis time.

1. Introduction

Diesel fuel is a petroleum-derived product of great importance for a country’s economy since most of the transportation of industrial and agricultural products depends on diesel vehicles [1, 2]. This fuel is a complex mixture composed mainly of paraffinic, olefinic, and aromatic hydrocarbons with 8 to 28 carbon atoms and, in lower concentrations, substances containing oxygen, nitrogen, sulfur, and metals [3–5]. The diesel composition is influenced by several factors, such as the origin of the crude oil, operating variables of the refinery, the addition of fractions from the cracking process, and the insertion of additives to increase engine performance [3]. Therefore, the fuel quality is susceptible to many variables until the fuel reaches the consumer. From this perspective, the monitoring of diesel quality parameters is extremely important for commercialization, engine performance, consumer rights, business competition, and environmental risks [5, 6].

The assays performed to ensure diesel quality are based on standardized procedures that require specific equipment to determine each physicochemical parameter. With the standard methods, quality assessment requires considerable sample volume and analysis time, in addition to the great expense of equipment maintenance and several specialized analysts [7–12]. Therefore, the development of methods to monitor diesel quality accurately, quickly, and in an environmentally friendly manner is highly necessary [13]. This becomes possible with attenuated total reflection Fourier transform infrared (ATR-FTIR) spectroscopy associated with multivariate regression methods such as partial least squares (PLS). Studies have demonstrated the possibility of predicting some diesel properties using midinfrared spectroscopy combined with chemometric tools [14–18]; some aimed at the prediction of biodiesel content [16], and others were devoted to the identification of diesel adulteration with waste vegetable oils [17, 18].

In the USA, the European Community, and Japan, the regulations of diesel properties for consumption are established, respectively, by ASTM D975, EN 590, and JIS K2204 [19–21]. In Brazil, the regulation and supervision of fuels are performed according to ANP (National Agency of Petroleum, Natural Gas, and Biofuels) Resolution no. 30/2016, which requires that assays be conducted according to ASTM, EN, or NBR standards [22]. According to this resolution, at least eight quality parameters of diesel are analyzed in official monitoring laboratories: aspect; color; density; flash point; total sulfur; volatility (distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery); cetane index; and biodiesel content [23].

The development of an alternative method for determining the physicochemical parameters of diesel through ATR-FTIR has several advantages for routine quality monitoring. The use of ATR-FTIR can reduce costs, increase analytical frequency, require a smaller sample volume, and provide the determination of all required parameters using a single instrument. Moreover, infrared spectrometers are already purchased by monitoring laboratories for the determination of biodiesel content in diesel according to EN 14078.

In view of the high costs and long time required to assess diesel quality by standard methods, this work aimed at the development of a simple and fast analytical method based on ATR-FTIR analysis and the PLS regression method to determine eight diesel quality parameters simultaneously. In this study, multivariate filters and variable selection techniques, such as genetic algorithm (GA), forward interval PLS (FiPLS), and backward interval PLS (BiPLS), were evaluated to obtain the best predictive ability of the models.

2. Materials and Methods

2.1. Samples

For eight months, the quality parameters of 3549 samples of diesel fuel were analyzed by Cempeqc (Center for Monitoring and Research of the Quality of Fuels, Biofuels, Crude Oil and Derivatives) according to ASTM and EN standards. The samples were stored at 10°C for further spectroscopic analysis. The standards and equipment used in the determination of quality parameters are presented in Table 1.

Although an extensive sample set can provide greater robustness to a prediction model, this work aimed at the development of a simple method that can be easily reproduced by other laboratories. Therefore, we selected about 10% of the diesel samples for spectroscopic and chemometric analysis. The 3549 diesel samples were divided into groups using hierarchical cluster analysis (HCA) to select the most representative samples. An HCA was executed for each month, and the physicochemical parameters were used as variables. The clustering was performed using 60% similarity, the complete linkage method, and autoscale preprocessing to give the same influence to all variables. The software used for HCA was Pirouette (Infometrix), version 3.11. At the end of the eight months, 409 diesel samples had been selected.
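As a rough illustration of the sample selection described above, the following Python sketch autoscales the physicochemical parameters, builds a complete-linkage dendrogram, and cuts it at 60% similarity. It is not the Pirouette procedure itself. The similarity scale (1 − d/d_max), the rule of keeping the sample closest to each cluster centroid, and the function name `select_representatives` are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def select_representatives(params, similarity=0.60):
    """Return the row index of one representative sample per HCA cluster.

    params: (n_samples, n_parameters) array of physicochemical values.
    """
    X = np.asarray(params, dtype=float)
    # Autoscale: mean-center and scale each variable to unit variance.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Complete-linkage dendrogram on Euclidean distances.
    Z = linkage(Xs, method="complete", metric="euclidean")
    # Assumption: similarity = 1 - d / d_max, so 60% similarity
    # corresponds to cutting the dendrogram at 0.4 * d_max.
    d_cut = (1.0 - similarity) * Z[:, 2].max()
    labels = fcluster(Z, t=d_cut, criterion="distance")
    representatives = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        centroid = Xs[members].mean(axis=0)
        dist = np.linalg.norm(Xs[members] - centroid, axis=1)
        # Assumption: keep the member closest to the cluster centroid.
        representatives.append(int(members[np.argmin(dist)]))
    return sorted(representatives)
```

In this sketch the function would be called once per month of data, and the returned indices from all months pooled to form the reduced sample set.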

2.2. Spectroscopic Analysis

The infrared spectra of the 409 samples were obtained with a Nicolet 6700 FTIR spectrometer (Thermo Scientific, Waltham, USA) using 32 scans and 4 cm−1 resolution. A Smart ARK ATR sampling accessory with a ZnSe crystal and a 45° angle of incidence was used to acquire the infrared spectra. The ATR accessory required one milliliter of each sample, and a new background spectrum was acquired every hour to reduce baseline shifting and ambient variations. The temperature and relative humidity during the analysis were 20.7 ± 2.0°C and 40 ± 9%, respectively.

2.3. Chemometric Analysis
2.3.1. Model Development

The chemometric analysis was executed using Matlab 2013a (MathWorks) with PLS toolbox 7.3.1 (Eigenvector Research Inc.). The FTIR spectra were converted into vectors of 1738 variables and the combination of the vectors resulted in the matrix X of dimension 409 by 1738. Prior to the development of PLS models, the sample set was separated into two-thirds for calibration (273 samples) and one-third for validation (136 samples). The Onion algorithm was used to select the samples with less covariance (based on distance from the mean) for each set and, consequently, to obtain greater sample representativeness in both sets [24, 25]. The algorithm was performed for each parameter to ensure that the calibration set had the largest range of reference values.

Initially, the PLS models were developed using the full spectra (full X-block) preprocessed by the mean center or autoscale, depending on the best fit. The number of latent variables (LV) was chosen based on the root mean square errors of calibration (RMSEC), cross-validation (RMSECV), and prediction (RMSEP) in order to minimize the prediction errors and avoid model overfitting [26, 27]. The cross-validation was performed using venetian blinds mode with 10 splits.
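The cross-validation scheme described above can be sketched in Python as follows. This is a minimal illustration, not the PLS toolbox implementation: it uses a plain NIPALS PLS1 on mean-centered data, assigns sample i to split i mod 10 (one common reading of "venetian blinds"), and reports only the RMSECV, whereas the study also considered RMSEC and RMSEP when choosing the number of latent variables. The function names are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_lv):
    """Fit PLS1 by NIPALS on mean-centered data; return (x_mean, y_mean, b)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w = w / np.linalg.norm(w)      # weight vector
        t = E @ w                      # scores
        tt = t @ t
        p = E.T @ t / tt               # X loadings
        c = f @ t / tt                 # y loading
        E = E - np.outer(t, p)         # deflate X
        f = f - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)  # regression vector
    return x_mean, y_mean, b


def venetian_blinds(n_samples, n_splits=10):
    """Yield (train, test) index arrays; split k tests samples k, k+s, k+2s, ..."""
    idx = np.arange(n_samples)
    for k in range(n_splits):
        test = idx[k::n_splits]
        yield np.setdiff1d(idx, test), test


def rmsecv(X, y, n_lv, n_splits=10):
    """Root mean square error of cross-validation for a given number of LVs."""
    press = 0.0
    for train, test in venetian_blinds(len(y), n_splits):
        x_mean, y_mean, b = pls1_fit(X[train], y[train], n_lv)
        pred = (X[test] - x_mean) @ b + y_mean
        press += np.sum((y[test] - pred) ** 2)
    return np.sqrt(press / len(y))
```

Evaluating `rmsecv` over a range of `n_lv` values and plotting the result against the calibration and prediction errors is one way to look for the point where additional latent variables stop improving the model.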

Then, statistical tests were applied according to ASTM E1655 [28] to detect the presence of outliers in the calibration and validation sets. Outliers include high leverage samples and samples whose reference values are inconsistent with the model. Therefore, samples with high leverage and studentized residuals were excluded from the sample sets.

2.3.2. Preprocessing Evaluation

Spectral data usually present baseline shifting due to instrumental variations and reflectance deviations [29]. The baseline shifting is typically corrected by applying the first or second derivative, or by polynomials that correct the displacement based on a standard spectrum, for example, multiplicative scatter correction (MSC) and standard normal variate (SNV). In addition, digital filters such as smoothing are also used to improve the signal-to-noise ratio of spectral data [28]. Multivariate filters, such as generalized least squares weighting (GLSW) and orthogonal signal correction (OSC), are less common preprocessing methods, but these filters are very useful to eliminate baseline shifting and increase the signal-to-noise ratio [30–33]. Therefore, the following preprocessing methods were evaluated in modeling: mean center, autoscale, Savitzky–Golay smoothing and derivatives, SNV, MSC, GLSW, and OSC.
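Two of the simpler corrections named above, SNV and MSC, can be sketched in a few lines of Python. This is a generic illustration of the textbook definitions, assuming spectra are stored as rows of an array; it does not reproduce the PLS toolbox settings used in the study, and GLSW and OSC are not covered. When no reference is given, the mean spectrum is used as the MSC reference, which is a common default and an assumption here.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    X = np.asarray(spectra, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        # Regress each spectrum on the reference: x ~ offset + slope * ref.
        slope, offset = np.polyfit(ref, x, 1)
        corrected[i] = (x - offset) / slope
    return corrected
```

Both functions remove an additive offset and a multiplicative scaling per spectrum; SNV does so without a reference, while MSC maps every spectrum onto the scale of the chosen reference.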

2.3.3. Variable Selection Methods

Many studies have shown that variable selection is an efficient way to increase the signal-to-noise ratio and, as a consequence, improve the predictive ability of the model [34, 35]. When the noise dominates over the information related to the property of interest, the removal of variables often leads to better accuracy and performance of the analytical method [35, 36]. The selection of variables can be performed based on spectral knowledge (manual approach) or through algorithms that search for the variables that provide the minimum prediction error to the model. Some of the most popular methods for selecting variables are the interval selection methods, such as forward interval PLS (FiPLS) and backward interval PLS (BiPLS), and the genetic algorithm (GA), a technique that employs a probabilistic and nonlocal search process which manipulates binary strings with the coded experimental variables. Details on these variable selection methods can be found in [35].

In this study, four different approaches were evaluated to select variables: manual exclusion, FiPLS, BiPLS, and GA. The manual exclusion was carried out by evaluating the spectral residuals and loadings plots. Spectral regions with no absorbance or high relative standard deviation (RSD) were excluded from the data, and the results were compared with those obtained using the full spectra. Both iPLS methods were executed using an interval size of 25 variables, and the number of intervals was determined by the algorithm to obtain the lowest value of RMSECV. The GA was performed with a population size of 128 models, a window width of one variable, initial terms of 30%, a mutation rate of 0.5%, double crossover, 200 generations, and the PLS regression method. All approaches were performed using only the calibration set to avoid overestimated results.
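The forward interval search described above can be illustrated with a short Python sketch. It is a simplified stand-in and not the FiPLS routine of the PLS toolbox: the variables are split into contiguous intervals, and intervals are added greedily while the cross-validated error keeps decreasing. To stay short and self-contained, the scorer is an ordinary least-squares model evaluated with venetian-blinds cross-validation instead of a PLS model, and the function names are hypothetical.

```python
import numpy as np


def cv_rmse(X, y, n_splits=10):
    """Venetian-blinds cross-validated RMSE of a least-squares model.

    Assumption: least squares is used here as a stand-in for PLS.
    """
    n = len(y)
    press = 0.0
    for k in range(n_splits):
        test = np.arange(k, n, n_splits)
        train = np.setdiff1d(np.arange(n), test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        b, *_ = np.linalg.lstsq(X[train] - x_mean, y[train] - y_mean, rcond=None)
        pred = (X[test] - x_mean) @ b + y_mean
        press += np.sum((y[test] - pred) ** 2)
    return np.sqrt(press / n)


def forward_interval_selection(X, y, interval_size=25):
    """Greedily add the interval that most reduces the cross-validated error."""
    n_vars = X.shape[1]
    intervals = [np.arange(s, min(s + interval_size, n_vars))
                 for s in range(0, n_vars, interval_size)]
    selected, best = [], np.inf
    remaining = list(range(len(intervals)))
    while remaining:
        scores = {}
        for j in remaining:
            cols = np.concatenate([intervals[i] for i in selected + [j]])
            scores[j] = cv_rmse(X[:, cols], y)
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no remaining interval improves the error
        best = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best
```

A backward variant would start from all intervals and remove, at each step, the interval whose exclusion lowers the cross-validated error the most.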

2.3.4. Model Validation

The PLS models were statistically evaluated by figures of merit (FOM) according to ASTM E1655 and Valderrama et al. [28, 37]. The accuracy of the models, defined as the degree of agreement between a measured value and a reference value, was assessed by the values of RMSECV, RMSEP, correlation coefficients (r), average relative errors (ARE), and relative percent difference (RPD). The RMSECV was obtained by cross-validation using the venetian blinds mode with 10 splits, and the RMSEP was obtained from the validation samples, which were measured independently from the calibration samples. Then, the RMSECV and RMSEP were compared with the reproducibility of the reference method. The ARE was used as a parameter to evaluate the magnitude of the prediction errors in relation to the reference values [38]. The ARE value was calculated by

$$\mathrm{ARE} = \frac{100}{n_v} \sum_{i=1}^{n_v} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{1}$$

where $y_i$ and $\hat{y}_i$ correspond, respectively, to the reference value and the value predicted by the model, and $n_v$ is the number of validation samples. The relative percent difference (RPD) was obtained as the ratio of the standard deviation of the validation set reference values to the RMSEP value. RPD values above 2.5 indicate that the model has acceptable accuracy over the measurement range, while values above 10 are considered excellent for alternative methods [39].
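The accuracy figures of merit described above reduce to a few lines of arithmetic. The Python sketch below assumes that the ARE is the mean absolute relative error expressed in percent and that the RPD uses the sample standard deviation (n − 1 denominator) of the reference values; the function names are illustrative.

```python
import math
import statistics


def rmsep(y_ref, y_pred):
    """Root mean square error of prediction over the validation samples."""
    n = len(y_ref)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(y_ref, y_pred)) / n)


def average_relative_error(y_ref, y_pred):
    """Mean absolute error relative to the reference value, in percent."""
    n = len(y_ref)
    return 100.0 / n * sum(abs(r - p) / abs(r) for r, p in zip(y_ref, y_pred))


def rpd(y_ref, y_pred):
    """Standard deviation of the reference values divided by the RMSEP."""
    return statistics.stdev(y_ref) / rmsep(y_ref, y_pred)
```

For example, with reference values 10, 20, 30, 40 and predictions 11, 19, 31, 39, every error is 1 unit, so the RMSEP is 1, the ARE is about 5.2%, and the RPD is about 12.9.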

Linearity is an important parameter to evaluate the performance of the model since the PLS regression method is not suitable for nonlinear relationships between the variables x and the property of interest [40]. The linearity corresponds to the ability of the model to provide results directly proportional to the property of interest. One way to evaluate this parameter in multivariate models is through plots of the residuals of the calibration and validation samples. If the distribution of the residuals is random, it can be said that the model shows a linear behavior. In addition to the residual plots, the linearity was also evaluated by the values of the determination coefficients (R2) and bias. This last FOM indicates the presence of systematic errors in the model. Bias can be assessed by a t-test for the validation samples at the 95% confidence level. The average bias was calculated by summing the differences between the reference value ($y_i$) and the predicted value ($\hat{y}_i$) and dividing by the number of validation samples ($n_v$) [28]:

$$\mathrm{bias} = \frac{\sum_{i=1}^{n_v} (y_i - \hat{y}_i)}{n_v} \tag{2}$$

Then, the standard deviation of validation errors (SDV) was calculated as

$$\mathrm{SDV} = \sqrt{\frac{\sum_{i=1}^{n_v} \left[ (y_i - \hat{y}_i) - \mathrm{bias} \right]^2}{n_v - 1}} \tag{3}$$

and finally, the value of $t_{\mathrm{bias}}$ was given by

$$t_{\mathrm{bias}} = \frac{|\mathrm{bias}| \sqrt{n_v}}{\mathrm{SDV}} \tag{4}$$

If the value obtained for $t_{\mathrm{bias}}$ was greater than the critical value for $n_v - 1$ degrees of freedom, then the multivariate model presented significant systematic errors.
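The bias test described above can be worked through with a small Python sketch. It assumes the usual ASTM E1655 form, in which the bias is the mean validation error, the SDV is the standard deviation of the errors around that bias with n_v − 1 in the denominator, and t_bias = |bias|·√n_v / SDV. The critical t value is passed in by the caller, and the function name is illustrative.

```python
import math


def bias_t_test(y_ref, y_pred, t_critical):
    """Return (bias, sdv, t_bias, significant) for the validation samples."""
    nv = len(y_ref)
    errors = [r - p for r, p in zip(y_ref, y_pred)]
    bias = sum(errors) / nv
    # Standard deviation of the validation errors around the bias.
    sdv = math.sqrt(sum((e - bias) ** 2 for e in errors) / (nv - 1))
    t_bias = abs(bias) * math.sqrt(nv) / sdv
    # Systematic error is significant when t_bias exceeds the critical value.
    return bias, sdv, t_bias, t_bias > t_critical
```

With reference values 10, 20, 30, 40, 50 and predictions 9, 19, 30, 39, 48, the errors are 1, 1, 0, 1, 2, giving a bias of 1.0, an SDV of about 0.707, and a t_bias of about 3.16. This exceeds the two-sided 95% critical value of 2.776 for 4 degrees of freedom, so this toy model would be flagged as biased.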

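The precision of the models was evaluated by the analysis of 14 replicates of 30 diesel samples performed on different days. The average of the relative standard deviations (RSD) and the intermediate precision, calculated through (5), where n is the number of samples, m is the number of replicates, $\hat{y}_{ij}$ is the predicted value for replicate j of sample i, and $\bar{\hat{y}}_i$ is the mean predicted value for sample i, were used as parameters [37]. Then, the intermediate precision was compared to the repeatability value of the reference method:

$$\text{intermediate precision} = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{m} \left( \hat{y}_{ij} - \bar{\hat{y}}_i \right)^2}{n(m - 1)}} \tag{5}$$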
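The two precision parameters described above can be sketched in Python as follows. The sketch assumes the pooled form of the intermediate precision, that is, the squared deviations of each replicate prediction from its sample mean, summed over n samples and m replicates and divided by n(m − 1). The nested-list input layout and the function names are illustrative.

```python
import math
import statistics


def intermediate_precision(predictions):
    """Pooled standard deviation; predictions[i][j] is replicate j of sample i."""
    n = len(predictions)
    m = len(predictions[0])
    ss = 0.0
    for replicates in predictions:
        mean_i = sum(replicates) / m
        ss += sum((v - mean_i) ** 2 for v in replicates)
    return math.sqrt(ss / (n * (m - 1)))


def mean_rsd(predictions):
    """Average relative standard deviation over all samples, in percent."""
    rsd = [100.0 * statistics.stdev(r) / statistics.mean(r) for r in predictions]
    return sum(rsd) / len(rsd)
```

For three samples with two replicates each, [[10, 12], [20, 20], [29, 31]], the pooled sum of squares is 4, so the intermediate precision is √(4/3), about 1.15 in the units of the property.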

3. Results and Discussion

3.1. Physicochemical Assays

The values of reproducibility and repeatability of the reference methods, the range of measured values of each quality parameter, and the number of samples in nonconformity with ANP Resolution no. 65 [41] are shown in Table 2. The quality parameter that presented the highest number of nonconforming samples was T10, followed by T85 and biodiesel content. As ANP Resolution no. 65 allows only a variation of 0.5% (v/v) of biodiesel content, most of the samples were in a narrow range of concentration. The same occurred with the total sulfur but in two different ranges of concentration due to the availability of two types of commercial diesel with distinct sulfur content.

3.2. Spectroscopic Analysis

The FTIR spectra of all diesel samples are represented in Figure 1. Functional groups of the constituents of samples could be observed by characteristic absorption bands of each group of atoms through the infrared spectra.

The most intense bands were caused by C–H stretching (3000–2800 cm−1) and angular deformations (1464 cm−1 and 1379 cm−1) [42]. The bands at 2350 cm−1 and 667 cm−1 resulted, respectively, from the asymmetrical stretch and angular deformation of CO2 molecules present in the atmosphere [43]. The presence of biodiesel in the samples was observed through the carbonyl absorption band (1750–1735 cm−1) and the aliphatic ester absorption band (1300–1000 cm−1). Aromatic compounds had characteristic bands of low intensity at 900–675 cm−1 from the C–H out-of-plane angular deformation. Sulfur is present in diesel as mercaptans and sulfides, and it was observed through the S–H axial stretch at 2600–2550 cm−1 and the C–S axial stretch at 700–650 cm−1 [43, 44]. The S–H stretch was very weak; however, few groups absorb in this region, so it was useful for the total sulfur parameter. The vibrational group assigned to each band is presented in Table 3.

3.3. Chemometric Analysis
3.3.1. Outlier Detection

During calibration, outlier statistics were applied to identify samples that had unusual leverage and studentized residuals. The outlier detection was performed prior to the variable selection because the exclusion of variables may reduce outlier detection capabilities of the model [28]. The number of outliers from each sample set is shown in Table 4. Considering the calibration and validation sample set with, respectively, 273 and 136 samples, the number of outliers (3% maximum) was not significant for the prediction models.

High studentized residual values may be the result of errors in the reference measurement, spectral acquisition error, reference value transcription, or even a failure of the model. Error in the spectral acquisition would lead to the presence of the same outlier in all models of prediction; however, different outliers were detected for each model. The absence of new outliers in the model after removal of the anomalous samples indicated that there was no failure in the model. Therefore, errors in the reference values were most likely responsible for the presence of outliers.

3.3.2. Preprocessing Evaluation

Baseline shifting in the raw spectra can be observed in Figure 1. The shifting may be the result of variations in the position of the ZnSe crystal since it was removed from the spectrometer for cleaning before each analysis. All the evaluated preprocessing methods (derivatives, MSC, SNV, GLSW, and OSC) provided baseline correction and higher correlation coefficients than mean center or autoscale preprocessing. Moreover, the multivariate filters (GLSW and OSC) provided models with greater explained variance using fewer latent variables (Table 5). Therefore, all models were preprocessed using OSC, except the model for T85, which presented a better fit with GLSW preprocessing.

3.3.3. Variable Selection

The exclusion of regions without information of sample constituents or low signal-to-noise ratio may improve the performance of the models [35]. In Figure 2, the noisy spectral regions can be observed through the relative standard deviation (RSD), represented by the blue line, and calculated from the mean of 14 replicates, represented by the red line. In addition, there was no absorption by the components of diesel in the ranges 4000–3100 cm−1 and 2450–1950 cm−1; thus, these spectral regions were excluded, and new models were developed.
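The manual exclusion described above amounts to computing an RSD spectrum from the replicates and masking out the uninformative wavenumber ranges. The Python sketch below illustrates both steps. The evenly spaced wavenumber axis used in the usage note (1738 points from 4000 to 650 cm−1) is an assumption about how the spectra were sampled, and the function names are hypothetical.

```python
import numpy as np


def rsd_per_variable(replicates):
    """Relative standard deviation (%) at each wavenumber across replicate spectra."""
    R = np.asarray(replicates, dtype=float)
    return 100.0 * R.std(axis=0, ddof=1) / np.abs(R.mean(axis=0))


def exclusion_mask(wavenumbers, excluded=((3100.0, 4000.0), (1950.0, 2450.0))):
    """Boolean mask that is True for the variables kept in the model."""
    wn = np.asarray(wavenumbers, dtype=float)
    keep = np.ones(wn.shape, dtype=bool)
    for low, high in excluded:
        # Drop every variable that falls inside an excluded range.
        keep &= ~((wn >= low) & (wn <= high))
    return keep
```

With `wn = np.linspace(4000, 650, 1738)`, the reduced data matrix would be obtained as `X[:, exclusion_mask(wn)]`, and a threshold on `rsd_per_variable` could be used to flag additional noisy regions.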

The RMSEP and correlation coefficient of validation (rval) obtained by the different variable selection approaches are presented in Table 6. The manual exclusion of variables provided better results only for the prediction models of flash point, T10, cetane index, and biodiesel content. The manual selection of variables had the risk of inadvertent exclusion of important variables for the modeling, impairing the performance of the model.

The selection of variables by interval selection methods reduces the values of RMSEC and RMSECV but might decrease the predictive ability of the model. The FiPLS method usually uses few intervals to correlate the spectral variables with the property of interest and, as a consequence, the calibration model is more susceptible to overfitting and the prediction of unknown samples is impaired, especially for properties that are correlated to several spectral variables. As the sulfur content is correlated only to the S–H and C–S bond variables, the FiPLS method provided the best fit to the model.

The distillation temperatures and the cetane index depend on the size and structure of the hydrocarbon chains of the diesel components; therefore, these are properties related to several functional groups with response in the midinfrared region. Thus, the selection methods such as BiPLS, which seek to exclude noisy variables rather than including variables more correlated to the property of interest, tend to be more suitable for optimization of these diesel parameters.

The analytical signals in the midinfrared region result in many correlated variables; that is, FTIR data present many collinearities. Normally, the problem of collinearity can be attenuated by the application of the genetic algorithm, since the spectral variables are manipulated in binary strings and the search for variables that provide a minor error of prediction is performed by a probabilistic and nonlocal process [35]. GA was the best variable selection approach for prediction models of density, flash point, T50, and biodiesel content.

In general, the selection of variables by iPLS and GA provided improvements in the predictive ability of the calibration models, except for T85, and the difference between the results obtained by both algorithms was not significant. The selected variables used in the best-fitted models are presented in the supplementary material attached to the article.

3.3.4. Model Performance

After defining the most appropriate variable selection method for each parameter, the figures of merit were determined for the prediction models (Table 7). The complexity of the diesel composition, consisting of hundreds of compounds, generates a large amount of information in the FTIR spectra and, therefore, the correlation between the matrix X and the property of interest requires a considerable number of LVs. Although the use of OSC reduces the collinearity problem and increases the captured variance of the X and y blocks, several analytical signals were correlated to the properties of diesel, so several LVs were required.

The accuracy of the models was evaluated by comparing the RMSEP values (Table 7) with the reproducibility values of the reference methods (Table 2). Since all models presented RMSEP values below or equivalent to the reproducibility value, the FTIR/PLS method could be considered accurate for predicting diesel parameters. In addition, the correlation coefficients were above 0.89, except for T85; thus, the predicted values were well correlated with the reference values (Figure 3).

Although the prediction model for sulfur content presented rval equal to 0.987, the obtained ARE value was high when compared to the others. The high relative errors that resulted in ARE equal to 14.10% were caused by the low sensitivity of the model for prediction of S500 diesel samples. However, the RPD value indicated that the model was accurate when the RMSEP value is compared to the sulfur content range of the validation sample set.

The determination coefficients (R2) indicated that the prediction models of flash point and T85 presented lower linearity than the other parameters. Figure 3 shows that, for these parameters, the residuals tend toward negative values as the reference value increases. Although the models have low bias values, the t-test revealed that there were systematic errors in the prediction models for sulfur content, T85, and biodiesel content. Since the models for sulfur and biodiesel content presented good linearity (R2val > 0.88), the presence of systematic errors can be reduced by the addition of more samples to the model.

The precision of the models was evaluated by the analysis of 30 diesel samples on 14 consecutive days. Although the samples were stored at 10°C between the analyses, the diesel fuel consists of semivolatile compounds and, therefore, changes in sample composition during the replicates acquisition imply an increase in measurement uncertainty. The intermediate precision values of the models were above the repeatability values of the reference methods, except for the biodiesel content prediction. However, the RSD values showed that almost all models had good precision (RSD below 1%) and only the models for prediction of flash point and sulfur content presented low precision.

Although the prediction models for flash point, sulfur content, and T85 have the limitations mentioned above, the conformity ranges of these parameters (Table 2) can be reliably met by the FTIR/PLS models since the accuracy and precision of the method are known. If an unknown sample is analyzed by FTIR and the result obtained is in the nonconformity range, it is recommended that the result be confirmed by the standard method. Since only about 2% of the diesel samples in Brazil presented nonconformities in 2017 [45], the FTIR/PLS method can be applied in routine monitoring of diesel quality to reduce the costs and time of analysis.

4. Conclusions

This study showed the possibility of applying ATR-FTIR spectroscopy with PLS regression method to predict the quality parameters (density; flash point; total sulfur content; distillation temperatures at 10% (T10), 50% (T50), and 85% (T85) recovery; cetane index; and biodiesel content) in commercial diesel samples.

All the evaluated preprocessing (derivatives, MSC, SNV, GLSW, and OSC) provided baseline correction and higher correlation coefficients. In addition, the GLSW and OSC preprocessing provided greater explained variance to the model using fewer latent variables. The selection of variables by iPLS or GA provided better predictive ability to the calibration models, except for T85. However, the difference between the results obtained by both algorithms was not significant.

According to the model validation, all PLS models presented acceptable accuracy when compared to the values of reproducibility and had good precision, except for sulfur content prediction of S500 diesel samples. Since the application of the ATR-FTIR/PLS method is able to reduce costs and increase considerably the analytical frequency, the diesel quality monitoring programs, as well as the final consumer, can benefit greatly from the application of the proposed method.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Capes, CNPq, and Fundunesp for providing academic scholarships and Cempeqc for providing financial support and the samples.

Supplementary Materials

The supplementary material shows the variables used in the PLS models that presented the best predictive abilities. The variable selection method used for each diesel property is also presented there. Supplementary Figure 1: spectra variables of GA-PLS model for density prediction. Supplementary Figure 2: spectra variables of GA-PLS model for flash point prediction. Supplementary Figure 3: spectra variables of FiPLS model for sulfur content prediction. Supplementary Figure 4: spectra variables of BiPLS model for T10 prediction. Supplementary Figure 5: spectra variables of GA-PLS model for T50 prediction. Supplementary Figure 6: spectra variables of PLS model for T85 prediction. Supplementary Figure 7: spectra variables of BiPLS model for cetane index prediction. Supplementary Figure 8: spectra variables of GA-PLS model for biodiesel content prediction.