#### Abstract

The type of crude oil used to produce asphalt has a decisive influence on the performance of the asphalt, including its aging resistance and durability. To discriminate and predict the crude oil source of different asphalt samples, a discrimination model was established using 12 markedly different infrared (IR) characteristic absorption peaks (CAPs) as predictive variables. The spectra were acquired by attenuated total reflectance-Fourier transform infrared spectroscopy (ATR-FTIR), and the model was built with fingerprint recognition techniques, namely principal component analysis (PCA) and multinomial logistic regression analysis. First, PCA was used to reduce the dimension of the 12 CAPs in the IR spectra of the asphalt samples, and the scores of the principal components were calculated for each sample. The principal component scores were then modelled by multinomial logistic regression to discriminate and predict the crude oil source of the asphalt samples. The logistic regression model showed a favourable goodness of fit, with a prediction accuracy of 93.9% for the crude oil source of the asphalt samples. The method is easy to operate and accurate, which is important for controlling the source and quality of asphalt and for improving its performance.

#### 1. Introduction

Asphalt, a black binding material produced from crude oil, is widely used as the binder in asphalt mixtures for pavements [1–3]. Because crude oils differ in origin and production mode, their properties lead to significant differences in the performance of the asphalts produced from them and, in turn, of the asphalt mixtures [4–8].

Asphalts of the same grade have very similar conventional properties; however, different asphalts differ considerably in high- and low-temperature performance, durability, and fatigue properties, which are external expressions of the chemical composition, molecular structure, and transformation of asphalt [9–11]. Studies show that the differences in the composition and structure of asphalt depend mainly on the crude oil source and on the refining process used in asphalt production. Because geological structure, oil generation conditions, and age differ, the nature and composition of crude oil vary greatly between regions, whereas crude oils from the same region have similar properties and composition and therefore similar processing, storage, and transportation options. In addition, most petroleum asphalt is currently produced by distillation, so the molecules in the asphalt retain their original state in the crude oil. Most of the composition and structure of asphalt are therefore inherited from the crude oil; that is, the structural performance of asphalt depends mainly on the crude oil source. Because asphalts are produced from different types of crude oil, the physical and chemical composition information of each asphalt is unique. Like a human fingerprint, the components that express the unique structure of an asphalt can be called its “oil fingerprint”. It is this uniqueness that makes it feasible to discriminate the oil fingerprints of asphalt from different crude oil sources [12–16].

At present, as the composition and structure of asphalt are extremely complex, the characterization of its structure requires more high-resolution and high-throughput analysis means and equipment, so there are few reports on the identification and analysis of asphalt oil fingerprints [17]. However, the identification and analysis of marine oil spill fingerprints has always been an issue of widespread concern. Similar to the method and purpose of identifying “oil fingerprints” of oil spills at sea, the purpose of recognising oil fingerprints of asphalt is to attain oil fingerprint information of asphalt through different methods such as physical, chemical, and biological methods [18]. Moreover, by applying multivariate statistical methods (including principal component analysis (PCA) and regression analysis), the chemical composition variables of oil fingerprints are summarised, classified, and discriminated [19, 20]. On this basis, qualitative and quantitative relationships between data are obtained to distinguish the crude oil source of asphalt, thus effectively controlling their qualities. Meanwhile, some testing methods used in the “oil fingerprint” identification of marine oil spills have been successfully used to analyse the composition and structure of asphalt [21–23]. For example, a gas chromatograph-mass spectrometer (GC-MS) was used to explore the chemical compositions of smoke released by asphalt materials during heating [24, 25]. Gel permeation chromatography (GPC) and thin-layer chromatography (TLC) were used to measure the molecular weights and the composition distributions of asphalt [26–28]. Nuclear magnetic resonance (NMR) and Fourier transform infrared spectroscopy (FTIR) were used to investigate the compositions, structures, and functional groups of asphalt [29, 30]. 
Among these analytical techniques, IR spectroscopy is the most widely used for investigating asphalt materials, because other methods (including GC-MS and NMR) are generally costly, damage the samples, and are laborious and time consuming. IR spectroscopy, in contrast, is label-free, rapid, nondestructive, and low-cost and requires only simple sample preparation [31–33]. However, in the studies above, the chemical structures of asphalt were analysed qualitatively, mainly for one or a few specific asphalt samples, and quantitative research into the types of asphalt is lacking. Research into discrimination of the types of asphalt, tracing of the production area, and quality control of asphalt has not yet been reported.

Therefore, in this study, attenuated total reflectance-Fourier transform infrared spectroscopy (ATR-FTIR) was used to discriminate and quantitatively analyse the characteristic functional groups of asphalt from different crude oil sources. Based on multivariate statistics, PCA and logistic regression analysis were conducted on the IR spectral data to establish a discriminant function. An accurate, nondestructive, and stable method of discriminating the crude oil source of asphalt samples was explored, which provides a scientific basis for the reasonable selection, quality supervision, and origin assurance of asphalt.

#### 2. Experimental Raw Materials and Methods

##### 2.1. Experimental Materials

For the experiment, 33 asphalt samples were purchased from asphalt producers in China. Before use, the samples were sealed in their original oxygen-free containers at 5°C to prevent oxidation, and none of the samples was processed further. As mentioned in Section 1, the differences in the “oil fingerprint” of asphalt are determined by the crude oil from which it is produced. Because geological structure, oil generation conditions, and age are the same within a region, the composition and chemical structure of crude oils from that region are also very similar. Therefore, the “oil fingerprints” of asphalt produced from crude oil from the same region are very similar. Examples are crude oil from the Middle East Gulf region (Saudi Arabia, Iran, Kuwait, Iraq, and the United Arab Emirates), crude oil from South America (Marry, Poscan, Maya, and Castilla), and crude oil from the Bohai Rim region of China (Bohai Bay, Huanxiling, and Caofeidian). The crude oil of the 33 asphalt samples came from these three regions, and the crude oil source of asphalt is accordingly divided into three categories: the Middle East, South America, and the Bohai Rim region of China. The basic performance measures (penetration ratio (ASTM D5), ductility ratio (ASTM D113), and softening point (ASTM D36)) of the asphalt and the crude oil source of the asphalt are listed in Table 1. Note that the last digit of the asphalt number listed in Table 1 represents different sampling batches of the same asphalt.

##### 2.2. FTIR Analysis

The IR spectra of the asphalt samples were recorded by ATR-FTIR (using a Cary 630 FTIR microscope). Within the range of 400–4,000 cm^{−1}, 64 scans were conducted, each at a resolution of 1 cm^{−1}. The samples were placed on a horizontal ATR crystal made of zinc selenide, in which the beam undergoes multiple reflections. After each measurement, the ATR crystal was cleaned using acetone.

The original spectral data were first subjected to baseline correction in the OMNIC software to eliminate baseline effects. Afterwards, the preprocessed spectra were standardised to eliminate the differences caused by the different masses of the samples.
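The standardisation step is not described in detail. One common choice for removing sample-quantity effects from spectra is the standard normal variate (SNV) transform; the sketch below shows it as an assumption only, not as the procedure actually used, and the array values are hypothetical.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Hypothetical data: two spectra that differ only by an offset and a scale factor,
# as might happen when different amounts of sample touch the ATR crystal.
base = np.array([0.2, 0.5, 0.9, 0.4, 0.1])
raw = np.vstack([base, 3.0 * base + 0.7])
corrected = snv(raw)
```

After the transform both rows coincide, which is the sense in which a purely multiplicative or additive sample effect is removed.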

##### 2.3. Multivariate Statistical Analysis

The infrared spectrum data were analysed by combining principal component analysis (PCA) with multinomial logistic regression analysis to establish the discrimination model for the crude oil source of asphalt. Logistic regression analysis is a multivariate method for analysing and predicting a categorical dependent variable from one or more continuous or categorical independent variables. Variable screening and parameter estimation require the independent variables to be independent of one another. In many studies, however, there is a certain degree of linear dependence between the variables, which is called multicollinearity. Multicollinearity may increase the mean square error and standard error of the estimated parameters, which makes the results of the logistic regression model unstable. Its main cause is the overlap of information between variables. PCA reduces this redundancy and eliminates multicollinearity by extracting mutually independent principal components from the explanatory variables.

For this reason, this study used a multinomial logistic regression model based on PCA to improve the discrimination accuracy of the model. First of all, the PCA was used to reduce the dimension of the CAPs variables of the infrared spectrum, so that the variables with strong correlation were integrated into the same principal components. The principal components were independent of each other; thus, the multiple collinear relationship between variables was eliminated. Then, by using these principal components as independent variables, the discriminant model of crude oil source of asphalt was obtained by logistic regression analysis.

###### 2.3.1. PCA Analysis

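PCA simplifies multidimensional data to a few composite variables (principal components) through dimension reduction. Each principal component reflects most of the information of the original variables, and the information contained in different components is not repeated. PCA can therefore compress large amounts of information and simplify complex problems [34]. For $n$ samples described by $p$ original variables, the modelling process of PCA is as follows:

(1) Calculation of the correlation coefficient matrix:

$$R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix},$$

where $r_{ij}$ ($i, j = 1, 2, \ldots, p$) refers to the correlation coefficient of original variables $x_i$ and $x_j$, $r_{ij} = r_{ji}$, which can be calculated by using the following formula:

$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}.$$

(2) Calculating eigenvalues and eigenvectors: The characteristic equation $|\lambda I - R| = 0$ was solved. Generally, the eigenvalues were calculated by using the Jacobi method and, in descending order, are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. The eigenvectors $e_i$ ($i = 1, 2, \ldots, p$) corresponding to eigenvalue $\lambda_i$ were separately calculated, satisfying $\lVert e_i \rVert = 1$, that is,

$$\sum_{j=1}^{p} e_{ij}^2 = 1,$$

where $e_{ij}$ denotes the $j$th component of vector $e_i$.

(3) Calculating contribution and cumulative contribution of principal components:

$$\frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}, \qquad \frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \qquad (i = 1, 2, \ldots, p).$$

In general, the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ with the cumulative contribution not lower than 70% are taken. $Y_1, Y_2, \ldots, Y_m$ ($m \le p$) are the corresponding first, second, …, $m$th principal components.

(4) Calculating the loads of principal components:

$$l_{ij} = \sqrt{\lambda_i}\, e_{ij} \qquad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, p).$$

(5) Scores of various principal components:

$$Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm} \end{pmatrix},$$

where $y_{ki}$ is the score of sample $k$ on the $i$th principal component.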
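The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `pca_correlation` and the demonstration data are hypothetical, `numpy.linalg.eigh` stands in for the Jacobi method mentioned in the text, and scores are computed on standardised variables, which is one common convention and may differ from the output of a particular statistics package.

```python
import numpy as np

def pca_correlation(X, min_cumulative=0.70):
    """PCA on the correlation matrix; keeps components up to the cumulative threshold."""
    X = np.asarray(X, dtype=float)
    # Step 1: correlation coefficient matrix of the p original variables (columns).
    R = np.corrcoef(X, rowvar=False)
    # Step 2: eigenvalues and unit-norm eigenvectors (eigh returns ascending order).
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 3: contribution and cumulative contribution of each component.
    contribution = eigvals / eigvals.sum()
    cumulative = np.cumsum(contribution)
    m = int(np.searchsorted(cumulative, min_cumulative) + 1)
    # Step 4: loads, l_ij = sqrt(lambda_i) * e_ij (one column per component).
    loads = eigvecs[:, :m] * np.sqrt(eigvals[:m])
    # Step 5: scores of each sample on the retained components.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    scores = Z @ eigvecs[:, :m]
    return eigvals, contribution, loads, scores

# Hypothetical data: 33 samples, 12 correlated variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(33, 2))
X_demo = latent @ rng.normal(size=(2, 12)) + 0.1 * rng.normal(size=(33, 12))
eigvals, contribution, loads, scores = pca_correlation(X_demo)
```

Because the correlation matrix of 12 standardised variables has trace 12, the eigenvalues sum to 12, and the variance of each score column equals its eigenvalue.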

###### 2.3.2. Logistic Regression Analysis

Logistic regression is a multivariate analysis method for investigating the relationship between binomial or multinomial observation results (dependent variable) and influencing factors (independent variables), and it belongs to the probabilistic nonlinear regression methods. When the dependent variable has two states, the model is a binomial logistic regression; when it has more than two states, the model is a multinomial logistic regression [35, 36]. For discriminating and classifying the crude oil of asphalt, multinomial logistic regression is applied, because the crude oil of the asphalt is sourced from three regions: the Bohai Rim region of China, South America, and the Middle East.

(1) Model fitting: For multinomial logistic regression, one level of the dependent variable is defined as the reference level. Each of the other levels is compared with it, so that $i - 1$ generalised logistic regression models are fitted ($i$ refers to the number of levels of the dependent variable). By taking a three-level dependent variable as an example, it is supposed that the values of the dependent variable are 1, 2, and 3 and that the probabilities corresponding to the values are $\pi_1$, $\pi_2$, and $\pi_3$, respectively. Based on $m$ independent variables $x_1, x_2, \ldots, x_m$, and taking, for example, level 3 as the reference level, two models are fitted as follows:

$$\ln\frac{\pi_1}{\pi_3} = \alpha_1 + \beta_{11} x_1 + \beta_{12} x_2 + \cdots + \beta_{1m} x_m,$$

$$\ln\frac{\pi_2}{\pi_3} = \alpha_2 + \beta_{21} x_1 + \beta_{22} x_2 + \cdots + \beta_{2m} x_m.$$

(2) Meaning of regression parameters: For multinomial logistic regression, each independent variable has $i - 1$ parameters. The parameter $\beta_{kj}$ represents the change in the log odds of level $k$ relative to the reference level when the independent variable $x_j$ changes by one unit while the other independent variables remain unchanged; $\exp(\beta_{kj})$ is the corresponding odds ratio (OR). Taking the logarithm of the odds yields the linear form of the logistic regression model.
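To make the link between the fitted log-odds models and class probabilities concrete, the following sketch converts the linear predictors into probabilities with the last level as the reference. The function name and all coefficient values are hypothetical and are not taken from the fitted model.

```python
import math

def class_probabilities(x, alphas, betas):
    """Baseline-category logit model; the last class is the reference level.

    alphas[k] and betas[k] are the intercept and slopes of non-reference class k.
    """
    logits = [a + sum(b * xi for b, xi in zip(bs, x)) for a, bs in zip(alphas, betas)]
    logits.append(0.0)  # reference class: ln(pi_ref / pi_ref) = 0
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients: three classes, two independent variables.
probs = class_probabilities(
    x=[1.0, -0.5],
    alphas=[0.2, -0.1],
    betas=[[0.8, 0.3], [-0.4, 0.6]],
)
predicted_class = probs.index(max(probs)) + 1  # classes numbered 1, 2, 3
```

The log of each probability ratio against the reference class recovers the corresponding linear predictor, which is a quick consistency check on any implementation.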

#### 3. Results and Discussion

##### 3.1. Establishment of Discrimination Indices for Crude Oil Source of Asphalt

FTIR is an important means of identifying organic compounds. When organic matter is irradiated with IR light, the molecules absorb the light, which leads to vibrational energy level transitions, and different chemical bonds or functional groups show different absorption frequencies. The contents of various materials are reflected in their IR absorption spectra, which can be quantitatively analysed according to peak location and absorption intensity. The structural composition of asphalt is complex, and asphalt shows significant differences in behaviour. For these reasons, the differences in behaviour of asphalt from different crude oils cannot be characterised effectively only by quantitatively comparing the peak areas of the IR spectra. Therefore, based on the shapes and locations of the peaks in the IR spectra, 12 significant characteristic absorption peaks (CAPs) were selected, and the transmittances at these peaks were analysed.

The IR absorption spectra of the 33 asphalt samples are similar. By using the mean value method, the common pattern of the IR spectra of all asphalt samples was constructed (Figure 1). The assignments of the 12 characteristic peaks are as follows: the strong absorption peaks around 2850 cm^{−1} and 2920 cm^{−1} are caused by the stretching vibration of C–H in methylene (–CH_{2}–) groups, and a very weak absorption peak around 1700 cm^{−1} is induced by the stretching vibration of C=O. Moreover, the vibration of the benzene ring leads to the absorption peak in the vicinity of 1600 cm^{−1}, and the absorption peaks at 1380 cm^{−1} and 1460 cm^{−1} are caused by the bending vibration of C–H in –CH_{3} and –CH_{2}– groups. The fingerprint region appears below 1300 cm^{−1}, in which the absorption peaks at 1166 cm^{−1} and 1032 cm^{−1} are caused by the stretching vibrations of C=S and S=O, respectively. The stretching vibration of CH results in a weak absorption peak around 969 cm^{−1}, while the absorption peaks at 872 cm^{−1} and 812 cm^{−1} are induced by vibrations of an isolated hydrogen and two adjacent hydrogen atoms on the benzene ring, respectively. Additionally, the absorption peak at 723 cm^{−1} is also caused by the vibration of –CH_{2}– groups (the rocking mode of long methylene chains).

##### 3.2. Analysis of Predictive Variables Based on Descriptive Statistics

The IR spectra of all asphalt samples are similar, and it is difficult to distinguish the asphalt samples by comparing the spectra alone. Hence, 12 significantly different CAPs were selected from the spectra, and the transmittances at these peaks were summarised with descriptive statistics. The samples are described in terms of central tendency (indices such as the mean and median) and degree of dispersion (indices such as extreme values) so as to summarise the spectral data (Table 2).

In Table 2, according to the analysis result of descriptive statistics on the transmittances of 12 CAPs, it can be seen that the asphalt produced by crude oil from the Bohai Rim region of China showed a larger transmittance. By contrast, the transmittances of asphalt produced by crude oil from the Middle East and South America were consistently low. However, it is impossible to distinguish the oil source of asphalt based on the descriptive statistics of infrared spectral transmittance of asphalt. Therefore, it is necessary to introduce multivariate statistical analysis methods, such as multinomial logistic regression analysis based on PCA described in Section 2.3.

##### 3.3. Correlation Analysis of Predictive Variables

Correlation analysis aims to explore the correlation among multiple variables, which is also an important parameter for evaluating the fingerprint variables of asphalt [37]. In order to further evaluate whether the selected 12 variables were of sufficient significance to the prediction model, a correlation analysis of the 12 CAP variables was required. Generally, correlation analysis is conducted by applying Pearson and Spearman correlation coefficients. The Pearson correlation coefficient is generally applicable to data satisfying a normal distribution, and the Spearman correlation coefficient is employed for data that do not satisfy a normal distribution. Therefore, before the correlation test, it is necessary to test the normal distribution of 12 variables to determine the appropriate correlation test method.

By using the skewness-kurtosis test method, whether the transmittances of the 12 CAPs of 33 asphalt samples conform to a normal distribution was assessed, and through the K-S test as an auxiliary analysis method, the accuracy of the test results was ensured [38, 39]. The 12 variables were processed by importing them into SPSS19 (Tables 3 and 4).

It can be seen from Table 3 that the values of skewness and kurtosis of transmittances of the 12 CAPs of all asphalt samples produced by three origins of oil fluctuate within a certain small positive and negative range around zero. It can be further seen from Table 4 that the asymptotic significances of the 12 variables all exceed 0.05. Moreover, based on the result of the skewness-kurtosis test, it can be considered that the 12 variables of 33 asphalt samples all conform to a normal distribution, which provides a basis for determining the method for testing correlation among variables. Therefore, the Pearson correlation coefficient is used to analyse the correlation between variables (Table 5).
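The normality checks described above can be illustrated with a short sketch. It uses simple moment estimators of skewness and excess kurtosis and computes only the one-sample K-S distance to a normal distribution with the sample mean and standard deviation; statistics packages typically apply small-sample corrections and also report a significance value, which this sketch does not reproduce. The function names and the sample values are hypothetical.

```python
import math
import numpy as np

def skewness_kurtosis(x):
    """Moment-based skewness and excess kurtosis (both near 0 for normal data)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    skew = np.mean(d ** 3) / m2 ** 1.5
    excess_kurt = np.mean(d ** 4) / m2 ** 2 - 3.0
    return float(skew), float(excess_kurt)

def ks_distance_to_normal(x):
    """K-S statistic D between the empirical CDF and a fitted normal CDF."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    z = (x - x.mean()) / x.std(ddof=1)
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    upper = np.arange(1, n + 1) / n - cdf   # ECDF just after each point
    lower = cdf - np.arange(0, n) / n       # ECDF just before each point
    return float(max(upper.max(), lower.max()))

# Hypothetical transmittance values for one CAP.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
skew, kurt = skewness_kurtosis(sample)
d_stat = ks_distance_to_normal(sample)
```

A small distance `d_stat` together with skewness and excess kurtosis close to zero is the pattern the text interprets as consistent with a normal distribution.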

As shown in Table 5, the IR CAP at 2850 cm^{−1} showed a significant correlation with those at 2920, 1460, and 723 cm^{−1}, respectively. Additionally, there are significant correlations between each IR CAP at 1700, 1600, 1460, 1380, 1166, 1032, 969, 872, 812, and 723 cm^{−1}. Moreover, multiple CAPs exhibited a high correlation. The aforementioned CAPs with high correlation covered all 12 CAPs. This showed that the 12 selected CAPs contained most of the fingerprint information about the asphalt, thus providing a basis for selecting variables capable of discriminating the different crude oil sources of asphalt.
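A Pearson correlation matrix of the kind summarised in Table 5 can be computed as in the sketch below. The data are hypothetical, and the helper `t_statistic` shows the usual test statistic for a single coefficient under the assumption of bivariate normality; the significance levels reported in the text come from the statistics package, not from this sketch.

```python
import math
import numpy as np

def pearson_matrix(X):
    """Pearson correlation coefficients between the columns of X."""
    return np.corrcoef(np.asarray(X, dtype=float), rowvar=False)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical transmittances: column 1 is a linear function of column 0.
X_demo = np.array([
    [1.0, 3.0, 2.0],
    [2.0, 5.0, 0.0],
    [3.0, 7.0, 1.0],
    [4.0, 9.0, 3.0],
])
R = pearson_matrix(X_demo)
t_example = t_statistic(0.5, 33)  # e.g. r = 0.5 observed on 33 samples
```

With 33 samples, even a moderate coefficient gives a sizeable t statistic, which is why many pairs of CAPs can be significantly correlated at once.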

##### 3.4. Establishment of Logistic Regression and Discriminant Model Based on PCA

###### 3.4.1. PCA on All Variables

According to the correlation analysis of the variables, the information contained in the 12 CAPs is partly redundant. PCA removes the repeated information while retaining the key information, thus realising dimension reduction. Furthermore, it makes the logistic regression and discrimination modelling more reliable by reducing the disturbance caused by accidental factors.

The transmittances of the 12 CAPs of the 33 asphalt samples were input into the SPSS19 software for PCA. The results are displayed in Table 6 and Figure 2. As shown in Table 6, there are three principal components whose eigenvalues exceed one. The first, second, and third principal components explain 77.658%, 15.498%, and 3.508% of the variance of the original variables, respectively. The cumulative variance contribution of the three principal components is 96.664% (research shows that there is a high explanation rate when the cumulative contribution is higher than 70%). It can be seen from the scree plot (Figure 2) that the line segments of the first three principal components are steep, whereas the later segments become shallower. This further indicates that it is appropriate to extract the three principal components (PCA1, PCA2, and PCA3). According to the correlation coefficients between the principal components and the original variables, the principal components *Y*_{1}, *Y*_{2}, and *Y*_{3} are expressed separately by formulae (9)–(11), in which the 12 original variables represent the transmittances of the CAPs at 2920, 2850, 1700, 1600, 1460, 1380, 1166, 1032, 969, 872, 812, and 723 cm^{−1}, respectively.
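The cumulative contribution quoted above can be checked with a few lines of arithmetic. The three percentages are those stated in the text; the 70% threshold is the one given in Section 2.3.1.

```python
# Explained variance (percent) of the first three principal components, as quoted.
explained = [77.658, 15.498, 3.508]

cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(round(running, 3))

# Number of components needed to reach the 70% cumulative threshold.
first_above_70 = next(i + 1 for i, c in enumerate(cumulative) if c >= 70.0)
```

The running totals are 77.658%, 93.156%, and 96.664%, so the 70% threshold is already met by the first component; retaining three components rests on the eigenvalue and scree-plot criteria described in the text.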

###### 3.4.2. The Process and Result of Multinomial Logistic Analysis

By substituting the transmittances of the 12 CAPs of 33 asphalt samples into formulae (9–11), the scores of the three principal components can be calculated (Table 7). Moreover, the scores of the principal components are taken as factors, and three kinds of origins of asphalt are considered as dependent variables. Among them, the crude oil from the Bohai Rim region of China is regarded as a reference group to establish a multinomial logistic regression model based on principal components. On this basis, the parameters of the three principal components used for the logistic regression model are obtained.

Based on the parameters obtained from the regression, the logistic regression model is given by formula (12), in which the three probabilities refer to the crude oil sources of asphalt (the Middle East, South America, and the Bohai Rim region of China) and *Y*_{1}, *Y*_{2}, and *Y*_{3} denote the first, second, and third principal components, respectively.

By substituting expressions (9), (10), and (11) into expression (12), the expression (formula (13)) characterising the relationship between the logistic regression model and the 12 variables can be acquired. During discrimination and prediction, the probabilities of the crude oil sources of an asphalt can be acquired by substituting the transmittances of its 12 CAPs. The maximum probability corresponds to the predicted origin.
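The substitution step can be written out in general form. The symbols below are assumptions introduced only for illustration: $X_l$ for the transmittance of the $l$th CAP, $a_{jl}$ for the coefficients of the principal component expressions, $b_{k0}$ and $b_{kj}$ for the logistic regression parameters, and $P_1$, $P_2$, $P_3$ for the probabilities of the Middle East, South America, and the Bohai Rim region (the reference group). No fitted coefficient values are implied, and any constant terms in the component expressions would simply be absorbed into $b_{k0}$.

$$Y_j = \sum_{l=1}^{12} a_{jl} X_l \quad (j = 1, 2, 3), \qquad \ln\frac{P_k}{P_3} = b_{k0} + \sum_{j=1}^{3} b_{kj} Y_j \quad (k = 1, 2).$$

Substituting the first expression into the second gives a model that is linear in the 12 transmittances:

$$\eta_k = \ln\frac{P_k}{P_3} = b_{k0} + \sum_{l=1}^{12} \Bigl( \sum_{j=1}^{3} b_{kj}\, a_{jl} \Bigr) X_l \quad (k = 1, 2).$$

The probabilities then follow from $P_1 + P_2 + P_3 = 1$:

$$P_k = \frac{e^{\eta_k}}{1 + e^{\eta_1} + e^{\eta_2}} \quad (k = 1, 2), \qquad P_3 = \frac{1}{1 + e^{\eta_1} + e^{\eta_2}},$$

and the predicted origin is the class with the largest $P_k$.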

Additionally, to validate whether the model is practically meaningful, it is necessary to test the goodness of fit, pseudo *R*-squared, and likelihood ratio of the model. The goodness-of-fit tests (the Pearson chi-square test and the deviance chi-square test) assess whether the model fits the original data. If the significance level exceeds 0.05, the fit is favourable. The pseudo *R*-squared values (Cox and Snell, Nagelkerke, and McFadden) indicate the degree to which the model explains the information contained in the original variables. The closer the value is to 1, the better the explanation. The likelihood ratio test measures the contribution of the original variables to the model. If the significance level is lower than 0.05, the contribution of the original variables is high.
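For reference, the three pseudo *R*-squared measures and the likelihood ratio statistic can all be derived from the log-likelihoods of the intercept-only model and the fitted model. The sketch below uses the standard definitions; the log-likelihood values are hypothetical and are not the values behind Table 8.

```python
import math

def fit_statistics(ll_null, ll_model, n):
    """Likelihood-ratio chi-square and pseudo R-squared values from log-likelihoods."""
    lr_chi2 = 2.0 * (ll_model - ll_null)
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp(-lr_chi2 / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value.
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * ll_null / n))
    return {
        "lr_chi2": lr_chi2,
        "mcfadden": mcfadden,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
    }

# Hypothetical log-likelihoods for a model fitted to 33 samples.
stats = fit_statistics(ll_null=-34.0, ll_model=-6.0, n=33)
```

The likelihood ratio statistic is compared with a chi-square distribution whose degrees of freedom equal the number of estimated slope parameters.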

According to the test result (Table 8) obtained through use of the logistic regression model, the goodness of fit, pseudo *R*-squared, and likelihood ratio of the model all satisfy test requirements. This indicates that the extracted principal components PCA1, PCA2, and PCA3 also retain key information about the data while effectively realising dimension reduction, which makes a significant contribution to the construction of the logistic regression model. The final result obtained through model regression is also meaningful.

###### 3.4.3. Validation of Discriminatory Effect of the Model

By taking IR CAPs of 33 original asphalt samples as verification samples, the discrimination effect obtained through the multinomial logistic regression model in multivariate statistical analysis was evaluated by applying formula (13). The discrimination result of multinomial logistic regression in multivariate statistical analysis is shown in Table 9.

As shown in Table 9, the discrimination accuracies for the 15, 12, and six asphalt samples produced from crude oil sourced from the Middle East, South America, and the Bohai Rim region of China are 93.3%, 91.7%, and 100%, respectively. The overall discrimination accuracy is 93.9%. This result shows that multinomial logistic regression analysis based on PCA can rapidly discriminate the origins of asphalt.
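The reported accuracies are consistent with the following counts. The numbers of correctly classified samples per region are inferred here from the stated percentages and sample sizes; they are an assumption, not values read from Table 9.

```python
# Sample sizes per crude oil source, as stated in the text.
samples = {"Middle East": 15, "South America": 12, "Bohai Rim": 6}
# Correctly classified samples, inferred from the reported percentages (assumption).
correct = {"Middle East": 14, "South America": 11, "Bohai Rim": 6}

per_class = {k: round(100.0 * correct[k] / samples[k], 1) for k in samples}
overall = round(100.0 * sum(correct.values()) / sum(samples.values()), 1)
```

Under this reading, 31 of the 33 samples are assigned to the correct region, which gives the overall accuracy of 93.9%.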

#### 4. Conclusions

Based on ATR-FTIR technology, the infrared spectra of 33 asphalt samples produced from crude oil from the Middle East, South America, and the Bohai Rim region of China were collected. Furthermore, the 12 selected CAPs of the infrared spectra were analysed by multivariate statistics. The overall accuracy of the PCA-based logistic regression model in discriminating asphalt produced from crude oil from the three regions reached 93.9%. The results indicate that the combination of ATR-FTIR spectral analysis and multivariate statistics can accurately and nondestructively discriminate between the different crude oil sources of asphalt. Moreover, the method is easy to operate, rapid, and accurate, which is important for controlling the origins and quality of asphalt and for improving its performance.

The method provided in this paper is suitable for identifying the oil source of base asphalt produced from crude oil from different regions and can also provide a reference for other kinds of asphalt, such as polymer-modified asphalt. However, the accuracy and applicability of this method need to be further improved. In particular, whether asphalt produced from a mixture of crude oils from different regions can be effectively identified needs further research.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was financially supported by the National Natural Science Foundation of China (no. 51608511), Key R&D Project of Shandong Province (2018GGX105013), and Project of Science and Technology Support for Youth Entrepreneurship in Colleges and Universities of Shandong Province (2019KJG004). The authors would like to acknowledge many coworkers, students, and laboratory assistants for providing technical help on instrument analysis.