Abstract

The n-octanol/water partition coefficient (log ) is a useful parameter for the assessment of the environmental fate and impact of xenobiotic trace contaminants. Quantitative structure-activity relationship (QSAR) model for log  of polychlorinated biphenyls (PCBs) was analyzed by using the density functional theory at B3LYP/6-31G(d) level and the partial least squares (PLS) method with an optimizing procedure. A PLS model with reasonably good coefficient () and cross-validation test () values was obtained. All the predicted values are within the range of log unit from the observed values. The log  values of 7 PCBs in the test set predicted by the model are very close to those observed, indicating that this model has high fitting precision and good predictability. The PLS analysis showed that PCBs with larger electronic spatial extent and lower molecular total energy values tend to be more hydrophobic and lipophilic.

1. Introduction

Polychlorinated biphenyls (PCBs) are highly stable industrial chemical products that have been used as industrial fluids, flame retardants, diluents, hydraulic fluids, and dielectric fluids. Due to careless disposal practices and chemical stability, PCBs are some of the most prevalent pollutants in the total environment. Generally, a sum of 209 theoretical PCBs exist, and approximately 150 PCBs are found in the environment [1].

Despite the fact that the production and the use of PCBs have long been banned, they are still considered environmental contaminants as they are among the most widespread and persistent man-made products in the ecosystem. It is well known that these chemicals are toxic and lipophilic and tend to be bioaccumulated. Some interfere with mammalian and avian reproduction, and others disturb human’s germ cell development or are promoters of cancer. Some PCBs are included in the increasing list of environmental pollutants that are able to mimic estrogen action and are termed “environmental estrogens” [2].

PCBs are categorized as persistent organic pollutants (POPs) [3]. Individual PCB congeners exhibit different physicochemical properties and biological activities that result in different environmental distributions and toxicity profiles. The variable composition of PCB residues in environmental matrices and their different mechanisms of toxicity complicate the development of scientifically based regulations for the risk assessment [4]. The n-octanol/water distribution coefficient () is an important physicochemical property suitable for the characterization of the lipophilicity of the compounds such as PCBs. has been shown to correlate with aquatic solubility, toxicity, and bioaccumulation [57], and as such it is a useful parameter for the assessment of the environmental fate and impact of xenobiotic trace contaminants [8]. During the last 50 years, has proven to be a valuable tool in many areas of natural sciences: chemistry, biochemistry, biology, pharmacology, environmental sciences, and so forth. During that period, many hundreds or even thousands of scientific publications have been made in which the successful application of in correlation with many physical, chemical, or biological properties has been demonstrated for a large variety of organic chemicals. Unfortunately, many correlations have been also published in which the processes studied have nothing in common with the process of partition between water and n-octanol phases. It must be emphasized that in order to correlate, two properties do not have to be related; both of them may simply be related to a third property. The shake flask is the classical method for measuring , and this procedure will give reproducible and accurate results for chemicals with . Accurate results can also be obtained for more hydrophobic chemicals, but this will need a very precise and skillful handling and the use of very pure n-octanol. The problem with this procedure is that the complete phase separation is difficult to achieve, and this may lead to the underestimation of , particularly for the more hydrophobic chemicals [9].

It is impractical to measure the of all the PCBs directly in the laboratory because of the extremely low water solubility of some high-chlorinated PCBs. The lack of complete, reliable, and comparable data has led to the development of different estimation methods. With the advent of inexpensive and rapid computation, there has been a remarkable growth in the area of quantitative structure-activity relationships (QSAR), which correlate the properties of pollutants with relevant properties and molecular descriptors [11]. A large number of calculation methods have been presently developed for estimation of the partition coefficients with varying success and applicability. These methods are cable of predicting environmental behaviors not only for common environmental pollutants but also for those chemicals that have not yet been synthesized or those that cannot be examined experimentally due to their extremely hazardous nature. According to the descriptors used, these methods can be classified into two groups: empirical and theoretical methods [12]. Hansen et al. [1] used the degree of chlorination in the orthoposition of the PCB molecule as predictor variable together with either the calculated molecular total surface area (TSA) or the gas chromatographic relative retention times (RRT) to predict the and soil sorption coefficients () for all 209 PCB congeners which were based on 53 and 48 experimental data points, respectively. Also, reporting the relationship between and , which is promising for , could proceed with . However, TSA and RRT are both empirical molecular descriptors. Firstly, the accurate determination of empirical molecular descriptors may be difficult and expensive in terms of cost and time or even impossible for some PCBs which might have been not synthesized or purified. Meanwhile, the experimental errors may be introduced, especially for those congeners difficult to be separated and identified by chromatography. Furthermore, the empirical molecular descriptors often reflect complex and multiple physical interactions. Hence, the interpretation of the respective property could be difficult and ambiguous [13]. Many attempts have been made to model the physicochemical properties and bioactivities of PCBs by utilizing QSAR techniques [1421]. For instance, Qin et al. [17] reported the correlation of bioconcentration factors and molecular electronegativity distance vector (MEDV) of PCBs based on molecular structure. Tuppurainen and Ruuskanen [18] introduced a new QSAR descriptor electronic eigenvalue (EEVA) based on molecular orbital energies to predict the Ah receptor binding affinity of PCBs, polychlorinated dibenzo-p-dioxins, and dibenzofurans (PCDD/Fs). Daren [19] reported QSPR studies of , , , and of all 209 PCBs and biphenyl by the combination of genetic algorithms and PLS analysis.

Modern theoretical method in quantum chemistry such as density functional theory (DFT) with high calculation precision should have its advantages in estimating the properties of environmental toxics such as PCBs, PCDD/Fs [13, 22]. The aim of the present study is to explore QSAR model relating to of 27 PCBs by using partial least squares (PLS) analysis with optimizing procedure, based on reported data and quantum chemical descriptors computed by DFT contained in Gaussian 03, in an attempt to develop reliable and predictive QSAR model and identify the primary molecular factors governing the of PCBs.

2. Materials and Methods

2.1. Target PCBs

A total of 27 PCBs, whose and/or were previously published, were chosen to constitute the training set and the test set in this work. The training set consists of 20 of these 27 PCBs selected randomly, and the remaining 7 PCBs constitute the test set. Their chemical abstracts service numbers (CAS no.) and reported are listed in Table 1.

2.2. Calculation and Selection of the Descriptors

The molecular modeling system HyperChem (Release 7.0, Hypercube Inc. 2002) was used to construct and view all molecular structures. Molecular geometry was optimized, and quantum chemical descriptors were computed using the B3LYP hybrid functional of DFT in conjunction with 6-31G(d), a split-valence basis set with polarization function. The B3LYP calculations were performed using the quantum chemical computation software Gaussian 03 [23]. All calculations were run on an Intel Core i5-2410M CPU/2.30 GHz computer equipped with 2.00 GB of internal memory and Microsoft Windows 7 professional operating system.

According to the chemometrics theory, it is suggested that a QSAR model should include as many relevant descriptors as possible to increase the probability of a good characterization for a class of compounds [24]. In this study, 18 independent quantum chemical variables were selected for developing QSAR models. The descriptors cover the dihedral angle of the two benzene ring plans (DA), the eigenvalue of the highest occupied molecular orbital , the eigenvalue of the lowest unoccupied molecular orbital , the Mulliken atomic charges on 12 carbon atoms from C1 to C12 (, Figure 1 shows the numbering of selected atoms on the molecular structure), the electronic spatial extent (Re), the dipole moment (), and molecular total energy (TE). All the above descriptors were obtained directly in the output files of Gaussian 03 calculation through a single full optimizing process for the molecular structure: B3LYP/6-31G(d) FOPT. The values of these quantum chemical descriptors are listed in Table 2. The units of dihedral angle, atomic charge, electronic spatial extent, dipole moment, and energy are as follows: degree, atomic charge unit, atom unit, Debye, and Hartree, respectively.

2.3. Modeling Method and Evaluation Indexes

Since a big number of descriptors were selected in this study, intercorrelation of independent variables (multicollinearity) might become a technical problem. It was especially difficult when the number of independent variables was equal to or greater than the number of compounds selected in the training set. The common regression analysis frequently used in QSAR studies became ineffective because over-fitting readily occurred. To overcome these problems, we used the PLS regression, a well-known multivariate method which gives a stepwise solution for a regression model. It extracts principal component-like latent variables from original independent variables (predictor variables) and dependent variables (response variables), respectively [24]. This method can analyze data which are strongly collinear, and more importantly, it is a remedy to overfitting. PLS finds the relationship between a matrix Y (containing dependent variables, often only one for QSAR studies) and a matrix X (containing predictor variables) by reducing the dimension of the matrices while concurrently maximizing the relationship between them.

SIMCA-P (Version 10.5, Umetrics AB, 2004) software was employed to perform the PLS analysis. The default values given by the software were used as the initial conditions for computation. According to the user’s guide to SIMCA-P [25], the criterion used to determine the model dimensionality, the number of significant PLS components, is cross-validation (CV). With CV, the tested PLS component is thought significant if the fraction of the total variation of the dependent variables estimated by a component, , is higher than a significance limit (0.0975) for the whole dataset. The model is thought to have a good predicting ability when the cumulative for the extracted components, , is more than 0.500 [25]. Model adequacy was assessed based primarily on the number of PLS principal components (), , the squared correlation coefficient between observed values and fitted values (), the standard error (SE) of the estimate, the variance ratio of variance analysis (), and the significance level ().

3. Results and Discussion

3.1. Modeling and Optimizing

For the 20 PCBs contained in the training set, PLS analysis was performed with as the dependent variable and the 18 independent variables as the predictor variables, yielding QSAR model I. For this model, we have , SE = 0.145, , and . The fitting results for this model are listed in Table 3. In Table 3, and stand for the fraction of the sum of squares of all the ’s and ’s explained by the current component, and stand for the fraction of the cumulative variance of all the ’s and ’s explained by all extracted components, respectively, andEig stands for eigenvalue, which denotes the importance of the PLS principal component.

Variable importance in the projection (VIP) is a parameter that shows the importance of a variable in a model in the assistant analysis of PLS modeling. According to the manual of SIMCA-P, terms with large VIP (>1.0) are the most relevant for explaining dependent variables [25]. To obtain an optimal model, the following PLS analysis procedure was adopted. A PLS model with all the predictor variables was first calculated and then the variable with the lowest VIP was eliminated, and a new PLS regression was performed, yielding a new PLS model. This procedure was repeated until an optimal model was obtained. The optimal model was selected with respect to , , SE, , and . According to the statistics and metrics theories, a QSAR model with larger values of , and and smaller values of SE and tends to be more stable and reliable than in the opposite case.

The above described PLS analysis procedure with as dependent variable and the 18 independent variables as the initial predictor variables, for the 20 PCBs contained in the training set, resulted in model II. For this model, we have , SE = 0.151, , and . The fitting results for this model are shown in Table 3. As shown in the table, based on the estimated indexes, model I is better than model II, indicating all that the selected 18 independent variables are necessary for the PLS modeling. In model I, 2 PLS principal components were selected in the model, which explained 69.1% of the variance of the predictor variables and 99.2% of the variance of the dependent variable. Based on the estimated indexes employed in this study, this model is statistically significant.

3.2. Analysis and Discussion

Based on the optimal model I, the predicted for the 20 PCBs contained in the training set are as listed in Table 1. A comparison of the observed and predicted is shown in Figure 2. As can be seen in Table 1 and Figure 2, the predicted values are within the range of ±0.3 log unit from the observed values. The observed values are close to those predicted by the optimal model, and the correlation between observed and predicted is significant. For the PCBs studied, the differences between observed and predicted are very small, so the predicted values are acceptable. Note that the cross-validated value (=0.988) of the optimal model is much larger than 0.500; it would seem that this model is stable and has good prediction ability and can be used to predict the of PCBs effectively.

When and are unscaled and uncentered, the unscaled coefficients of the independent variables and the constant, transformed from the PLS results, are listed in Table 4. The signs of the coefficients of the independent variables in the model enable one to see the direction in which each affects the of PCBs. Based on the unscaled coefficients, a QSAR equation like that obtained from multiple linear regression can be created for the optimal model, and the predicted for other PCBs can be estimated. Based on model I, predictions of for the 7 PAHs contained in the test set were generated and listed in Table 1, and the comparison of observed and predicted is shown in Figure 2. The correlation coefficient between observed and predicted in the test set was as high as 0.995, larger than that in the training set. As can be seen in Table 1 and Figure 2, the predicted for the test set are very close to those observed. This indicates that the PLS model has good predictive ability.

The VIP values for the optimal model were calculated and listed in Table 4. We can find that the VIP values of TE, Re, , and are all greater than 1.0, indicating that they are important in governing the values of PCBs. TE and Re are the two most important variables since their VIP values are larger than 1.40, which is greater than the VIP values of the other variables. It is expected that increases with the increase of molecular size. Considering the following examples: 4-chlorobiphenyl (, molecular weight (MW) = 188.5), -dichlorobiphenyl (, MW = 223.0), -trichlorobiphenyl (, MW = 257.5), -tetrachlorobiphenyl (, MW = 292.0), which differ between themselves by the number of chlorine substituents, the increase in as a function of the number of chlorine substituents and molecular sizes is clearly seen. When H atom is replaced by Cl atom on the biphenyl, value will be increased [26]. So, the more chlorine atoms on the biphenyls molecular, the greater the value. When isomers share the same molecular weight, molecular sizes differ from the arrangements of chlorine atoms, which can be expressed by molecular volume and surface area, leading to the difference in . 2-Chlorobiphenyl (), 3-chlorobiphenyl (), and 4-chlorobiphenyl () are isomers (MW = 188.5), but they have different n-octanol/water partition coefficients.

The variety of molecular size could be described by the quantum chemical descriptors of TE and Re in this work. Firstly, for the whole compounds set in this study, the decrease of TEvalues indicates that the molecule contains more chlorine atoms, which might lead to the increase of molecular volume. Zou et al. [27] found that molecular volume and molecular surface area were highly correlated for PCBs sharing similar structures. So, the decrease of TE values might lead to the increase of molecular surface area. Molecular volume and molecular surface area are measures of molecular size. In general, the bigger the molecular size, the stronger the effect of the intermolecular dispersion force. On the other hand, the molecular volume is a measure of absorbed energy of the molecule when it forms the hole in a solvent. Due to stronger polar of water molecules contributing to stronger binding energy, the larger solute molecules need to absorb more energy to destroy the binding between water molecules when distributing in the aqueous phase, resulting in tending to be allocated in weakly polar solvents. It means that the decrease of TE values might lead to larger which is consistent with the study of Zou et al. [27]. Secondly, Re is the expectation value of the operator in the formula of . It is occasionally seen in the literature as a measure of molecular volume. Although in many cases Re correlates poorly with molecular volume, the correlation for PCBs sharing similar structures is quite significant. A PCB molecule with large electronic spatial extent will have large molecular volume. Therefore, it can be concluded that PCBs molecules with lower molecular total energy values and larger electronic spatial extent tend to be more hydrophobic and lipophilic, leading to larger values. This implies that the decrease of the TE value and the increase of the Re value lead to an increase in value.

The value of is in inverse ratio to , and ; that is, large value of and would result in small value of for PCBs. This is because represents the proton acceptance ability in forming hydrogen bond, while represents the proton donation ability in the formation of hydrogen bond. Therefore, the compounds with large values of and tend to donate or accept protons easily. In this case, hydrogen bond can be easily formed between PCBs and H2O molecules, and PCBs are able to enter water phase, resulting in low value [13].

3.3. Comparison with the Results from Other Models

In [13], there are 3 variables (, , and ) obtained from B3LYP/6-31G(d) level in the model, and the squared correlation coefficient is , less than that of the optimal model in the present study. It was found that there are high correlations relating the number of Cl substitution position () to and . And, taking as theoretical descriptors to correlate the three-variable models for predicting of PCBs, the achieved new model is having and exhibiting optimum stability, convenience for use, and better predictive power than that from AM1 method [28]. In contrast, the squared correlation coefficient obtained by PLS in the present study is obviously larger than that reported.

Taking molar weight () and melting point (, °C) as independent variables, Wang et al. [10] had developed the binary linear regression equation whose predicted are also listed in Table 1, which seems superior to the optimal model I in the present study obtained solely on quantum chemical descriptors. However, the variable melting point in the model equation is an experimental physicochemical parameter. It is impractical to determine the melting points of all 209 PCBs directly in the laboratory which makes the model become of limited applicability. Therefore, our optimal model based only on quantum chemical descriptors has wider applicability than that based on known physicochemical properties. The overall high quality of the model obtained in this study indicates that it will find application in the estimation of of PCBs for which experimental values are lacking.

4. Conclusions

In this work, based on some quantum chemical descriptors computed by DFT at the B3LYP/6-31G(d) level, by the use of PLS analysis, QSAR models were obtained for logarithmic n-octanol/water partition coefficients of PCBs. The squared correlation coefficient of the optimal model is 0.992. The optimal model has high fitting precision and good predictability, so it can be applied in the estimation of the n-octanol/water partition behavior of PCBs. It can generally be concluded that PCBs with larger electronic spatial extent and lower molecular total energy values tend to be more hydrophobic and lipophilic.

Conflict of Interests

The authors declare no conflict of interests.

Acknowledgments

This study was financially supported by the National Natural Science Foundation of China (21007017), the Program for New Century Excellent Talents in University (NCET-12-0199), the Specialized Research Fund for the Doctoral Program of Higher Education (20100172120037), the China Postdoctoral Science Foundation (201003350), and the Guangdong Natural Science Foundation (9351064101000001).