Abstract

A quantitative structure-retention relationships (QSRRs) method is employed to predict the retention time of 300 pesticide residues in animal tissues separated by gas chromatography-mass spectroscopy (GC-MS). Firstly, a six-parameter QSRR model was developed by means of multiple linear regression. The six molecular descriptors that were considered to account for the effect of molecular structure on the retention time are number of nitrogen, Solvation connectivity index-chi 1, Balaban Y index, Moran autocorrelation-lag?2/weighted by atomic Sanderson electronegativity, total absolute charge, and radial distribution function-6.0/unweighted. A 6-7-1 back propagation artificial neural network (ANN) was used to improve the accuracy of the constructed model. The standard error values of ANN model for training, test, and validation sets are 1.559, 1.517, and 1.249, respectively, which are less than those obtained reveals by multiple linear regressions model (2.402, 1.858, and 2.036, resp.). Results obtained the reliability and good predictability of nonlinear QSRR model to predict the retention time of pesticides.

1. Introduction

Pesticides are used on a large scale for agricultural purposes. The adverse effects of pesticides on both human health and the environment are a matter of public concern. Thus, both the actual state and the transition of pesticide residues in various matrices including water, soil, and agricultural products should be extensively monitored. These researches should be undertaken using an efficient analytical system with a laborsaving and cost-effective device, as pesticides as well as applicable fields of research rang over a broad spectrum. Conventional sample preparation methods used to analyze pesticide residues in various matrices require expensive instrumentation, an expert analyst [14]. Besides the above mentioned, the experimental determination of chromatographic retention parameters of pestocides is time consuming and expensive. Alternatively, quantitative structure-retention relationship (QSRR) provides a promising method for the estimation of retention time based on descriptors derived solely from the molecular structure to fit experimental data. The advantages of this approach lie in the fact that it requires only the knowledge of chemical structure and is not dependent on any experiment properties.

QSRR studies [5, 6] started from the calculation and selection of descriptors, to finding their relation to retention times and derivation of mathematical models that involve these multivariate data in order to be used for predictive purposes in chromatographic system. Multivariate data consist of the results of observations of many different variables (molecular descriptors) for a number of individuals (molecules). Known methods for this include the multiple regression analysis, experimental design techniques, and nonlinear regression. The drawback, sometimes, of these very popular techniques is their inability to give highly predictive models due to hidden nonlinearity inside the data variables or the prerequisite to specify the mathematical model before the fitting of the data. So there is a need to improve further such kind of models in order to extract the most accurate prediction. To this end, artificial neural networks (ANNs) could be used successfully in QSRR studies providing better results than the conventional regression models.

Today artificial neural networks [710] have become an important modeling technique for QSAR and QSPR, and also this technique has been applied in numerous application areas of chemistry and pharmacy [1116]. The mathematical adoptability of ANN commends them as powerful tool for pattern classification and building predictive models. A particular advantage of ANNs is their inherent ability to incorporate nonlinear dependencies between the dependent and independent variables without using an explicit mathematical function. There are few reports about the application of QSRR in the chromatographic studies: D’Archivio et al. modeled the combined effect of solute structure and eluent composition on the retention behavior of 26 pesticides in isocratic reversed-phase high-performance liquid chromatography using multilinear regression and artificial neural networks [17]. In another study, they applied a six-parameter nonlinear QSRR model to predict the retention behavior of 26 pesticides including commonly used insecticides, herbicides, and fungicides as well as some metabolites in reversed-phase high-performance liquid chromatography [18]. Also Ghasemi and his coworkers used multiple linear regression and partial least squares regression to QSRR study of the gas chromatography retention time of 38 diverse chlorinated pesticides, herbicides, and organohalides by using molecular descriptors [19]. In the present study, the application of ANN is being described in order to predict accurately the retention time values of 300 pesticides in four groups with different molecular structures [20].

2. Methods

2.1. Dataset

Development of the multiple linear regression and artificial neural networks in the present work relies on a data set taken from reference [20]. This dataset (Table 1) consists of 300 pesticides in animal tissues such as beef, mutton, pork, chicken, and rabbit ranging in retention time from 5.62 to 35.77?min. All of these 300 pesticides divided into four groups, depending upon properties and retention time of each pesticide. Each group is consisting different kind of pesticides such as acaricides, insecticides, and Fungicides. To apply the ANN modeling, the dataset was randomly divided into three groups of training, test, and validation sets consisting of 256, 22, and 22 pesticides, respectively. The training set was used for the model generation. The test set plays a different role in the cases of the MLR and the ANN models. For the ANN model, this set was used for early stopping to optimize learning iteration and avoid overtraining. The validation set was used to assess the accuracy of the ANN predictions. On the other hand, in the case of the MLR model, the test set and the validation set were used to evaluate the model. As can be seen from Table 1, the pesticides in the test and validation sets were chosen in a way that adequately represents the training set in term of retention time.

2.2. Molecular Descriptors

All structures were generated with the HyperChem (Version 7) [21] and optimized with the classic potential MM+ included. Molecular geometry was optimized with the Austin Model 1 (AM1) method [22], and then the molecular descriptors were calculated by the software Dragon 3.0 [23]. Overall more than 1400 theoretical descriptors were calculated for each molecule by this software. These descriptors can be classified into several groups: 0D: constitutional descriptors; 1D: functional groups, atom-centered fragments, empirical descriptors and molecular properties; 2D: topological descriptors, molecular walk counts, BCUTs descriptors, Galvez topological charge indices, and 2D autocorrelations; 3D: aromaticity indices, Randic molecular profiles from the geometry matrix, geometrical, RDF, 3D-MORSE, WHIMs, and GETAWAYs descriptors. Molecular descriptor meanings and their calculation procedure are explained in Handbook of molecular descriptors by Todeschini. These molecular descriptors of different kinds were used to describe compound chemical diversity.

2.3. Regression Analysis

The main goal of the generation of the MLR model was to choose a set of suitable descriptors that can be used as inputs for construction of the ANN model. Linear models were formed by a stepwise selection of important descriptors and MLR model construction [24]. The best MLR model is one that has high correlation coefficient and ??-value, low standard error, and high prediction power. The statistics of the constructed MLR model is presented in Table 2.

2.4. Artificial Neural Network

A detailed description of theory behind artificial neural networks has been adequately described in several publications [2531]. An ANN program was written in MATLAB 7 in our laboratory. This network was feed-forward fully connected that has three layers with sigmoidal transfer function. Descriptors appearing in the MLR models were used as inputs of network and signal of the output node represent the retention time of interested compound. Thus, this network has six nodes in input layer and one node in output layer. The value of each input was divided into its mean value to bring them into dynamic range of the sigmoid transfer function of the network. The back-propagation algorithm was used for the training of the network [32, 33]. Before training the network, the parameters of the number of nodes in the hidden layer, weights and biases learning rates, and momentum values were optimized [3436]. Optimized values of these parameters were numbers of nodes in the input layer (=6), numbers of nodes in the hidden layer (=7), numbers of nodes in the output layer (=1), weights learning rates (=0.1), bias learning rate (=0.4), momentum (=0.5), and transfer function was sigmoid. The ANN-calculated values of permeability coefficient for training, test, and prediction sets, are shown in Table 1.

3. Results and Discussion

3.1. Analysis by Multiple Linear Regressions

The best MLR model (Table 2) for the training set includes six descriptors. These descriptors are number of nitrogen, Solvation connectivity index-chi 1, Balaban ?? index, Moran autocorrelation-lag?2/weighted by atomic Sanderson electronegativity, total absolute charge, and radial distribution function-6.0/unweighted.

The correlation between these descriptors is shown in Table 3. As shown in this table, there are no significant correlations between these descriptors.

The first descriptor with the larger mean effect is solvation connectivity index—chi 1 (?1sol) that defined in order to model solvation entropy and describe dispersion interactions in solution. Taking into account the characteristic dimension of the molecules by atomic parameters, they are defined as????????=?12???+1·?????+1??????????=1?????????????=1???????1/2????,(1) where ???? is the principal quantum number (2 for C, N, O atoms, 3 for Si, S, Cl, …) of the ??th atom in the ??th subgraph and ???? the corresponding vertex degree; ?? is the total number of ??th order subgraphs; ?? is the number of vertices in the subgraph. The normalization factor 1/(2)??+1 is defined in such a way that the indices ???? and ?????? for compounds containing only second-row atoms coincide. The first-order solvation connectivity index is 1????=14·?????????·????????????·???????1/2????,(2) where ?? runs over all the bonds; ???? and ???? are the principal quantum numbers of the two vertices incident to the considered bond. This index coincides with the Randic connectivity index 1?? for the hydrocarbons; ??=2 for all the atoms. These molecular descriptors are defined for an H-depleted molecular graph [37].

The positive sign for the mean effect of this descriptor reveals that molecules have higher numerical value of ?1sol, therefore, they have longer retention time. For example, HCH-epsilon (compound 164) with 6 chloride atoms in its structure has bigger ?1sol (7.75) than Allidochlor (compound 1) with one chloride atom in its structure (5.61) and also has longer retention time than Allidochlor (20.78 and 8.78?min, resp., Figure 1). Compounds that have more atoms in their structure have larger numerical value of ?1sol and so they have longer retention time, for example, compound 194 (Alpha-cypermethrin) has ?1sol value of 13.32 and retention time of 33.35?min but compound 4 (Chlormephos), which has less atoms and shorter structure than Alpha-cypermethrin, has ?1sol and retention time of 7.21 and 10.53?min, respectively (Figure 2). The second important descriptor is Balaban ?? index which was calculated by the same formula as the Balaban distance connectivity index ??, but by using atomic information indices instead of vertex distance degrees [38]. The ?? index is defined based on atomic information indices ???? calculated for vertices of a H-depleted molecular graph as follows:????=?????????-1????·??·log2??,(3) where ?? runs over all of the different topological distances from the ??th vertex, ?????? is the number of distances from the ??th vertex equal to ??, and ???? is the ??th atom eccentricity (i.e., the maximum topological distance from the considered atom) [38]. So, the Balaban ?? index is calculated as????index=·???+1edge(??,??)?????·?????-1/2,(4) where ?? is the number of bonds and ?? is the cyclomatic number. As the number of cycles, and branching in molecular structures increased (so ???? in (2) increased), the value of ?? index decreased. For example, compounds 70 (EPTC), 80 (Cis-Diallate), and 97 (Thiobenarb) which have amine group in their structures with retention time of 8.54, 14.75, 20.63?min have ?? index values of 2.27, 2.15, 1.02, respectively. This descriptor has negative sign for its mean effect. Therefore, as the numerical value of this descriptor increased the retention time of compounds decreased. For example in compounds 149 (Methacrifos), 27 (Methyl-parathion), 131 (Triazophos) which are organophosphorous compounds ?? index values are 2.08, 1.18, and 0.81, and retention times are 11.86, 20.82, and 28.23?min, respectively.

The total absolute charge is the other descriptor which, also known as the electronic charge index (ECI), is the sum of absolute charge over all atoms in a molecule and is a measure of molecule polarity [37]. Therefore, compounds with polar bonds have more numerical value of Qtot and have longer retention on a polar stationary phase than others. For example, compound 288 (Pirimiphos-methyl) has more retention and polarity (20.30?min and 10.14, resp.) than Dichlobenil (compound 72) that has Qtot of 1.02 and retention time of 9.75?min (Figure 3), also Compounds 9 (Ethalfuralin), 21 (Dinitramine), and 175 (Flumetralin) that contain CF3 and two nitro groups in their structure have retention time of 15, 19.35, 24.10, and Qtot value of 6.66, 6.86, and 7.33, respectively. The other molecular descriptors with lower mean effects are number of nitrogen atoms, Moran autocorrelation-lag?2/weighted by atomic sanderson electronegativity and radial distribution-6.0/unweighted. As number of nitrogen atoms increase, in molecule’s structure due to increasing the interaction of molecules with polar stationary phase, the retention time of compound increases. For example, in etridiazol, atrazine-desethyl, and fluquinconazole (compounds 3, 11, and 255, resp.) which have 2, 5, and 6 nitrogen atoms in their structure, their retention time is 10.42, 16.76, and 32.62, respectively. A Moran coefficient is a general index of spatial autocorrelation that, if applied to a molecular graph, can be defined as(???(??)=1/?)·????=1?????=1??????·?????-???·?????-????(1/??)·????=1?????-???2,(5) where ???? is any atomic property (here Sanderson electronegativity), ?? is its average value on the molecule, ?? is the atom number, ?? is a Kronoker delta (??????=1 if ??????=??, zero, otherwise). ? is the sum of the Kronoker deltas, that is, the number of vertex pairs at distance equal to ?? [39].

So the Moran autocorrelation-lag?2/weighted by atomic sanderson electronegativity can be a factor of electronegativity of moleculs. The last descriptor is radial distribution-6.0/unweighted. The 3D coordinates of the atoms of molecules can be transformed into a structure code that has a fixed number of descriptors irrespective of the size of a molecule. This task is performed by a structure coding technique referred to as radial distribution function code (RDF code) [40]. In general, there are some prerequisites for a structure code: independence from the number of atoms, that is, the size of a molecule, unambiguity regarding the three-dimensional arrangement of the atoms, and invariance against translation and rotation of the entire molecule.

Formally, the radial distribution function of an ensemble of ?? atoms can be interpreted as the probability distribution to find an atom in a spherical volume of radius ?? [41]. The equation represents the radial distribution function code as it is used in this investigation:??(??)=??·??-1????????>??????·????·??-??(??-??????)2,(6) where ?? is a scaling factor and ?? is the number of atoms. By including characteristic atomic properties ?? of the atoms ?? and ??, the RDF codes can be used in different tasks to fit the requirements of the information to be represented. The exponential term contains the distance ?????? between the atoms ?? and ?? and the smoothing parameter ?? that defines the probability distribution of the individual distances. ??(??) was calculated at a number of discrete points with defined intervals.

The atomic properties ???? and ???? used in this equation enable the discrimination of the atoms of a molecule for almost any property that can be attributed to an atom. Such distribution function provides, besides information about interatomic distances in a whole molecule, the opportunity to gain access to other valuable information, for example, bond distance, ring types, planar and nonplanar systems, and atoms types. This fact is a most valuable consideration for a computer-assisted code elucidation. The radial distribution function in this form meets the entire requirement mentioned above, especially invariance against linear translations. As RDF060u has negative sign for its mean effect, as molecules became larger their RDF factor in 6?Å radius reduces. For example Tebufenpyrad (compound 250) which has larger structure than Pyrimitate (compound 224) has lower RDF factor (18.79) than Pyrimitate (29.76); so the retention time of Tebufenpyrad is higer (29.06?min) than Pyrimitate (20.59).

From the above discussion, it can be seen that all descriptors involved in the QSRR model has physical meaning, and these descriptors can account for the structural features that affect the retention time of under studied pesticides.

3.2. Comparison of Neural Network and MLR Models

A graphical comparison of ANN and MLR analysis is given in Figure 4, where the retention time values calculated by means of the respective models are plotted against the experimental values. The statistical parameters obtained by ANN and MLR models for these sets are shown in Table 4. The standard errors of training, test and validation sets for the MLR model are 2.402, 1.858, and 2.036, respectively, which would be compared with the values of 1.559, 1.517, and 1.249, respectively, for the ANN model. Comparison between these values and other statistical parameters in Table 4 reveals the superiority of the ANN model over MLR ones. Figure 5 shows the plot of the residuals against the experimental values of retention time, for the ANN model. Since the residuals are propagated on both sides of the zero line, there is no systematic error in developing of ANN model.

4. Conclussions

Few structure-activity relationships involving pesticides have been published. In this study, we use MLR and ANN to predict the retention time of 300 pesticides that were different in molecular structure. The results of this study demonstrate that QSRRs method using ANN techniques can generate a suitable model for prediction of gas chromatographic retention of pesticides.

Also the results obtained in this work indicate that the regression and ANN models exhibit reasonable prediction capabilities. Descriptors which appeared in the obtained QSRR models reveal that electronic interactions as well as steric parameters can be affected on the gas chromatographic retention time of pesticides.