#### Abstract

(E)-N-Aryl-2-ethene-sulfonamide and its derivatives are potent anticancer agents that inhibit cancer cell proliferation. A quantitative structure-activity relationship (QSAR) study was carried out on 40 compounds based on (E)-N-Aryl-2-ethene-sulfonamide in order to predict their anticancer activity. Principal component analysis was used to reduce the dimensionality of the data matrix, and multiple linear regression (MLR) and multiple nonlinear regression (MNLR) were used to model the relationship between the molecular descriptors and the anticancer properties of the sulfonamide derivatives. The MLR and MNLR models were validated by dividing the dataset into a training set and a test set; the external-validation correlation coefficients were R_{pIC50} = 0.81 for MLR and R_{pIC50} = 0.91 for MNLR. The artificial neural network (ANN) gave a correlation coefficient close to 0.96, indicating that this model performs better than the other two. The ANN model was further checked by two methods: leave-one-out (LOO) cross-validation and scrambling (Y-randomization). A high correlation between experimental and predicted activity values was observed, supporting the validity and quality of the derived QSAR model.

#### 1. Introduction

Cancer is a major public health problem. The number of new cancer cases in 2012 was estimated at 14.1 million, with 8.2 million deaths. It was estimated in 2008 that more than 70% of the fatalities due to cancer occurred in developing countries. The frequency of cancers may go up by 50%, with 15 million new cases per year in 2020 [1, 2].

It is projected that by 2030 the number of fatalities due to cancer will rise to 13.1 million. Although incidence is increasing in most regions of the world and is highest in the more developed regions, mortality is relatively higher in developing countries because of limited access to treatment and the absence of early detection [3, 4].

One in two men and one in three women are affected by cancer. The distribution of cancers by age has shown the same trend since registration began: there is a high number of cancer cases among women, given that the onset of cancer in women occurs at about 39 years of age, compared with 49 years in men. The number of cancer cases decreases from the age of 65 in women, whereas it begins to increase in men at this age [5].

Scientific advances of recent years now make it possible to decipher the genetic code of cancer and to understand how this disease is related to the mechanisms of life itself. For this reason, cancer will probably never be completely eradicated, and basic research is redirected toward a pharmaceutical industry that provides a plethora of increasingly targeted molecules, ever more efficient and ever more expensive, to be combined in the manufacture of medicines.

Many sulfonamide derivatives have been reported to show significant antitumor properties. The sulfonamides were the first widely used antibacterial medications, and they are systematically used as preventive and chemotherapeutic agents against various diseases. Over 30 medicines containing this functional group are used clinically, including the antihypertensive bosentan; antibacterial, antiprotozoal, antifungal, and anti-inflammatory agents; and nonpeptide antagonists of vasopressin receptors [6–8].

Sulfonamides are compounds with the general structure shown in Figure 1. After the discovery of sulfanilamide, thousands of chemical modifications were studied, and the best therapeutic results were obtained from compounds in which one hydrogen of the SO_{2}NH_{2} group was replaced by a heterocyclic ring [9, 10]. To date, more than twenty thousand sulfanilamide derivatives have been synthesized. These syntheses have led to the discovery of new compounds whose pharmacological properties vary with the substituents on this main structure; R and R_{1} may be hydrogen, alkyl, aryl, heteroaryl, and so forth [11, 12].

Sulfonamides owe their anticancer activity to their action on a variety of pharmacological targets. Drug discovery is a long and complex process, and chemoinformatics can contribute at different stages of it. Among chemoinformatics techniques are QSAR methods, which seek a correlation between the biological activity measured for a panel of compounds and a set of molecular descriptors. Quantitative structure-activity relationship (QSAR) methodology is an essential tool in medicinal chemistry. Modern QSAR emerged in the 1960s, but the first investigations correlating biological activity with physicochemical properties began nearly 60 years earlier, with the important work of Overton and Meyer linking aquatic toxicity to lipid-water partitioning. In 1962 came the seminal work of Corwin Hansch and colleagues, which aroused great interest in predicting biological activities. Since 2011, the number of QSAR studies has grown, with more than 1,400 publications per year [13]. QSAR techniques rest on the principle that similar structures have similar properties: the more the molecules differ, the harder it is to correlate their physicochemical properties with biological activity, and the more similar they are, the easier it is [14]. The application of QSAR to molecular modeling and drug design has provided a promising approach in computational chemistry. It deals with determining the quantitative correlation between molecular structures and their activities using various chemometric tools. The prime importance of the QSAR technique lies in its ability to determine the structural requirements essential for molecules to exhibit a given response and to predict the activity of untested molecules, followed by the design of virtual libraries [11].
QSAR studies have also been reported to identify the structural features responsible for anticancer activity [12]. Quantitative structure-activity relationships are a significant factor in contemporary drug design, so it is unsurprising that a great number of QSAR users [13, 14] work in industrial research units. Classical QSAR and 3D-QSAR are highly active areas of research in drug design [15, 16]. The basis of the different QSAR methods is the description of molecular structures by means of numbers, and a large number of molecular descriptors are now available for QSAR studies [17–19].

Our objective is to present fundamental and original research on molecules built on the sulfonamide core, in order to establish the relationship between the structure and the activity of the active chemical substance and its derivatives.

In this study, principal component analysis (PCA), multiple linear regression (MLR) analysis, multiple nonlinear regression (MNLR) analysis, artificial neural network (ANN) calculations, cross-validation, and scrambling (Y-randomization) are applied to a series of (E)-N-Aryl-2-ethene-sulfonamide inhibitors in order to set up a reliable 3D-QSAR model for predicting anticancer activity.

#### 2. Materials and Methods

##### 2.1. Experimental Data

In the present study, we chose 40 substituted derivatives of (E)-N-Aryl-2-ethene-sulfonamide whose anticancer activities are reported in the literature by Shiri et al. [20]. For the 3D-QSAR study, the reported IC50 values were converted into pIC50 by taking the negative logarithm (pIC50 = −log10 IC50), and pIC50 was subsequently used as the dependent variable for the 3D-QSAR model development. Figure 1 presents the basic structure of the studied sulfonamides, and Table 1 shows the studied compounds with their corresponding experimental activities and pIC50 values.
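The conversion to pIC50 can be sketched as follows, using the usual definition pIC50 = −log10(IC50 expressed in mol/L). The concentration unit of the reported IC50 values is not stated here, so the `unit` argument and its micromolar default are assumptions, and the function name is hypothetical.

```python
import math

# Conversion factors to mol/L; the micromolar default is an assumption,
# since the unit of the reported IC50 values is not stated in the text.
_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be strictly positive")
    return -math.log10(ic50 * _TO_MOLAR[unit])


# Hypothetical values, not taken from Table 1.
for value in (0.5, 1.0, 25.0):
    print(value, "uM ->", round(pic50_from_ic50(value), 3))
```

A lower IC50 gives a higher pIC50, so more potent compounds have larger values of the dependent variable.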

##### 2.2. Computational Methods

Density functional theory (DFT) was used in this study to compute the various physicochemical properties, in order to search for the best correlation among them. The 3D structures of the molecules were generated using GaussView 3.0, and all quantum calculations were carried out using Gaussian 03. Geometry optimization and the physicochemical properties of the 40 compounds were computed with the B3LYP functional and the 6-31G basis set [21, 22].

##### 2.3. Calculation of the Molecular Descriptors

Before any modeling, it is necessary to calculate or measure a large number of different descriptors, because the mechanisms that determine a molecule's activity or one of its properties are often poorly understood. We must therefore select, among these variables, the ones most relevant to the model. This selection is done using multiple linear regression. The optimized molecules were used to calculate a number of electronic descriptors: dipole moment (DM), frontier orbital energies (E_{HOMO} and E_{LUMO}), total energy, and repulsion energy (RE) [23].

ChemBio Office (2015) was used to calculate the following parameters: molecular weight (MW), partition coefficient (log P), number of hydrogen bond acceptors (HA), and number of hydrogen bond donors (HD) [24–26].

ChemSketch program was used to calculate the following parameters: molar volume (MV (cm^{3})), molar refractivity (MR (cm^{3})), parachor (Pc (cm^{3})), density (g/cm^{3}), refractive index, surface tension (dyne/cm), and polarizability (cm^{3}) [27, 28].

##### 2.4. Statistical Analysis

To establish the structure-activity relationship for the 40 studied molecules, 16 descriptors were calculated using Gaussian 03, ChemOffice 2012, and ChemSketch. The procedure of the study is as follows.

Principal component analysis (PCA) [29] was performed with XLSTAT, version 2015 [30], as a first step toward predicting the anticancer activity pIC50. It is a statistical method that condenses the information encoded in the structures of the compounds into a small number of components, and it is very helpful for understanding the distribution of the compounds. It is an essentially descriptive method that aims to present, in graphical form, as much as possible of the information contained in the data listed in Tables 1 and 2.
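As an illustration of what such an analysis computes, the sketch below performs a correlation-matrix PCA with NumPy and reports the share of variance carried by each principal axis. It is not the XLSTAT procedure itself; the function name is hypothetical and the random matrix merely stands in for the 40 × 16 descriptor table.

```python
import numpy as np


def pca_explained_variance(X):
    """Eigenvalues of the correlation matrix and the variance share of each axis."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)        # descriptors are columns
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # sorted from largest to smallest
    return eigvals, eigvals / eigvals.sum()


# Placeholder data: 40 molecules x 16 descriptors (random, not the real table).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 16))
eigvals, ratio = pca_explained_variance(X_demo)
print("variance share of the first three axes (%):", np.round(100 * ratio[:3], 2))
print("cumulative share (%):", round(100 * ratio[:3].sum(), 2))
```

Working on the correlation matrix rather than the covariance matrix puts descriptors with very different units on the same scale, which is the usual choice for mixed physicochemical descriptors.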

Multiple linear regression is a statistical technique used to study the relationship between one dependent variable and several independent variables. It fits the model by minimizing the squared differences between actual and predicted values. It also serves to select the descriptors that are used as input parameters in the multiple nonlinear regression (MNLR) and in the artificial neural network (ANN).
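A minimal sketch of an ordinary least-squares MLR fit, together with the usual goodness-of-fit statistics, is given below. It uses NumPy rather than XLSTAT, the function name and synthetic data are hypothetical, and the MSE is taken here as the plain mean of squared residuals, which may differ from the degrees-of-freedom-corrected value a statistics package reports.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least-squares fit of y = b0 + b1*x1 + ... + bp*xp."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])             # add the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "coef": coef,
        "r2": r2,
        "mse": ss_res / n,                            # mean of squared residuals
        "mae": np.mean(np.abs(y - pred)),
        "F": (r2 / p) / ((1.0 - r2) / (n - p - 1)),   # Fisher F-statistic
    }


# Synthetic example: 32 "molecules", 2 descriptors, small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.01, size=32)
fit = fit_mlr(X_demo, y_demo)
print(np.round(fit["coef"], 3), round(fit["r2"], 4))
```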

We also used a nonlinear regression model to improve the structure-activity relationship and to quantify the effect of the substituents. It was applied to the data matrix formed by the descriptors proposed by MLR for the 39 molecules of the training set. The coefficients R and R^{2} and the F-values are used to select the best regression performance. We used a preprogrammed function of XLSTAT of the following form:

$$y = a + (b x_{1} + c x_{2} + d x_{3} + e x_{4} + \cdots) + (f x_{1}^{2} + g x_{2}^{2} + h x_{3}^{2} + i x_{4}^{2} + \cdots), \tag{1}$$

where a, b, c, d, … represent the parameters and x_{1}, x_{2}, x_{3}, x_{4}, … represent the variables.
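For illustration, the sketch below fits one plausible nonlinear form: a second-order polynomial in each descriptor with no cross terms. That functional form is an assumption made for this example, as are the function names and the synthetic data. Because such a model is linear in its parameters, it can be fitted by ordinary least squares on an augmented design matrix.

```python
import numpy as np


def _design(X):
    """Design matrix [1, x1..xp, x1^2..xp^2] (no interaction terms)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X, X ** 2])


def fit_quadratic_no_cross(X, y):
    """Least-squares estimate of the intercept, linear and squared-term parameters."""
    params, *_ = np.linalg.lstsq(_design(X), np.asarray(y, dtype=float), rcond=None)
    return params


def predict_quadratic_no_cross(params, X):
    return _design(X) @ params


# Synthetic, noise-free example with two descriptors.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(32, 2))
y_demo = 0.5 + X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.3 * X_demo[:, 0] ** 2
params = fit_quadratic_no_cross(X_demo, y_demo)
y_hat = predict_quadratic_no_cross(params, X_demo)
print(np.round(params, 3))
```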

All the feedforward neural networks used in this paper are three-layer networks. The first (input) layer comprises six neurons, representing the pertinent descriptors obtained by the MLR technique [31, 32]. Although there are neither theoretical nor empirical rules for determining the number of hidden layers or the number of neurons in each layer, one hidden layer seems to be sufficient in most chemical applications of ANN. Some authors [33, 34] have proposed a parameter *ρ* that helps determine the number of hidden neurons and therefore plays a major role in choosing the best ANN architecture. It is defined as follows: *ρ* = (number of data points in the training set)/(sum of the number of connections in the NN). In order to avoid overfitting or underfitting, it is recommended that 1 < *ρ* < 3 [35]. Thus, the ANN used in this work has two hidden neurons, and the output layer represents the calculated activity values [35, 36].
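The parameter *ρ* is simple to evaluate for a given architecture. The sketch below does so for a 6-2-1 network, using 32 training molecules (the figure quoted in Section 3.6). Whether bias terms count as connections is not specified in the text, so the `include_bias` switch is an assumption, and the function names are hypothetical.

```python
def count_connections(layer_sizes, include_bias=False):
    """Number of weights in a fully connected feedforward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out
        if include_bias:
            total += n_out        # one bias per receiving neuron
    return total


def rho(n_train, layer_sizes, include_bias=False):
    """rho = training points / number of connections."""
    return n_train / count_connections(layer_sizes, include_bias)


architecture = [6, 2, 1]          # input, hidden, output neurons
print(round(rho(32, architecture), 2))
print(round(rho(32, architecture, include_bias=True), 2))
```

Adding hidden neurons increases the number of connections and lowers *ρ*, which is why the ratio constrains the size of the hidden layer for a fixed training set.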

Cross-validation is one of the best-known approaches to selecting regression models; it is based on minimizing the cross-validation (CV) criterion of Stone [37] over an appropriate class of candidate models. It is particularly well motivated when prediction (or, similarly, estimation of the unknown regression function) is the aim of the statistical analysis [38].
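A leave-one-out version of this idea can be sketched as follows for a linear model: each molecule is left out in turn, the model is refitted on the rest, and the left-out activity is predicted. The cross-validated coefficient is computed here as Q^{2} = 1 − PRESS/SS_tot, one common convention; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def loo_q2(X, y):
    """Leave-one-out predictions and Q^2 = 1 - PRESS / SS_tot for a linear model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                     # drop molecule i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        preds[i] = A[i] @ coef                       # predict the left-out molecule
    press = np.sum((y - preds) ** 2)
    q2 = 1.0 - press / np.sum((y - y.mean()) ** 2)
    return q2, preds


# Synthetic example with 32 "molecules" and 2 descriptors.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(32, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=32)
q2, loo_preds = loo_q2(X_demo, y_demo)
print(round(q2, 3))
```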

An additional procedure, called scrambling, is used to validate the model and to verify that the obtained model is not the result of chance. In some cases, the built models perform worse on the test set, which can be explained by the quality of the descriptors retained in those models. To assess the role of chance in the models built, a technique called scrambling or Y-randomization is used [39].
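The principle of Y-randomization can be sketched as below: the activities are shuffled many times while the descriptors are kept fixed, the model is refitted each time, and the resulting R^{2} values are compared with that of the unscrambled model. A linear model is used here for brevity; the function names, the number of trials, and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def r2_linear(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    res = y - A @ coef
    return 1.0 - (res @ res) / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_trials=100, seed=0):
    """R^2 of the true model and of models refitted on shuffled activities."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r2_true = r2_linear(X, y)
    r2_scrambled = np.array(
        [r2_linear(X, rng.permutation(y)) for _ in range(n_trials)]
    )
    return r2_true, r2_scrambled


rng = np.random.default_rng(3)
X_demo = rng.normal(size=(32, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=32)
r2_true, r2_scrambled = y_randomization(X_demo, y_demo)
print(round(r2_true, 3), round(r2_scrambled.mean(), 3), round(r2_scrambled.max(), 3))
```

When the original model reflects a genuine structure-activity relationship, the scrambled models are expected to score far lower than the original one.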

#### 3. Results and Discussion

##### 3.1. Dataset for Analysis

The QSAR study was carried out on a series of 40 substituted derivatives of (E)-N-Aryl-2-ethene-sulfonamide in order to determine a quantitative relationship between the structure and the anticancer activity. The values of the 16 descriptors are shown in Table 2. The results obtained for 3D-QSAR using MLR, MNLR, ANN, CV, and Y-randomization are presented in Tables 3 and 4.

##### 3.2. Principal Component Analysis

The full set of 16 descriptors coding the 40 molecules was submitted to principal component analysis (PCA) [32]. Sixteen principal components were obtained (Figure 2).

The first three principal axes are sufficient to describe the information provided by the data matrix. Indeed, the percentages of variance are 43.24%, 21.12%, and 13.00% for the axes F1, F2, and F3, respectively, so these three axes together account for 77.36% of the total information. Table 5 shows the Pearson correlation matrix obtained between the different descriptors.

The projections of the studied molecules were analyzed in the Cartesian plane, which accounts for 60.57% of the total variance.

The Pearson correlation coefficients are summarized in Table 4. The obtained matrix provides information on the negative or positive correlation between variables. Principal component analysis (PCA) was conducted to identify the links between the different variables. The correlation coefficients between the 11 descriptors are shown in Table 3, and these descriptors are represented in a correlation circle in Figure 3.

##### 3.3. Multiple Linear Regressions (MLR)

In order to select the descriptors that most strongly affect the inhibitory activities of these compounds, a correlation analysis was performed with the statistical software XLSTAT 13, taking every calculated descriptor as an independent variable and pIC50 as the dependent variable. Based on this correlation analysis, stepwise multiple linear regression was used to establish the QSAR model. The coefficients R and R^{2}, the MSE, the MAE, and the F-value were used to select the best regression performance, where R is the correlation coefficient, R^{2} is the coefficient of determination, MSE is the mean squared error, MAE is the mean absolute error, and F is the Fisher F-statistic. The descriptors on which the anticancer activity of the sulfonamides depends are an electronic descriptor, the hydrogen bond acceptor count (HA), the hydrogen bond donor count (HD), and further steric descriptors, including an index.

The QSAR model obtained using multiple linear regression (MLR) is given by equation (2). As that equation indicates, the descriptors that most significantly affect the anticancer activity, that is, the inhibition of cancer cell proliferation, are an electronic descriptor and steric descriptors (including an index, HA, and HD); these descriptors are used to build the QSAR model for the pIC50 of the (E)-N-Aryl-2-ethene-sulfonamides.

The obtained model exhibits a high correlation coefficient with the electronic and steric descriptors. On the other hand, the results revealed that the relationship between the anticancer activities and the molecular parameters was not linear for the compounds based on (E)-N-Aryl-2-ethene-sulfonamide.

For our 40 compounds, the correlation between experimental activity and the calculated one based on this model is quite significant (Figure 4) as indicated by statistical values.

Here, N is the number of compounds, R is the correlation coefficient, F is the Fisher F-statistic, and MSE is the mean squared error.

Several statistical parameters, such as the regression coefficient (R), the squared correlation coefficient (R^{2}), the adjusted squared correlation coefficient (R^{2}_{adj}), the standard error of estimate (s), Fisher's value (F), and the significance level, are used to check the credibility of the developed models.

Generally, a good QSAR model has a large F, small MSE and MAE, a very small p-value, and R and R^{2} close to one. Equation (2) satisfies these criteria and is therefore statistically acceptable. Fisher's F-test was used: given that the probability corresponding to the F value was lower than 0.05 for anticancer activity, there is a risk lower than 0.01% in assuming that the null hypothesis is false. We can therefore conclude with confidence that the model can be used to predict the anticancer activity of a given compound.

As shown in (2), the descriptors with a negative influence on the anticancer activity (pIC50) of a given compound are the index and HA, while the parameters with a positive influence include HD. HA has a negative sign in the model, which suggests that activity can be increased by decreasing the number of heteroatoms (nitrogen or oxygen atoms). HD has a positive sign in the model, which suggests that activity can be increased by increasing the number of heteroatoms bearing one or more hydrogen atoms. The electronic descriptor also has a positive sign in the model, which suggests that substitution with stronger electron-donating groups, such as phenyl and ethyl, increases the anticancer activity of a given compound.

The true predictive power of a model lies in its ability to predict compounds from an external test set (compounds that were not used to develop the model). We tested this using the 16 compounds remaining from the dataset used to build the MLR model; the results are tabulated in Table 3. These results show that the experimental values and the values predicted by the MLR-derived QSAR model are close, which indicates that the developed model can be applied to predict the inhibitory activity of other sulfonamide derivatives. Figure 2 shows a very regular distribution of the predicted anticancer activity values against the experimental values.

##### 3.4. Multiple Nonlinear Regression (MNLR)

The descriptors are those retained by the MLR, and the data matrix comprised 32 compounds. The coefficients R and R^{2}, the mean absolute error (MAE), and the mean squared error (MSE) are used to select the best regression performance.

The resulting model is given by equation (3). The correlation coefficient obtained with (3) is high (0.91), which shows that the anticancer activities calculated with this model are close to those obtained experimentally and compare favorably with the results obtained by the MLR method (Figure 5).

##### 3.5. Artificial Neural Networks

In order to increase the probability of a good characterization of the studied compounds, an artificial neural network (ANN) was used to generate predictive QSAR models linking the set of molecular descriptors obtained from MLR to the experimentally observed activity. The correlation of the observed activities with the ANN-calculated ones is illustrated in Figure 6. The correlation coefficient and the standard deviation (SD = 0.211) obtained with the neural network show that the descriptors selected by MLR are pertinent and that the model proposed to predict the activity is relevant.

##### 3.6. Validation

We used the leave-one-out (LOO) procedure, which successively removes one molecule from the training set of 32 molecules. The procedure is repeated 32 times so that the property of every molecule is predicted once.

The consistency and reliability of the MLR and ANN models were validated using the cross-validation technique, which gave a good correlation, so the predictive power of this model is very significant (Figure 7). The most important result of this investigation is that in vitro anticancer activity may be predicted using QSAR methods. The artificial neural network gave the best results for building the quantitative structure-activity relationship models, and the model proposed in this study shows high predictive power.

In this part, the QSAR model includes several molecular descriptors, and the regression quality indicates that these descriptors provide valuable information and play a significant role in the assessment of the activity of (E)-N-Aryl-2-ethene-sulfonamide. The artificial neural network (ANN), fed with the relevant descriptors obtained from MLR, showed good agreement between the observed and predicted values, and the obtained model was confirmed by leave-one-out (LOO) cross-validation.

##### 3.7. Scrambling or Y-Randomization

Y-randomization is widely used in QSAR studies to ensure the robustness of the models obtained. Scrambling validates the QSAR model by comparing the performance of the original model to that of models built for permuted (randomly shuffled) responses based on the original descriptor pool and the original procedure used to build the model [40].

The aim of this procedure is to randomly shuffle the property values in the original set and to rebuild the cross-validated models. The correlation coefficient obtained for the scrambled molecules is the same as that obtained by applying the full model (Figure 8). These results indicate the absence of dependence between the descriptors excluded from and those included in the model. In addition, the measurement points closest to a target point do not mask the other experimental data and do not dominate the estimate, and the data used in this validation are regularly distributed in space, so the resulting model can be extrapolated to the entire series.

##### 3.8. Lipinski's Rule of Five

Each prospective drug must meet several basic criteria, such as a low cost of production, solubility, and stability, but must also comply with the criteria associated with the pharmacological properties of absorption, distribution, metabolism, excretion, and toxicity. The list of aspects to be considered is very long, but it is commonly accepted that molecules validating at least two conditions of the Lipinski rule are potential candidates. These rules are as follows:

(1) The molecular weight is less than or equal to 500 Da.

(2) It has 5 or fewer hydrogen bond donors (sum of OH and NH).

(3) It has 10 or fewer hydrogen bond acceptors (sum of O and N).

(4) Its log P value is less than or equal to 5.

log P is equal to the logarithm of the ratio of the concentrations of the test substance in octanol and in water: log P = log10(C_{octanol}/C_{water}). This value indicates the hydrophilic or hydrophobic (lipophilic) character of our compounds.

Various statistical studies have identified the best log P values for a compound (typically a drug) to be absorbed by the human body as follows:

(i) Brain penetration: 2.0

(ii) Oral absorption: 1.8

(iii) Sublingual absorption: 5

(iv) Percutaneous absorption: 2.6

Molecules that violate many of these rules can have problems with bioavailability. Therefore, this rule establishes some relevant structural parameters for the theoretical prediction of the oral bioavailability profile and is widely used in the design of new drugs.

The results of the calculation (Table 6) show that the compounds satisfy the rules of Lipinski, suggesting that they should not, in theory, have problems with oral bioavailability; the exceptions are molecule 32, which has 11 acceptor sites, and molecules 39 and 45, whose molecular weight exceeds 500 Da [40, 41].
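The four criteria can be checked programmatically once MW, HD, HA, and log P are available for each compound. The sketch below is illustrative only: the class and function names are hypothetical, and the descriptor values are invented for the example rather than taken from Table 6.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mw: float      # molecular weight in Da
    hbd: int       # hydrogen bond donors
    hba: int       # hydrogen bond acceptors
    logp: float    # octanol/water partition coefficient (log P)


def lipinski_violations(d):
    """Return the list of Lipinski criteria violated by a compound."""
    violations = []
    if d.mw > 500:
        violations.append("MW > 500 Da")
    if d.hbd > 5:
        violations.append("more than 5 H-bond donors")
    if d.hba > 10:
        violations.append("more than 10 H-bond acceptors")
    if d.logp > 5:
        violations.append("log P > 5")
    return violations


# Invented descriptor values for two hypothetical compounds.
compliant = Descriptors(mw=350.4, hbd=1, hba=5, logp=3.2)
borderline = Descriptors(mw=520.0, hbd=2, hba=11, logp=4.0)
print(lipinski_violations(compliant))
print(lipinski_violations(borderline))
```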

#### 4. Conclusion

Theoretical calculations were performed on a series of sulfonamides, and the resulting chemical information was expressed in the form of mathematical equations. Multiple linear regression (MLR) and multiple nonlinear regression (MNLR) analyses were used to build a quantitative structure-property relationship model of (E)-N-Aryl-2-ethene-sulfonamide analogs for their properties and anticancer activity.

The most important result of this investigation is that in vitro anticancer activity may be predicted using QSAR methods. The artificial neural network gave the best results for building the quantitative structure-activity relationship models, and the model proposed in this study shows high predictive power.

In this work, the QSAR model includes several molecular descriptors, and the regression quality indicates that these descriptors provide valuable information and play a significant role in the assessment of the activity of (E)-N-Aryl-2-ethene-sulfonamide analogs. Furthermore, we can conclude that the studied descriptors, which are sufficiently rich in chemical, electronic, and topological information to encode the structural features, may be used with other descriptors for the development of predictive QSAR models.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

The authors are grateful to the “Association Marocaine des Chimistes Théoriciens” (AMCT) for its pertinent help concerning the programs.