Abstract

The antimalarial activity of a series of 4-anilinoquinolines was modeled with topological and other functional descriptors using the feature selection approaches CP-MLR and GA. Five models were identified from each approach to explain the activity of the compounds. Together they shared eighteen descriptors, of which five, namely, H-052, MATS4m, MATS7e, Mor30p, and R7m, were common to both approaches. In PLS analysis the eighteen descriptors led to a three-component model that explained 73.1% of the variance in the activity of the training set (test set $r^2$ = 0.676), and the common descriptors were among the most influential ones in modulating the activity. Among them, MATS7e indicated that nonlinear and branched molecular topology favors higher activity. MATS4m also pointed to branching/nonlinearity in the molecule as favorable for the activity. H-052 suggested that R'CH2-CHX-CH2R fragments (X is halogen) in the scaffold enhance the activity. In BP-ANN these descriptors led to very good predictive models (training set $r^2$ > 0.81; test set $r^2$ > 0.75). The study offers direction for understanding the patterns of the antimalarial activity of anilinoquinolines and for exploring potential prototype compounds.

1. Introduction

Malaria is a vector-borne parasitic infection (vector: female mosquitoes of the Anopheles genus; parasite: protozoa, genus Plasmodium) of the tropical regions with serious health and economic implications. The interventional measures of the last decade have brought some relief in the number of deaths due to malaria. However, these efforts have not reduced the emergence of drug resistance [1]. In fact, until the recognition of drug-resistant strains of Plasmodium falciparum, the treatment of malaria relied heavily on chloroquine as the first-line drug [2, 3]. Also, in clinical practice chloroquine suffers from several limitations/side effects, which include gastrointestinal, stomach, and neural disturbances and blurring of vision [4, 5]. Mechanistic investigations on the antimalarial activity of this (quinoline) class have indicated that chloroquine and other analogues follow a similar pathway in the expression of the activity [6, 7]. The drug resistance of the parasite is compound centric and not due to an altered mechanism of action [8–10]. This has renewed the research interest in exploring alternative quinolines as potential antimalarial agents. Moreover, the existence of extensive preclinical and clinical information and the low cost/ease of preparation of alternative/new drugs or drug-like molecules have encouraged researchers to venture into this chemical class [11–16].

In the quinoline class of compounds, amodiaquine (Figure 1) is a clinically practiced antimalarial agent [17]. The chloroquine-resistant Plasmodium parasites are not automatically cross-resistant to it [18]. However, amodiaquine is reported to cause agranulocytosis and hepatitis [19, 20]. These side effects are attributed to the 4-hydroxyanilino moiety of amodiaquine. In biological systems it undergoes enzymatic oxidation to the quinoneimine form, which undergoes nucleophilic addition with proteins [21, 22]. In this scenario, to overcome the undesirable side effects of amodiaquine, different 7-chloro-4-(3′,5′-disubstituted anilino)quinolines were explored as alternative antimalarial agents [23–26]. These compounds structurally resemble amodiaquine but are devoid of the 4-hydroxyl group on the anilino moiety to which the side effects are attributed.

In the medicinal chemistry paradigm, rational drug design approaches, which include quantitative structure-activity relationship (QSAR) and molecular modeling protocols, cull out the structural and functional information of chemical entities that is desirable for the biological response. This may come in handy to modulate/design the biological response of intended compounds. Here the structure-activity elucidation of the compounds is attempted by taking into account the correlation between the chemical structure space indices and their biological response landscape. The earlier QSAR study [27] on some 7-chloro-4-(3′,5′-disubstituted anilino)quinolines, involving 2D molecular features, indicated that the 3′- and 5′-substituents of the anilino moiety map different domains with substructure preferences in the activity space. It also favored electron rich centers in the aniline substituent groups for better antimalarial activity. Against this background, the QSAR analysis of the antimalarial activity of an enlarged dataset of 4-anilinoquinolines has been undertaken with a view to broadening the structural information relevant to the activity space. The results are presented hereunder.

2. Materials and Methods

2.1. Chemical Structure Database and Biological Activity

A dataset of 90 anilinoquinolines (Figure 1(b)) along with their antimalarial activity (IC50, the concentration or dose, in micromoles, of compound required to inhibit the FcB1R strain of P. falciparum by 50%) reported in the literature was considered for this study [23–26]. The substitution positions in these compounds are briefly summarized in Table 1. The antimalarial activity (IC50) of all these compounds was reported using the same experimental protocol [23–26]. The compounds exhibited good variation (~2.6 orders of magnitude) in their antimalarial activity. For the purpose of the modeling study the activity has been transformed into the logarithm of the inverse of the inhibitory concentration and expressed as pIC50 (Table 2).
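The pIC50 transformation described above can be sketched as follows. This is only an illustration: it assumes that IC50 is supplied in micromolar units and converted to molar before the logarithm is taken, which the text does not state explicitly, and the function name and sample values are hypothetical.

```python
import math

def pic50_from_micromolar(ic50_um: float) -> float:
    """Return pIC50 = log10(1 / IC50), with IC50 converted from uM to mol/L.

    Assumption: molar units inside the logarithm (not confirmed by the source).
    """
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)

# Hypothetical IC50 values in micromolar, not taken from Table 2.
for ic50 in (0.01, 0.1, 1.0):
    print(ic50, round(pic50_from_micromolar(ic50), 2))
```

With this convention a tenfold drop in IC50 raises pIC50 by one unit, so a spread of about 2.6 orders of magnitude in IC50 corresponds to a pIC50 range of about 2.6.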

The structure database of the compounds under investigation has been generated using the X-ray crystal structure of amodiaquine (Figure 1(a)) [28] to impart 3D characteristics to the chemical space of the agents. Accordingly, the 3D structures of the compounds (Figure 1(b)) were generated in SYBYL [29] from this crystal structure, using the procedure implemented therein. In Dragon software [30] these conformations resulted in 490 and 686 descriptors, respectively, to profile the 0D to 2D and 3D characteristics of the molecules. Prior to the QSAR study, all descriptors showing a correlation of less than 0.1 with the dependent variable (descriptor versus activity, $|r| < 0.1$) and descriptors showing an intercorrelation greater than or equal to 0.9 ($|r| \geq 0.9$) were excluded. This reduced the 0D to 2D and 3D descriptors to 101 and 131, respectively, for correlating with the activity.
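A minimal sketch of the two correlation filters described above is given below. It is not the procedure used in Dragon or in the study itself; in particular, the rule for deciding which member of a highly intercorrelated pair is retained (here, the one better correlated with the activity) is an assumption, and the function name and toy data are hypothetical.

```python
import numpy as np

def prefilter_descriptors(X, y, min_activity_r=0.1, max_inter_r=0.9):
    """Return the column indices that survive both correlation filters.

    1. Drop descriptors with |r(descriptor, activity)| < min_activity_r.
    2. Visit the rest in order of decreasing |r| with the activity and drop
       any descriptor whose |r| with an already kept one is >= max_inter_r
       (assumed tie-breaking rule; the source does not specify one).
    Constant columns give NaN correlations and are dropped by step 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r_y = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    candidates = [j for j in np.argsort(-r_y) if r_y[j] >= min_activity_r]
    kept = []
    for j in candidates:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < max_inter_r for k in kept):
            kept.append(int(j))
    return sorted(kept)

# Toy data (hypothetical): six compounds, four descriptors.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([
    y,                                   # strongly correlated with activity
    [1.2, 1.9, 3.1, 4.0, 4.8, 6.1],      # near-duplicate of column 0
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],   # weak but non-negligible correlation
    [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],    # uncorrelated with activity
])
print(prefilter_descriptors(X, y))
```

In the toy data, column 1 is removed as a near-duplicate of column 0 and column 3 is removed for lack of correlation with the activity.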

For the QSAR study, the dataset was divided into two mutually exclusive groups, the training and test sets, using the fingerprints of the BIT-packed version of the Molecular ACCess System (FP-BIT-MACCS) of the compounds. The concept of molecular fingerprints was originally introduced by Molecular Design Limited, Inc. (MDL) as part of its informatics services to the life sciences and chemical industry [31]. In the Molecular Operating Environment (MOE) software [32], cluster analysis of the MACCS fingerprints of the compounds was carried out at 85% similarity to segregate the compounds into training and test sets. The compounds were arbitrarily assigned to the training set (50 compounds) and the test set (40 compounds) in such a way that the members of each cluster were distributed across both sets. Furthermore, to facilitate the comparison of the significance of descriptors with one another in the derived models, all descriptor values were scaled between 0 and 1 (inclusive of both values). For this, the original descriptor values were scaled using the following transformation:

$$X'_{ij} = \frac{X_{ij} - X_{j,\min}}{X_{j,\max} - X_{j,\min}}, \tag{1}$$

where $X_{ij}$, $X_{j,\min}$, $X_{j,\max}$, and $X'_{ij}$ are the training set feature $j$'s original, minimum, maximum, and transformed descriptor values, respectively.
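The min-max scaling described above can be illustrated with the short sketch below. It assumes that the minimum and maximum are taken from the training set only and then reused for the test set, as the wording of the text suggests; the function names and numbers are hypothetical.

```python
import numpy as np

def fit_minmax(X_train):
    """Column-wise minimum and maximum of the training descriptors."""
    X_train = np.asarray(X_train, dtype=float)
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, x_min, x_max):
    """Scale X with training-set extremes; constant columns map to 0."""
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (np.asarray(X, dtype=float) - x_min) / span

# Hypothetical descriptor values: three training compounds, one test compound.
X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
X_test = np.array([[7.0, 15.0]])
x_min, x_max = fit_minmax(X_train)
print(apply_minmax(X_train, x_min, x_max))
print(apply_minmax(X_test, x_min, x_max))
```

Training set values land in the closed interval from 0 to 1 by construction, whereas a test compound whose descriptor lies outside the training range (the first descriptor of the test compound here) is mapped outside that interval.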

The significance of obtained molecular features in explaining the antimalarial activity of the compounds has been investigated using the combinatorial protocol in multiple linear regression (CP-MLR) [33], genetic algorithm (GA) [34], partial least squares (PLS) [35, 36], and artificial neural networks (ANN) [37] methods. Only the training set compounds were used for deriving the models and the test set compounds were used for the external validation of the derived models. Purposefully a large test set was created to facilitate a follow-up study of derived models in back-propagation artificial neural networks (BP-ANN). The modeling procedures and the computations are briefly described below.

2.2. CP-MLR

Combinatorial protocol in multiple linear regression (CP-MLR) is a filter-based variable selection procedure [33]. Its procedural aspects are discussed in some recent publications [38, 39]. It operates through a combinatorial seeding strategy followed by predefined filters to assess the significance of the seeds, and finally employs MLR to develop models from the significant seeds. A unique combination of descriptors (variables) is referred to as a seed. Here, filter-1 seeds the variables by limiting the interparameter correlation to a predefined level; filter-2 controls the seeds through the $t$-values of the coefficients of the individual descriptors of the seed in the regression (default $t$-value ≥ 2.0); filter-3 provides a comparison of seeds in different equations in terms of the square root of the adjusted multiple correlation coefficient of the regression, $r$-bar; filter-4 estimates the consistency of the equation in terms of the cross-validated $R^2$ or $Q^2$, with leave-one-out (LOO) cross-validation as the default option. In CP-MLR, for the selection of features from the datasets, the initial threshold of filter-1 was set to 0.3 and subsequently relaxed to 0.79 to promote the formation of different seeds. The search was started with two-variable seeds and with an initial filter-3 value of 0.74. The information rich descriptors were collected by successively incrementing the number of variables per seed as well as the threshold of filter-3 to the optimum $r$-bar value of the preceding generation.
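The filter cascade can be sketched for a single generation of seeds as shown below. This is a simplified illustration and not the CP-MLR program of [33]: the function names are hypothetical, the filter-4 threshold of 0.3 is an assumed value, and the successive increase of seed size and of the filter-3 threshold is omitted.

```python
import itertools
import numpy as np

def fit_mlr(X, y):
    """OLS with intercept: descriptor t-values, r-bar, and LOO Q^2."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(A.T @ A)
    beta = xtx_inv @ A.T @ y
    resid = y - A @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    dof = n - k - 1
    t = beta / np.sqrt(np.diag(xtx_inv) * sse / dof)
    r_bar = np.sqrt(max(1.0 - (sse / dof) / (sst / (n - 1)), 0.0))
    h = np.einsum("ij,jk,ik->i", A, xtx_inv, A)          # leverages
    q2 = 1.0 - float(((resid / (1.0 - h)) ** 2).sum()) / sst  # exact LOO
    return t[1:], r_bar, q2

def cp_mlr(X, y, n_vars=2, f1=0.3, f2=2.0, f3=0.74, f4=0.3):
    """Return seeds passing all four filters, best r-bar first."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    models = []
    for seed in itertools.combinations(range(X.shape[1]), n_vars):
        # filter-1: reject seeds with intercorrelated descriptors
        if any(corr[i, j] > f1 for i, j in itertools.combinations(seed, 2)):
            continue
        t, r_bar, q2 = fit_mlr(X[:, list(seed)], y)
        # filter-2 (t-values), filter-3 (r-bar), filter-4 (Q^2; f4 assumed)
        if np.all(np.abs(t) >= f2) and r_bar >= f3 and q2 >= f4:
            models.append((seed, float(r_bar), float(q2)))
    return sorted(models, key=lambda m: -m[1])

# Synthetic data (hypothetical): four mutually orthogonal descriptors,
# with the activity depending on the first two only.
grid = np.arange(40)
X = np.column_stack([np.sin(2 * np.pi * grid / 40), np.cos(2 * np.pi * grid / 40),
                     np.sin(4 * np.pi * grid / 40), np.cos(6 * np.pi * grid / 40)])
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.2 * np.cos(10 * np.pi * grid / 40)
print(cp_mlr(X, y))
```

In this toy case only the seed containing the two relevant descriptors survives; seeds that pair a relevant descriptor with an irrelevant one are rejected by the $t$-value or $r$-bar filter.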

2.3. GA

The genetic algorithm variable subset selection (GA-VSS) routine as implemented in MOBY DIGS [40, 41] was used for the selection of GA features. It proceeded with an initial population of one hundred solutions (chromosomes), with a maximum of five variables allowed in a solution. The fitness of each chromosome was calculated based on leave-one-out (LOO) cross-validation ($Q^2_{\mathrm{LOO}}$). The reproduction/mutation trade-off ($T$) value was set to 0.5. Based on the $T$ value, the crossover and mutation values of GA were automatically fixed in situ in the computation. The optimum solutions were identified at the end of one hundred generations of the GA evolution process (selection, crossover, and mutation).

The models that emerged from the CP-MLR and GA approaches were further tested for chance correlation through one hundred simulation runs with repeated randomization of the biological response [42, 43]. The correlation coefficients of the simulated regressions were used to determine the average correlation coefficient of Y-randomization as well as the percent chance correlation of the model under scrutiny. Also, the derived models were externally validated by predicting the activities of the test set compounds, which were not part of the model generation exercise. The test set predictions were used for computing the test set $r$-square statistic ($r^2_{\mathrm{test}}$) of the model in question. Normally, models with an $r^2_{\mathrm{test}}$ value greater than 0.5 are treated as reliable. Finally, the descriptors identified in CP-MLR and GA were subjected to partial least squares (PLS) [35, 36] analysis to present single-window QSAR models comprising all identified descriptors.
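The Y-randomization test can be sketched as follows. The definition of percent chance correlation used here (the share of scrambled runs whose $r^2$ reaches that of the real model) is an assumption, since the text does not give a formula; the function names and synthetic data are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def y_randomization(X, y, n_runs=100, seed=0):
    """Return (model r2, mean scrambled r2, percent chance correlation)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r2_model = r_squared(X, y)
    # Refit the same descriptors against randomly permuted activities.
    r2_rand = np.array([r_squared(X, rng.permutation(y)) for _ in range(n_runs)])
    chance = 100.0 * float(np.mean(r2_rand >= r2_model))  # assumed definition
    return r2_model, float(r2_rand.mean()), chance

# Synthetic example (hypothetical): activity driven by two descriptors.
grid = np.arange(40)
X = np.column_stack([np.sin(2 * np.pi * grid / 40), np.cos(2 * np.pi * grid / 40)])
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.2 * np.cos(10 * np.pi * grid / 40)
r2_model, r2_rand_mean, chance = y_randomization(X, y)
print(round(r2_model, 3), round(r2_rand_mean, 3), chance)
```

A model that reflects a genuine relationship should show a much higher $r^2$ than the average of the scrambled runs and a chance correlation close to zero.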

2.4. Applicability Domain

The usefulness of a model may be judged by its ability to predict new compounds. In this context, the applicability domain defines the predictive space of a model. The training set data, when projected into the model's multivariate parameter space, divide the space into regions populated with data and empty ones. Here, the populated regions define the applicability domain of the model and indicate that the space is suitable for predictions. Computationally, the applicability domain of the models is evaluated through the plot of standardized residuals versus leverage values ($h$) for each compound [44]. It is also known as the Williams plot and is useful for the detection of both the response outliers ($Y$-outliers) and the structurally influential chemicals ($X$-outliers) in the model. In this plot, the applicability domain is the square area bounded by $\pm x$ standard deviations (where $x$ may be given a value between 2 and 3) and the leverage threshold ($h^*$), which is typically fixed at $3(k+1)/n$ (where $n$ is the number of compounds in the training set and $k$ is the number of parameters in the model). In this plot, if a compound's leverage value ($h$) is smaller than $h^*$, its prediction may be considered as reliable as those of the training set compounds. Making use of these settings, the applicability domain of the models from CP-MLR, GA, and PLS has been scrutinized for their predictive capability.
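The quantities behind a Williams plot can be computed as in the sketch below. It assumes the usual hat-matrix definition of leverage with an intercept column, which the text does not spell out, and the function names are hypothetical. For a five-parameter model built on 50 training compounds the threshold works out to $3(5+1)/50 = 0.36$.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h = x (A^T A)^-1 x^T of each query compound, where A is the
    training descriptor matrix with an intercept column (assumed form)."""
    A = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, dtype=float)])
    Q = np.column_stack([np.ones(len(X_query)), np.asarray(X_query, dtype=float)])
    core = np.linalg.inv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)

def leverage_threshold(n_train, n_params):
    """Warning leverage h* = 3(k + 1) / n."""
    return 3.0 * (n_params + 1) / n_train

def outside_domain(residual, leverage, sd, h_star, x=2.5):
    """Return (is response outlier, is leverage outlier)."""
    return abs(residual) > x * sd, leverage > h_star

h_star = leverage_threshold(50, 5)
print(h_star)

# Self-leverages of a small synthetic training set (hypothetical data);
# they sum to the number of fitted coefficients, k + 1.
grid = np.arange(40)
X_demo = np.column_stack([np.sin(2 * np.pi * grid / 40), np.cos(2 * np.pi * grid / 40)])
h_demo = leverages(X_demo, X_demo)
print(round(float(h_demo.sum()), 6))
```

Compounds for which either flag of `outside_domain` is true fall outside the square region of the plot and their predictions should be treated with caution.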

2.5. BP-ANN

In ANN modeling a training set was used for the model generation while a validation set was applied to stop the overfitting of the network. Additionally, a test set was used to verify the predictivity of the generated model. In the computation, the CP-MLR/GA training set (50 compounds) was used as such for training the network. However, the test set (40 compounds) of CP-MLR/GA was randomly divided into the ANN's validation (20 compounds) and test (20 compounds) sets. In line with the number of descriptors in the individual feature selection models, five descriptors were considered in the input for ANN as well. Before training the networks, the input and output values were normalized by autoscaling all data. The initial weights were selected randomly between −0.3 and 0.3. The optimum number of nodes for the hidden layer was assessed using the standard evaluation procedure with different numbers of hidden layer nodes. The goal of training the network is to minimize the output errors by changing the weights between the layers [37]. Equation (2) gives the changes in the values of the weights in the network in the optimization of the output:

$$\Delta w_{ij,n} = F_n + \alpha\,\Delta w_{ij,n-1}. \tag{2}$$

In this, $\Delta w_{ij}$ is the change in the weight factor for each network node, $\alpha$ is the momentum factor, and $F$ is a weight update function, which indicates how the weights are changed during the learning process. The weights of the hidden layer were optimized using a second derivative optimization method, namely, the Levenberg-Marquardt algorithm [45, 46].

2.6. Levenberg-Marquardt Algorithm

In this algorithm, the update function, $F_n$, is calculated using the following equations:

$$F_0 = -g_0,$$

$$F_n = -\left[J^{T} J + \mu I\right]^{-1} J^{T} e,$$

where $g$ is the gradient, $J$ is the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights, $I$ is the identity matrix, and $e$ is a vector of network errors. The parameter $\mu$ is multiplied by some factor ($\lambda$) whenever a step would result in an increased $e$, and when a step reduces $e$, $\mu$ is divided by $\lambda$.
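The damping logic of the Levenberg-Marquardt update can be illustrated on a small least-squares problem. The sketch below is a generic textbook form of the algorithm and not the MATLAB routine used in the study; the names `mu` and `factor` stand for the damping parameter and its scaling factor, and the starting values and the toy line-fitting problem are hypothetical.

```python
import numpy as np

def lm_step(J, e, mu):
    """Update F = -[J^T J + mu I]^-1 J^T e."""
    JtJ = J.T @ J
    return -np.linalg.solve(JtJ + mu * np.eye(JtJ.shape[0]), J.T @ e)

def lm_fit(residual_fn, jacobian_fn, w0, mu=0.01, factor=10.0, n_iter=50):
    """Minimize the sum of squared errors with adaptive damping."""
    w = np.asarray(w0, dtype=float)
    e = residual_fn(w)
    sse = float(e @ e)
    for _ in range(n_iter):
        step = lm_step(jacobian_fn(w), e, mu)
        e_new = residual_fn(w + step)
        sse_new = float(e_new @ e_new)
        if sse_new < sse:
            # The step reduced the error: accept it and relax the damping.
            w, e, sse = w + step, e_new, sse_new
            mu /= factor
        else:
            # The step would increase the error: reject it and damp harder.
            mu *= factor
    return w, sse

# Toy problem (hypothetical): fit slope and intercept of a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0])
target = 2.0 * x + 1.0
A = np.column_stack([x, np.ones_like(x)])
w_fit, sse_fit = lm_fit(lambda w: A @ w - target, lambda w: A, w0=[0.0, 0.0])
print(np.round(w_fit, 6), sse_fit)
```

A small damping value makes the step behave like a Gauss-Newton step, while a large value shrinks it towards a short gradient-descent step, which is why the damping is raised after a failed step and lowered after a successful one.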

2.7. Statistical Parameters

In training the network, the overfitting of data was controlled by comparing the root-mean-square errors (RMSEs) of the training and validation sets. The RMSE measures the agreement between the network output and the target values. The training of the network for the prediction of the target value was stopped when the RMSE of the validation set began to increase while that of the training set continued to decrease. The goodness of fit of the activity of the test set compounds was used to further validate the developed models. The predictive ability of the constructed models was assessed using different statistical measures, namely, the training, validation, and test sets' correlation coefficients ($r^2$) and the corresponding root-mean-square error of prediction (RMSEP), relative standard error of prediction (RSEP), and mean absolute error (MAE) values. More information on the statistical parameters can be found in an applied statistics handbook [47]. The statistical parameters used in the study are calculated using the following equations:

$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}},$$

$$\mathrm{RSEP}\,(\%) = 100 \times \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} y_i^2}},$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left|\hat{y}_i - y_i\right|,$$

where $y_i$ is the observed activity, $\bar{y}$ is the mean of the observed activity values, $\hat{y}_i$ is the predicted activity of the compound in the sample, and $n$ is the number of samples in the concerned set. The ANN computations were carried out using MATLAB 7.6 for Windows [48].
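The error statistics named above can be computed as in the following sketch. It uses the common textbook definitions of these quantities, which are assumed here to match those of the study; the function names and the sample values are hypothetical.

```python
import math

def rmsep(y_obs, y_pred):
    """Root-mean-square error of prediction."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))

def rsep(y_obs, y_pred):
    """Relative standard error of prediction, in percent."""
    sse = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    return 100.0 * math.sqrt(sse / sum(o ** 2 for o in y_obs))

def mae(y_obs, y_pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(y_obs, y_pred)) / len(y_obs)

def r_squared(y_obs, y_pred):
    """Fraction of the variance in the observed values that is explained."""
    mean_obs = sum(y_obs) / len(y_obs)
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    sst = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - sse / sst

# Hypothetical observed and predicted activities.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmsep(y_obs, y_pred), rsep(y_obs, y_pred), mae(y_obs, y_pred), r_squared(y_obs, y_pred))
```

Computing the same statistics separately for the training, validation, and test sets makes it easy to see whether the validation error starts to rise while the training error keeps falling, which is the stopping signal described in the text.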

3. Results and Discussion

The QSAR analysis of the antimalarial activity of anilinoquinolines has been carried out with the CP-MLR and GA approaches using the 0D to 3D features of the molecules from Dragon software. At the end of the analysis, five 5-parameter equations from each approach were identified as significant in modeling the activity of the compounds. The models identified from each approach are shown in Table 3. There are no common models between the CP-MLR and GA approaches. However, several descriptors are common to models from both approaches. The models have predicted the activities of the training and test set compounds within reasonable limits of their actual values. Statistically, they have explained between 66% and 69% of the variance ($r^2$ = 0.66 to 0.69) in the activity of the training set compounds and also predicted more than 50% of the variance ($r^2_{\mathrm{test}}$ > 0.5) in the activity of the test set compounds (Table 3). For selected CP-MLR and GA models the training and test set predictions are shown in Table 2.

The equations from CP-MLR have jointly shared eleven descriptors and likewise the GA equations have shared twelve descriptors (Table 4). Together, these models have led to 18 descriptors (Table 4) as information rich features to model the antimalarial activity of the compounds. All these descriptors belong to seven different classes, namely, functional groups (COOR, NR2), atom-centered fragments (H-047, H-052), 2D autocorrelations (MATS4m, MATS8m, MATS5e, MATS7e), radial distribution function (RDF085p), 3D molecule representation of structures based on electron diffraction signals (Mor15m, Mor28m, Mor17p, Mor30p), Weighted Holistic Invariant Molecular descriptors (E1m, E2m), and GEometry, Topology, and Atom-Weights AssemblY (R6m, R7m, R7m+, R6e+, RTe+) descriptors (Table 4). A brief physical meaning of these descriptors in terms of structural features is described in Table 4.

Among the identified variables (Table 4), five descriptors (H-052, MATS4m, MATS7e, Mor30p, and R7m) are common to both the CP-MLR and GA approaches. Of these, MATS7e (Moran autocorrelation of lag 7 weighted by atomic Sanderson electronegativities) has appeared in all models with a negative regression coefficient (Table 3). This indicates that a molecular topology leading to a reduced autocorrelation of lag 7 weighted by atomic electronegativities improves the activity. This in turn suggests that nonlinear and/or branched molecular topology leads to higher activity. The descriptors H-052 (with a positive regression coefficient) and Mor30p (with a negative regression coefficient) are part of all CP-MLR models and are also present in some GA models (Table 3). H-052 argues in favor of R′CH2–CHX–CH2R fragments (X is a halogen atom) in the scaffold for the activity. Mor30p is a 3D molecule representation of structure based on electron diffraction, weighted by atomic polarizability. It describes the mutual arrangement of atoms in the molecule leading to the 3D distribution of the chosen property, that is, polarizability. The negative regression coefficient of Mor30p suggests that arrangements of atoms leading to small descriptor values favor high activity. The descriptors MATS4m (with a positive regression coefficient) and R7m (with a negative regression coefficient) have appeared in selected models of both the CP-MLR and GA approaches (Table 3). The positive regression coefficient of MATS4m shows that small path lengths and branching in the molecule (lag 4 weighted by atomic mass) contribute to higher activity. R7m is also a kind of autocorrelation of lag 7 weighted by atomic mass, derived from the molecular leverage matrix. The negative regression coefficient of R7m argues that similar or almost similar atomic leverages (of lag 7) raise the activity (Table 3).
Apart from the foregoing features, RDF085p, E1m, E2m, RTe+, R6m, and R7m+ are exclusive to the models from CP-MLR and MATS8m, MATS5e, H-047, NR2, Mor15m, Mor17p, and R6e+ are exclusive to those from GA approach.

The descriptor RDF085p (Table 3; (2)) measures the probability of finding molecular constituents in a spherical volume of radius 8.5 Å weighted by atomic polarizability. Its positive regression coefficient argues in favor of this for improvement in antimalarial activity. The descriptors E1m and E2m (Table 3; (3)–(5)) represent 3D molecular information of atomic densities along principal axes 1 and 2 weighted by atomic mass. The principal axes of a molecule are obtained from the eigenvalues and eigenvectors of the weighted covariance matrix of its centered Cartesian coordinates. The descriptors are derived from the projections of the atoms (of the molecule) along each individual principal axis and convey information related to molecular size, shape, symmetry, and atom distribution. In the regression equations (Table 3; (3)–(5)) E1m and E2m are associated with negative and positive coefficients, respectively. This argues in favor of an atomic arrangement that maximizes the second principal axis of the molecule for high activity. Similar to R7m, the other GETAWAY class descriptors RTe+, R7m+, and R6m (Table 3; (1), (4), (5)) are associated with negative regression coefficients. While RTe+ accounts for the maximal molecular leverage autocorrelation index weighted by atomic Sanderson electronegativities, R7m+ accounts for the maximal molecular leverage autocorrelation of lag 7 weighted by atomic mass. R6m is the molecular leverage autocorrelation of lag 6 weighted by atomic mass. All these descriptors advocate similar or almost similar leverages for high activity.

Concerning the descriptors exclusive to the equations from GA, the functional group descriptor nNR2 (Table 3; (9)) accounts for the number of tertiary aliphatic amines in the molecule. Its positive regression coefficient speaks in favor of tertiary aliphatic amines for high activity. H-047 (Table 3; (7), (8)) has appeared in the GA equations with a positive regression coefficient. In these analogues, it argues that unsubstituted methylenes lead to an improvement in activity. The 2D autocorrelation descriptors MATS8m and MATS5e (Table 3; (10)) have appeared with negative regression coefficients. Both descriptors favor a molecular topology leading to a reduced autocorrelation of lag 8 weighted by atomic mass and of lag 5 weighted by atomic electronegativities for improved activity. They further illustrate that nonlinear and/or branched molecular topology increases the activity. The descriptors Mor15m (Table 3; (9); positive regression coefficient) and Mor17p (Table 3; (6), (10); negative regression coefficient), similar to Mor30p, most probably show the influence of the specific distribution of atoms in the molecule on its activity. The GETAWAY class descriptor R6e+ has appeared in Table 3 ((6) and (10)) with a positive regression coefficient. It represents the maximal molecular leverage autocorrelation of lag 6 weighted by atomic Sanderson electronegativities. This suggests that increasing divergence in the leverage of lag 6 contributes to higher activity.

As a follow-up to feature identification, PLS analysis has been carried out on the eighteen descriptors of CP-MLR and GA, and separately on the five descriptors common to both approaches, to facilitate the development of single-window structure-activity models. For the PLS analysis, the descriptors have been autoscaled (zero mean and unit s.d.) to give each of them equal weight in the study. In the cross-validation procedure of the PLS analysis [35, 36], three components were found to be the optimum to explain the activity of the compounds. The PLS model from the eighteen descriptors of CP-MLR cum GA explained 73.1% of the variance ($r^2$ = 0.731) in the antimalarial activity of the training set compounds and showed a test set $r^2$ value of 0.676. Figure 2 shows a plot of the fraction contribution of the normalized regression coefficients of these descriptors to the activity. Of the eighteen descriptors, the five descriptors common to both approaches are among the most significant contributors in modulating the activity of the compounds. Also, the PLS model from these five common descriptors of CP-MLR and GA explained 63.8% of the variance ($r^2$ = 0.638) in the antimalarial activity of the training set compounds and showed a test set $r^2$ value of 0.510. The MLR-like PLS coefficients of these two feature sets are shown in Table 5. All descriptors conveyed the same meaning as in the regression equations from CP-MLR and GA.

The predictive ability of the regression models derived from the CP-MLR, GA, and PLS approaches is assessed using applicability domain (AD) analysis. The AD plots for Eq. (1), Eq. (6), and the PLS model are shown in Figure 3. They are from the models involving all the compounds, that is, the training and test sets together. In the plots, the limits for $Y$-outliers (response outliers) were set to 2.5 standard deviation units. In the AD plot of Eq. (6) (Figure 3(b)), two test set compounds are marginally outside the allowed region. Of these two, one compound (AQ14) is a response outlier (observed residual value is 1.061; limiting residual value is ±0.993) and the other compound (AQ01) is a leverage outlier (observed leverage is 0.366; limiting leverage value is 0.36). Except for these minor deviations, the AD plots support the predictive power of the presented models. Also, the models are free from serious or influential outliers (Figure 3).

The models discussed so far could explain up to 73% of the variance in the activity. The prevalence of some degree of nonlinearity in the relation between the activity and the structural features is among the main reasons for this situation. Often the biological activity landscape of chemical entities is far more nonlinear than that of their physicochemical (and other) properties. In modeling studies artificial neural networks (ANNs) have a special place in addressing these situations. In ANN, using descriptors from feature selection approaches is a desirable option, as they provide direction for the modification of chemical space to carry out activity modulation [49]. In view of this, the features of selected models of CP-MLR and GA (Table 3; (1) and (6)) and the five common descriptors of CP-MLR and GA (MATS4m, MATS7e, H-052, Mor30p, and R7m) have been used separately for the development of three BP-ANN models for the activity. The ANN architecture with the network parameters and the predictive statistics of the resulting models are shown in Table 6. In the ANN models, these descriptors have explained the antimalarial activity of the compounds well ($r^2$ > 0.81). They also gave satisfactory predictions for the test set compounds (test set $r^2$ > 0.75). The plots of observed versus ANN predicted activities are shown in Figure 4. In the ANN models also, the features of the CP-MLR, GA, and common sets carry the same meaning as discussed in the previous paragraphs. The results demonstrate that these descriptors have the ability to identify the patterns in the data and predict the activity of potential analogues.

4. Conclusions

The antimalarial activity of a series of anilinoquinolines was modeled with the feature selection approaches CP-MLR and GA. This has led to the identification of eighteen descriptors to model the activity of the compounds. Among the identified descriptors, five (H-052, MATS4m, MATS7e, Mor30p, and R7m) are common to both the CP-MLR and GA approaches. For the development of the single-window structure-activity model, all eighteen features were analyzed in PLS. In the PLS analysis, the common descriptors of CP-MLR and GA are found among the most influential ones in modulating the activity of the anilinoquinolines. In the regression as well as the PLS models the negative coefficient of MATS7e argued that nonlinear and/or branched molecular topology leads to higher activity. H-052 represents the hydrogen(s) attached to an sp3 carbon that is next to a carbon bearing halogens. Its regression coefficient favored the presence of such fragments in the substituents for higher activity. In BP-ANN, the descriptors from the selected equations of both feature selection approaches and the five most significant descriptors of the PLS analysis (MATS4m, MATS7e, H-052, Mor30p, and R7m) have explained more than 81% of the variance in the antimalarial activity of the training set compounds and showed a test set $r^2$ value greater than 0.75. These results offer direction for understanding the patterns of the antimalarial activity of anilinoquinolines and may serve to predict the activity of potential prototype compounds. The values of the eighteen descriptors involved in the regression equations are provided as supplementary material to facilitate likely structural exploration (Supplementary material will be available online at http://dx.doi.org/10.1155/2013/154629).

Acknowledgment

This work is supported by CDRI Communication no. 8310.

Supplementary Materials

Supplementary Data: Molecular indices involved in derived models
