Abstract

QSAR based on molecular topology (MT) is an excellent methodology used in predicting physicochemical and biological properties of compounds. This approach is applied here for the development of a mathematical model capable to recognize drugs showing dyspnea as a side effect. Using linear discriminant analysis, it was found a four-variable regression equations enabling a predictive rate of about 81% and 73% in the training and test sets of compounds, respectively. These results demonstrate that QSAR-MT is an efficient tool to predict the appearance of dyspnea associated with drug consumption.

1. Introduction

1.1. Dyspnea as Side Effect

A side effect can be defined as an expected and known effect of a drug that is not the intended therapeutic outcome [1]. The US has been forced to remove from the market 75 drugs and combination drugs products since 1969 for safety reasons [2]. Although it is almost imperceptible if referred to all marketed drugs (less than 1%), safety-related regulatory actions (e.g. labeling changes, such as the addition of precautions, contraindications, or black box warnings) are much more common and less widespread. From 1969 to 2002, the Adverse Event Reporting System (AERS) of the Food and Drug Administration (FDA), as Nebeker et al. remarked, received approximately 2.3 million reports of adverse events on more than 6,000 drug products [2].

Adverse drug effects (ADEs), are a cause of injury or death to about 770,000 people each year, which may cost up to $5.6 million each year per hospital depending on the hospital size [3]. Therefore, anticipating the side effect profile for drugs is becoming more and more important in current drug discovery, development, and marketing. This strategy can lead to millions of dollars of savings in health care.

One side-effect having notable repercussions in national health care is dyspnea, defined by the American Thoracic Society as “a term used to characterize a subjective experience of breathing discomfort that is comprised of qualitatively distinct sensations that vary in intensity” [4]. Moreover, this symptom, characterized by the difficulty of getting sufficient air past the larynx, is associated with secondary physiological and behavioral responses [4].

Dyspnea is largely linked to people suffering from advanced cancer and cardiac, respiratory, and certain neurological diseases [5]. Patients mostly respond to breathlessness by adopting a sedentary lifestyle in order to relieve their symptomatology. This leads to social isolation, depression, fatigue, and dissatisfaction with life and significant emotional distress apart from skeletal muscle deconditioning [5, 6].

Dyspnea is one of the most distressing symptoms suffered by 30–75% terminal cancer patients [5]. The care of patients with dyspnea requires multidisciplinary resources such as palliative care, physiotherapy, respiratory medicine, and nursing [5].

There are different types of drugs causing dyspnea as a part of their side effects, such as chemotherapeutic agents (Capecitabine, Imatinib, Irinotecan…) [7], and other therapeutic drugs (Amphotericin Bs, Nicotine, Dofetilide, Cyclosporine…) [7].

Breathlessness has marked significant impact on the quality of life of the patients but has also an economic impact in our society. An example of that can be seen in a study quantifying the direct medical costs of dyspnea patients in 2008 or 2009 with a history of acute coronary syndrome (ACS) [8].

ACS patients included in the study were required to have six months of continuous medical enrollment prior to an emergency room (ER) visit. A total of 8433 emergency room (ER) visits for dyspnea were identified during 2008 to 2009 from these databases of approximately 74 million beneficiaries as reported by Bonafede et al. [8].

The average cost per dyspnea episode was $6958, associated with the ER visit, physician services, and diagnosis techniques such as electrocardiogram (71.3%) and chest radiograph (75.9%) as Bonafede et al. reported [8]. This is not all, more than one-fourth (25.8%) of dyspnea ER visits preceded an inpatient stay, with an average cost of $20 693 per patient. As the authors remarked, dyspnea is a significant event associated with high medical resource utilization and hospital costs.

1.2. Molecular Topology

Molecular topology (MT) is a discipline based in the topological description of molecules by using numerical invariants, called topological indices (TIs). These descriptors are able to characterize the most important features of molecular structure: molecular size, binding, cycles and branching.

Topological indices have the advantage of being true structural invariants, so that they are independent of the spatial and temporal position of the atoms in the molecule. However, TI’s extensions that give account of three-dimensional structure have been also devised [911].

Many physicochemical and biological properties have been predicted by MT up to date, including groups of compounds showing considerable structural diversity [1216]. Among the properties modeled stand several pharmacological activities such as anticonvulsant [17], antimalarial [18, 19], antimicrobial [20, 21], antifungal [22], antineoplastic [23], antihistaminic [24], bronchodilator [25], cytostatic [26], and anti-inflammatory compounds [27, 28] just to mention some examples.

No matter the application field, MT’s strength lies in the reliable prediction of specific activities or properties of molecules. This way it is possible to select or design new compounds, particularly new drugs, thereby producing a high social and economic impact.

In a very recent paper [29], we demonstrated MT’s effectiveness in predicting drug-induced anorexia. As a follow-up study, in the present work it is sought to raise one more step in the complex world of adverse effects to drugs, analyzing another side effect, namely, dyspnea.

2. Material and Methods

The application of the MT-model, involves the following steps (Figure 1).Step 1. Selection of dataset from the literature: the data were comprised of both drugs inducing (active) and not inducing (inactive) dyspnea.Step 2. Calculation of topological descriptors: for that purpose, we used Dragon software, version 5.4. [30].Step 3. Splitting of the dataset in two groups training set and test set: the criteria applied for such a splitting was based on the degree of dyspnea induction by the drugs. A level of 3% was used as threshold to distinguish dypsneagenic (above 3%) from non-dypsneagenic (below 3%) drugs. Step 4. Application of linear discriminant analysis (LDA) to the training set.Step 5. Validation of the LDA through an external test set. Step 6. Application of the topological model to the identification of potential dyspnea-inducers: (not carried out here).

2.1. Selection of Dataset and Indices Calculation

A model to distinguish the dypsneagenic (active), from the nondypsneagenic (inactive) drugs, was built up with a dataset of 176 drugs. The drugs were from the database named SIDER [7] and from the internet site “drugs.com” [31].

The dataset was split into two, namely, a training and a test set. The first set was made up of 33 compounds causing dyspnea as side-effect with an incidence rate greater than 3% (active set) and 68 compounds showing an incidence rate less than or equal to 3% (inactive set). The second was made of 30 compounds causing dyspnea (test active set) and 45 compounds not showing dyspnea as a side effect (test inactive set).

The chemical structure of each drug was depicted by using ChemBioDraw Ultra version 12.0 (CambridgeSoft Corporation, Cambridge, MA, USA).

2.2. Molecular Descriptors

Each compound was characterized by a set of 444 topological indices (TIs) obtained by Dragon software, version 5.4. Among the graph-theoretical descriptors calculated, the 2D autocorrelation indices demonstrated to be the most representative, and hence they were selected.

The chemical structures of the drugs studied were very heterogeneous. To guarantee that all groups were balanced, a study of molecular similarity using TIs was performed. Among the parameters calculated were the Tanimoto coefficient, TC, and the Euclidean distance, ED as follows:

Euclidean distance: Tanimoto coefficient: where and correspond to the topological indices of molecules and .

The topological similarity between two given compounds, and , will be higher when TC is closer to the unit and ED closer to zero [32].

Figure 2 shows the pairs of compounds with greater topological similarity (Adenosine, Lenalidomide and Clonazepam, Olanzapine for the active and the inactive training group, resp.) and with lesser similarity (Cyclosporine, Nicotine and Nitric oxide, and Ritonavir). The average values of the parameters TC and ED for all compounds studied (; ) are analog to those obtained for the training set (; and ; for the active and inactive group, resp.). The results for the test set were similar to those from the training set, what indicates that the groups were well arranged.

2.3. Modeling Techniques

Linear discriminant analysis (LDA) [33] is a statistical technique providing a classification based on the combination of variables that best predict the category or group to which a given object—a compound in our case-belongs. The compounds in the training set were allocated to active or inactive groups, according to their capability to produce dyspnea. Hence, the discriminant property was the capability of producing dyspnea as a side effect, and the independent variables were the TIs. The LDA final outcome is a discriminant function (DF), that is, an equation relating the activity, expressed in disjunctive terms in a Boolean way (1 = active; 2 = inactive), with the set of TIs. To get the LDA, the software Statistica version 9.0 (StatSoft, Inc., Tulsa) was used.

The discriminant capability was assessed as the percentage of correct classifications in each set of compounds. The classification criterion was the minimal Mahalanobis distance [34] (distance of each case to the mean of all the cases in a category), and the quality of the discriminant function was evaluated by using the Wilks parameter [35, 36], , which was obtained by multivariate analysis of variance that tests the equality of the group means for the variable in the discriminant model. The method used to select the descriptors were based on the Fisher-Snedecor parameter (F) [37], which determines the relative importance of candidate variables. The topological variables input is chosen in a stepwise manner; at each step, the variable that makes the largest contribution to the separation of the groups is entered into the discriminant equation (or the variable that makes the smallest contribution is removed).

The validation of the selected function was done using an external test set. Compounds that comprise the test set were randomly selected from approximately 20% of the data.

Another important parameter that usually provides a balanced evaluation of the model’s prediction is the Matthews correlation coefficient (MCC) [38]. This coefficient is based on the fact that in any prediction process there can be four different possibilities to account for:TP: true positive, a drug-induced dyspnea correctly classified or predicted.FP: False positive, a drug not inducing dyspnea predicted as induced or when there was none to predict.TN: True negative, a drug not inducing dyspnea correctly classified.FN: False negative, a drug-induced dyspnea predicted as not inducing or when there was none to predict.

It is clear therefore, that any single number that represents the predictive power of the method must account for all of the possibilities listed above. One such factor is the Matthews correlation coefficient, which is given by: The Matthews correlation coefficient ranges from −1 ≤ MCC ≤ 1. A value of MCC = 1 indicates the best possible prediction, in that every drug and all drug inducting dyspnea are correctly predicted. A MCC = −1 indicates the worst possible prediction (or anticorrelation), where no one dyspneagenic drug is detected, whereas all the nondyspneagenic drugs are erroneously predicted as dypsneagenic. Finally, a Matthews correlation coefficient of MCC = 0 indicates a random prediction.

Furthermore, a receiver operating characteristic curve (ROC) was drawn to evaluate the accuracy of the selected DF through the sensitivity (true positive fraction) and specificity (true negative fraction) for different DF thresholds. ROC curve is the representation of sensitivity versus (1-specificity) (false positive fraction). The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. Accuracy is measured by the area under the ROC curve, AUC [39, 40]. An AUC represents a perfect test, whereas AUC of 0.5 represents a worthless test.

The model’s predictive power was also assessed by using internal and external validation tests. A cross-validation, as an internal validation, was carried out by changing the roles of randomly 15–20% of active and inactive compounds from the training to the test set. Later on, it is checked if the model continues to show a good classification rate of the remaining compounds in the training and test set or not.

The equation obtained for the training set is used to predict the corresponding values of the test set.

2.4. Arrangement of the Pharmacological Distribution Diagram

The corresponding distribution diagram PDD [41] was drawn for dyspnea. Such diagrams are graphic representations providing a straightforward way to visualize the regions of minimum overlap between the active and inactive compounds, as well as the regions in which the probability of finding active compounds is maximum. Actually, a PDD is a frequency distribution diagram of dependent variables in which the ordinate represents the expectancy, , (probability of activity) and the abscissa represents the DF values in the range. For an arbitrary range of values of a given function, an expectancy of activity can be defined as , where “” is the number of active compounds in the range divided by the total number of active compounds, and “” is the number of inactive compounds in the interval divided by the total number of inactive compounds.The expectancy of inactivity is defined in a symmetrical way, as . Upon these diagrams, it is easy to visualize the intervals in which there is a maximum probability of finding new active compounds and a minimum probability of finding inactive compounds.

3. Results and Discussion

Appendix in Supplementary material shows the values of the indices for every compound conforming the training and the test sets. The discriminant function selected as the best one was where DF is discriminant function, ATS2e is Broto-Moreau autocorrelation of a topological structure at lag 2 weighted by atomic Sanderson electronegativities, MATS2v is Moran autocorrelation at lag 2 weighted by atomic van der Waals volumes, MATS8e is Moran autocorrelation at lag 8 weighted by atomic Sanderson electronegativities, MATS1p is Moran autocorrelation at lag 1 weighted by atomic polarizabilities, is Number of data compounds, is Wilks’ lambda, is Fisher-Snedecor parameter, and is Statistic significance.

From this equation, a given compound will be selected as active, that is, as a potential producer of dyspnea, if ; otherwise, it is classified as inactive. The classification matrix for DF is very significant for the training set: 82% of correct prediction for the active group, that is, 27 out of 33 compounds, and 55 out of 68 (81%) for the inactive group (see Table 1).

As pointed above, an additional way to check the model’s predictive capability is through the Matthews’ coefficient, which returns a range of values between −1 and . The higher its value the more reliable is the model. However, we calculated the Matthews’ coefficient in a slightly different way, that is, just adding to each threshold value, so that the outcome is expressed as % accuracy. In other words, 0 would mean no correlation at all, 1 represents 50%, and 2 accounts for 100% correlation. By doing so, the output was 83% of accuracy (modified Matthews’ coefficient = 1.66).

An internal cross-validation (CV) analysis was also carried out to check the DF quality. Table 2 shows the values of (Wilks’ lambda) and the classification matrix for compounds in the training and test sets. The values of λ for both sets are very close to each other. In fact, the values for the selected model and the average of the cross-validation model were , and , respectively. Moreover, the average percentage of correctness classification is also similar in both models (78% for the average CV and 77 %, for the selected model).

As can be seen in Table 2, the DF percentage of correct in the test set was 77%, what means that 23 out of 30 active compounds were correctly classified, while 31 out of 45 (69%) were correct in the inactive group (see Table 2). Table 3 outlines the results of the prediction for every compound of the external test.

Regarding the interpretation of results, it is noteworthy the presence of the autocorrelation indices in the DF (4). Autocorrelation indices enable the representation of the molecule as a vector that can be compared with the vectors of other molecules at a given position or lag (lag = 1,2,3…). This singular molecular frame has a further desirable property; it is independent of the orientation of the molecule; in other words, it is a real topological index or graph invariant. This approach has also the advantage that there are many different ways to define a particular alignment of molecules (structure, shape, and function) although, contrary to other topological indices, they do not enable the interconnection between the vector values and the original molecules; that is, they are not bijective. In addition, autocorrelation indices allow weighting the graph vertices by different parameters such as electronegativity and atomic mass.

In our case, the indexes at (4), namely, ATS2e, MATS2v, MATS8e, and MATS1p, are weighted by polarizabilities, van der Waals volumes, and electronegativity. Moreover, the first and third indices (ATS2e and MATS8e) contribute positively to dyspneagenesis, while the second and forth (MATS2v and MATS1p) contribute negatively. This indicates that, roughly speaking, molecules containing big and highly polarizable heteroatoms (such as S and Cl) as, for example, Mitotane and Biotin (see Figure 3) would be in general less dypsneagenic, whereas molecules containing highly electronegative atoms, such as fluorine, would be more dypsneagenic, as, for instance, Capecitabine, Clofarabine, and Ticagrelor.

It can be noticed that molecules showing dyspnea as side effect usually have saturated allicyclic rings (such as Piperazine or Morpholine), as, for example, Irinotecan, Tobramycin, Dipyridamole, and Sildenafil (see Figure 4), which are isolated from each other or separated from aromatic rings if exist. On the contrary, the inactive molecules exhibit in general many aromatic rings and show high conjugation, as, for instance, Alclofenac and Mitotane (see Figures 3 and 4).

A very basic structural factor, the number of branches (vertices with valence 3 or 4 in the simple graph) play also a significant role. There seem to be a threshold of about 9 branches separating the active from the inactive compounds. Indeed, the low branching molecules (number of branches below 9) are, in more than 80% of cases, inactive, as, for instance, Amifostine, Estazolam, Nitric Oxide, and Nitroglycerin… (see Figure 5). On the contrary, highly branched compounds (number of branches above 9) may be either active or not, as for example, Tobramycin, Vinorelbine, and Atovaquone (see Figures 4 and 5). In other words, a branching threshold above 9 is a necessary but not a sufficient condition.

In summary, we find the following features regarding the active-inactive compounds, related to TIs in the model and visual analysis of the structures in the data set.(a)Above 80% of compounds not showing dypsnea as a side effect show less than 9 branches in their simple graphs. (b)The dypsneagenic molecules usually have saturated heterocyclic rings (such as piperazine or morpholine) while compounds not exhibiting dyspnea in general exhibit many aromatic rings with a high level of conjugation.(c)The presence of highly polarizable (MATS1p) and large Van der Waals volume atoms (such as S or Cl) (MATS2v) diminish the dyspnea effect.

One of the most interesting consequences of the QSAR analysis we have just described is that it can be applied as a filter to avoid selecting dypsnea-inducing drugs. A good way to proceed is to use the pharmacological distribution diagrams (PDDs). Figure 6 shows the PDD obtained for our data set. As can be seen from the diagram, those compounds with DF values between 0.5 and 5.5 are clearly dyspnea inducers, those between 0.5 and −7 are generally dyspnea not inducers. Finally, are those compounds located between −3 and −2.5 are uncertain (not classified). Of course, if we are trying to identify possible compounds showing dyspnea as side-effect, we must search in the DF ranges between −3 to −2.5 and 0.5 to 5.5. The drugs outside these ranges are not dyspnea inducers. As it is arranged, the model enables not missing any potential dypsneagenic; that is, in this case sensitivity was preferred over specificity, just to prevent the risk of dyspnea.

Receiver operating characteristic curve for the training set is shown in Figure 7 for (4). The area under the curve (AUC) is 0.9046, which accounts for the high accuracy of model.

4. Conclusions

The results outlined in this work clearly point toward the efficacy of molecular topology in the prediction of a very important drug side effect: the induction of dyspnea. As far as authors’ knowledge, this is the second time that molecular topology has been applied to the identification of drug side effects [29]. Furthermore, it is the first time that dyspnea, as a side effect, has been so accurately predicted by a mathematical model, what opens the pathway to the design of drugs free from this undesirable side effect.

Acknowledgment

The authors thank the Ministerio de Economía y Competitividad, Spain (project no. SAF2009-13059-C03-02) for support of this work. M. Gálvez-Llompart acknowledges the “V Segles” Fellowship provided by the University of Valencia to carry out this study.

Supplementary Materials

Annex I: Molecular descriptors for each compound used in the training set and conforming the discriminant function (Eq. 1).

Annex II: Molecular descriptors for each compound used in the test set and conforming the discriminant function (Eq. 1).

  1. Supplementary Tables