Abstract

Drug repositioning offers new clinical indications for old drugs. Recently, many computational approaches have been developed to repurpose marketed drugs for human diseases by mining various biological data, including disease expression profiles, pathways, drug phenotype expression profiles, and chemical structure data. However, despite encouraging results, a comprehensive and efficient computational drug repositioning approach that integrates the available resources at a high level is still needed. In this study, we propose a systematic framework that employs experimental genomic knowledge and pharmaceutical knowledge to reposition drugs for a specific disease. Specifically, we first obtain experimental genomic knowledge from disease gene expression profiles and pharmaceutical knowledge from drug phenotype expression profiles, and we construct a pathway-drug network representing a priori known associations between drugs and pathways. To discover promising candidates for drug repositioning, we initialize node labels for the pathway-drug network using identified disease pathways and known drugs associated with the phenotype of interest and perform network propagation in a semisupervised manner. To evaluate our method, we repositioned 1309 drugs based on four different breast cancer datasets and verified the promising candidate drugs for breast cancer with a two-step validation procedure. The experimental results show that the proposed framework is a useful approach for discovering promising candidates for breast cancer treatment.

1. Introduction

Developing and discovering a new drug is a very costly and time-consuming process, which can take 10–17 years at a cost of 1.3 billion dollars. Despite large investments in research and development each year, only a small number of new drugs are approved by the Food and Drug Administration (FDA) each year. Increasing failure rates, high costs, and the lengthy testing process for drug development have led to drug repositioning [1], which refers to identifying and developing new uses for existing drugs to reduce risk and cost.

Traditional drug repositioning methods primarily use information on chemical structure, side effects, and drug phenotypes and explore similar drugs based on the assumption that structurally similar drugs tend to share common indications [2–4]. In other words, the key idea behind these approaches is that molecularly similar drug structures often affect proteins and biological systems in similar ways [4]. For example, Swamidass [5] used chemical structure data to identify unexpected connections between a known drug and a disease and explored the hypothesis that if a drug has the same target as a known drug, then this new drug would also have activity against the disease. As another approach, Keiser et al. used 3665 US FDA-approved and investigational drugs that together had hundreds of targets, defining each target by its ligands. The chemical similarities between the drugs and ligand sets predicted thousands of unanticipated associations, which have been used to develop new indications for many drugs.

Alternatively, some approaches use a drug phenotype, which is the expression profile of patients undergoing treatment with a drug. For example, the Connectivity Map (CMap) [6, 7] project is exploring the effects of a large number of FDA-approved chemicals (1309 drugs) on gene expression; these effects are measured in four different cell lines, allowing researchers to analyze the different expression patterns of a drug’s target genes. Many computational approaches have been introduced to reposition drugs using CMap by analyzing drug-associated expression signatures to match a repositioned drug’s effect with a shared perturbed gene expression profile for another disease, under the assumption that drugs that share similar CMap expression signatures have similar therapeutic applications. Using the CMap data, Iorio et al. [8] developed a drug repositioning method by constructing a drug-drug similarity network using gene set enrichment analysis (GSEA) [9] to compute the similarity between pairs of drugs. Several studies [3, 10–13] showed that combining CMap expression profiles with other data sources, such as drug target databases, drug chemical structures, and drug side effects, improved on the existing drug target identification methods.

Moreover, the rapid developments in genomics and high-throughput technologies have produced a large volume of disease gene expression profiles, protein-protein interactions, and pathways. The high-level integration of these resources using network-based approaches is reported to have great potential for discovering novel drug indications for existing drugs [14]. For example, Chen et al. [15] introduced two different inference methods for predicting drug-disease associations based on basic network topology using a bipartite graph constructed from DrugBank [16] and Online Mendelian Inheritance in Man (OMIM) [17]. Emig et al. [18] integrated gene expression profiles, drug targets, disease information, and interactions for drug repositioning. Hu and Agarwal [19] created a disease-drug network using disease microarray datasets and predicted new indications for existing drugs using their disease-drug network.

Although many of the above methods have shown encouraging results for finding new indications for old drugs, some limitations remain. For example, Yildirim et al. [20] concluded that most drugs with distinct chemical structures target the same proteins, and Keiser et al. [21] reported that structurally similar drugs may also target proteins with dissimilar functions, which indicates that chemical structure alone is insufficient for successful drug repositioning [22]. In addition, care should be taken when using only a drug phenotype (drug-treated) expression profile such as CMap for drug repositioning, because some of the genes or pathways that show statistically significant expression differences in cell lines treated with the drug may be differentially expressed only because of the drug’s side effects or toxicity. Furthermore, the genes expressed in the drug-treated profiles for a specific disease cell line or tissue represent only a small subset of the biological pathways, whereas the cooperation of genes plays an important role in complex diseases such as cancer. Pathway-based drug repositioning may therefore be a better alternative for specific diseases such as cancer.

To overcome the above limitations, a comprehensive and efficient computational drug repositioning approach is required that combines powerful machine learning methods with the high-level integration of available data, such as disease gene expression profiles (disease profiles), drug-treated expression profiles (drug phenotype profiles), and drug databases (STITCH [23], DrugBank [16], and the Therapeutic Target Database (TTD) [24]), to discover new drugs for human diseases. In this study, we aim to develop a systematic computational framework that repositions drugs by employing disease profiles and drug phenotype profiles on the drug network along with integrated omics data.

2. Materials and Methods

In the framework shown in Figure 1, we first identify disease-specific pathways through an integrative analysis of multiple disease gene expression profiles and construct a pathway-drug network using pathway-drug associations derived from the CMap drug phenotype profiles. Then, to discover promising candidates for drug repositioning, we initialize node labels for the pathway-drug network using the identified disease pathways and known drugs associated with breast cancer and perform network propagation in a semisupervised manner.

The following subsections describe the proposed repositioning framework and the evaluation method in detail.

2.1. Finding Disease-Specific Pathways from Multiple Disease Expression Profiles

To identify disease pathways related to a specific disease, conventional approaches have usually focused on identifying enriched pathways between cases and controls using data from a single experiment. However, when using real experimental data such as microarray gene expression data, different studies may report different disease-specific pathways. That is, the results are often not reproducible or not robust even to the mildest data perturbation, so the integrated analysis of multiple existing studies can increase the reliability and generalizability of results [25]. To address these issues, our approach identifies disease-specific pathways by integrating the disease pathway enrichment results from multiple gene expression profiles for a given phenotype. Each disease expression profile is preprocessed, and the pathways that show significant differences between case and control samples are identified by GSEA [9], which returns the enrichment score (ES) and nominal $p$ value for each pathway. These scores are used for comparison analysis across pathways to detect significant pathways.

Here, we considered that integrating the pathways significantly enriched in each expression profile could better represent the “disease-specific pathways” for the phenotype of interest. For the integration, the pathways with a nominal $p$ value less than 0.01 ($p < 0.01$) are selected as significant pathways for each expression profile, and their union is defined as the “disease-specific pathways.” Figure 2 presents an illustration of the integration process.
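The integration step can be illustrated with a minimal Python sketch. It assumes that the GSEA output of each expression profile has already been reduced to a mapping from pathway name to nominal p value; the function names, pathway names, and p values below are hypothetical and serve only as an illustration.

```python
def significant_pathways(p_values, threshold=0.01):
    """Pathways whose nominal p value falls below the threshold."""
    return {name for name, p in p_values.items() if p < threshold}


def disease_specific_pathways(profiles, threshold=0.01):
    """Union of the significant pathways over all expression profiles."""
    selected = set()
    for p_values in profiles:
        selected |= significant_pathways(p_values, threshold)
    return selected


# Toy GSEA results for two expression profiles (hypothetical values).
profiles = [
    {"PATHWAY_A": 0.002, "PATHWAY_B": 0.200, "PATHWAY_C": 0.040},
    {"PATHWAY_A": 0.030, "PATHWAY_B": 0.004, "PATHWAY_C": 0.500},
]
disease_pathways = disease_specific_pathways(profiles)
print(sorted(disease_pathways))
```

Because the union is taken, a pathway needs to be significant in only one profile to be retained; PATHWAY_C is dropped here because it is not significant in either profile.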

2.2. Deriving Pathway-Drug Associations from CMap Drug Phenotype Profiles

To define a pathway-drug association, pathway-drug enrichment is established from the drug phenotype expression profile (CMap: Connectivity Map) [6, 7], which contains the gene expression profiles obtained from five different cancer cell lines treated with 1309 (v2) small drug molecules, most of which are FDA-approved drugs, for a total of 6100 data points representing gene expression results with control vehicle samples. The CMap data are preprocessed, batch effects are removed, and pathway enrichments are estimated by GSEA as in previous studies [11, 26, 27]. As a result, each of the 1077 pathways has an ES for each of the 1309 drug molecules. The magnitude of the ES indicates the degree of association between a pathway and a drug. As shown in Figure 3, the pathway-drug association can be represented as a 1077 × 1309 matrix, where the columns list the drugs and the rows list the pathways.
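As a rough illustration of how such a pathway-by-drug matrix can be filled, the sketch below uses an unweighted running-sum enrichment score. This is a simplification and an assumption on our part: GSEA itself uses a weighted statistic with permutation-based significance, and the gene names, drug names, and helper functions here are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set in a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


def association_matrix(pathways, drug_rankings):
    """Rows are pathways, columns are drugs, entries are enrichment scores."""
    return [
        [enrichment_score(ranking, genes) for ranking in drug_rankings.values()]
        for genes in pathways.values()
    ]


# Toy input: two pathways and two drug-induced gene rankings (hypothetical).
pathways = {"PATHWAY_A": {"g1", "g2"}, "PATHWAY_B": {"g5", "g6"}}
drug_rankings = {
    "drug_X": ["g1", "g2", "g3", "g4", "g5", "g6"],
    "drug_Y": ["g6", "g5", "g4", "g3", "g2", "g1"],
}
matrix = association_matrix(pathways, drug_rankings)
```

In this toy case, a pathway whose genes sit at the top of a drug's ranking receives a positive score and one whose genes sit at the bottom receives a negative score, which mirrors the row and column layout of the matrix in Figure 3.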

2.3. Pathway-Drug Network Construction

A pathway-drug network was established from the pathway-drug association profile. Using the pathway-drug enrichment matrix (Figure 3), we constructed a bipartite pathway-drug graph whose vertices are divided into two disjoint sets, $U$ (pathways) and $V$ (drugs), such that every edge $(u_i, v_j)$ with weight $w_{ij}$ represents the enrichment of pathway $u_i$ by drug $v_j$. In other words, each node in the network corresponds to a drug or a pathway, and each edge corresponds to the association between them. It can be observed that drugs tend to bind with disease-specific pathways. All nodes were initially unlabeled (label 0). Semisupervised learning on a network requires a small amount of labeled data together with a large amount of unlabeled data.

To use the constructed bipartite graph for drug repositioning, we made the following assumption, as in [4]: if pharmacologically different drugs induce the same phenotype of interest, then most of the molecular pathways they target must be shared. In other words, drugs used to treat the same disease (phenotype) target similar pathways. For example, if we have prior knowledge of certain drugs that are used to treat a specific disease, then most of the molecular pathways they target should be similar. In Figure 4, the blue drugs (breast cancer treatment drugs) target pathway “B,” and the green drugs (prostate cancer treatment drugs) target pathway “D.” From this information, it can be concluded that drug “K” can likely be used to treat prostate cancer when the weight (ES) is high enough. This is the main assumption of our proposed framework for pathway-based drug repositioning. Defining the initial knowledge (the initial labels for nodes) is also one of the key steps in this work.

2.4. Label Initialization on a Pathway-Drug Network

To initialize the labels for the disjoint sets $U$ (pathways) and $V$ (drugs), we used the disease-specific pathways inferred from the multiple gene expression profiles and the known treatment drugs for the given phenotype (breast cancer), which were obtained from three different public resources: the Mayo Clinic, Cancer Organization, and TTD. The identified disease-specific pathways were mapped to the $U$ (pathways) set and labeled as 1, and the remaining pathways were labeled as 0.

For the $V$ (drugs) set, a more accurate prediction is possible if the labels are set using previously known information about the disease-related drugs before network propagation is used to predict drugs associated with the disease. Therefore, we first verified the known drugs used for the treatment of the disease of interest using public drug-related sources, including the Mayo Clinic database, Cancer Organization database, and TTD, and then determined the labels for the drug set in the pathway-drug network. These drugs were mapped to the $V$ (drugs) set and labeled as 1, and the remaining drugs were labeled as 0.
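The label initialization for both node sets can be sketched as follows. The node names are hypothetical, and the case-insensitive name matching is an assumption made for illustration; the text does not specify how drug names from the public resources were matched to the network.

```python
def initial_labels(nodes, positives):
    """Return 1.0 for nodes in the positive set and 0.0 for all others."""
    positives = {name.lower() for name in positives}  # assumed matching rule
    return [1.0 if node.lower() in positives else 0.0 for node in nodes]


# Toy network nodes (hypothetical names, not the study's data).
pathway_nodes = ["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"]
drug_nodes = ["drug_X", "drug_Y", "drug_Z"]

disease_pathways = {"PATHWAY_A", "PATHWAY_C"}
# Known drugs pooled from several sources; some are absent from the network.
known_drugs = {"Drug_X", "drug_W"}

y_u = initial_labels(pathway_nodes, disease_pathways)  # pathway labels
y_v = initial_labels(drug_nodes, known_drugs)          # drug labels
mapped = int(sum(y_v))  # number of known drugs found in the network
```

Known drugs that do not occur among the network's drug nodes simply contribute no label, so the number of mapped drugs can be smaller than the number of drugs collected from the sources.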

2.5. Drug Repositioning by Semisupervised Learning

Once the initial labeling of the pathway-drug network was completed, we predicted the repositioned drugs by learning the labels of the drug nodes and pathway nodes with the network propagation algorithm. The bipartite graph can be defined as $G = (U, V, E, W)$, where $U$ and $V$ are the disjoint node sets, whose nodes are expressed as $u_i$ and $v_j$, respectively. $E$ is the set of edges between $U$ and $V$, and $W$ represents the weights of these edges. The weight of a specific edge $(u_i, v_j)$ is expressed as $w_{ij}$. The function for the sum of all weight values for a node can be defined as

$$d(u_i) = \sum_{j} w_{ij}, \qquad d(v_j) = \sum_{i} w_{ij}.$$

Now, let us examine the network propagation algorithm based on the bipartite graph defined above. First, the network propagation algorithm normalizes the weights of the bipartite graph using the following formula:

$$S = D_U^{-1/2}\, W\, D_V^{-1/2}. \tag{1}$$

Here, $W$ is a matrix containing the weights of the bipartite graph, $D_U$ and $D_V$ are the diagonal matrices with the values of $d(u_i)$ and $d(v_j)$, respectively, and $S$ is the matrix of the normalized weights with entries $s_{ij}$. Second, network propagation is performed for the bipartite graph using formulae (2) and (3), iterating over the objective function of the graph-based semisupervised learning algorithm.

For each $u_i \in U$,

$$f_t(u_i) = \alpha \sum_{j} s_{ij}\, f_{t-1}(v_j) + (1 - \alpha)\, y(u_i). \tag{2}$$

For each $v_j \in V$,

$$f_t(v_j) = \alpha \sum_{i} s_{ij}\, f_{t-1}(u_i) + (1 - \alpha)\, y(v_j). \tag{3}$$

Here, $t$ is the number of iterations and $y$ is the initial label of the corresponding node. The parameter $\alpha$ has a value between 0 and 1 and acts to regulate the relative weight of the initial label and the learned label. $y_V$ and $y_U$ are the initial labels for the drugs and pathways, respectively, whereas $f_V$ and $f_U$ are the final label scores. Finally, network propagation is completed when the values of $f_V$ and $f_U$ converge.
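A minimal pure-Python sketch of bipartite network propagation is given below. It is not the authors' implementation: it assumes non-negative edge weights, a symmetric degree normalization of the weight matrix, and an update in which each node's score is a mix of its neighbors' normalized scores (weight alpha) and its initial label (weight 1 - alpha). The value alpha = 0.5, the tolerance, and the toy matrix are hypothetical.

```python
import math


def normalize(W):
    """Symmetric degree normalization; assumes non-negative weights."""
    row_sums = [sum(row) for row in W]        # weighted degree of each pathway
    col_sums = [sum(col) for col in zip(*W)]  # weighted degree of each drug
    return [
        [w / math.sqrt(row_sums[i] * col_sums[j]) if w > 0 else 0.0
         for j, w in enumerate(row)]
        for i, row in enumerate(W)
    ]


def propagate(W, y_u, y_v, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate the label updates until the scores stop changing."""
    S = normalize(W)
    n_u, n_v = len(y_u), len(y_v)
    f_u, f_v = list(y_u), list(y_v)
    for _ in range(max_iter):
        new_u = [alpha * sum(S[i][j] * f_v[j] for j in range(n_v))
                 + (1 - alpha) * y_u[i] for i in range(n_u)]
        new_v = [alpha * sum(S[i][j] * f_u[i] for i in range(n_u))
                 + (1 - alpha) * y_v[j] for j in range(n_v)]
        delta = max(abs(a - b) for a, b in zip(new_u + new_v, f_u + f_v))
        f_u, f_v = new_u, new_v
        if delta < tol:
            break
    return f_u, f_v


# Toy network: 2 pathways x 3 drugs (hypothetical weights and labels).
W = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
y_u = [1.0, 0.0]       # pathway 0 is disease-specific
y_v = [1.0, 0.0, 0.0]  # drug 0 is a known treatment
f_u, f_v = propagate(W, y_u, y_v)
```

In this toy example the unlabeled drug that shares the labeled pathway with the known drug ends up with a higher score than the drug connected only to the unlabeled pathway, which is the behavior the framework relies on when ranking candidates.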

If the network propagation algorithm is executed over the pathway-drug bipartite graph according to the above method, the learned drug label scores can be obtained. As the label score of a drug increases, the drug can be considered a more promising candidate for drug repositioning for the given phenotype. Therefore, we define the values of the final drug label scores as the drug repositioning scores and use them to predict disease-associated drugs from the pathway-drug network. In addition, all obtained label scores are normalized by the $z$-score using the following equation:

$$z(v_j) = \frac{f(v_j) - \mu(f_V)}{\sigma(f_V)},$$

where $f_V$ is the label score vector for all drugs, $f(v_j)$ is the final label score for drug $v_j$, and $\mu$ and $\sigma$ denote the mean and standard deviation. For each drug, the corresponding $p$ value was estimated from the $z$-score under a Gaussian distribution. For more conservative results, we chose drugs whose $p$ values fell below the significance cutoff as promising drug candidates for drug repositioning for the given disease. The selected promising drug candidates are evaluated by our validation methods and chosen for further investigation.
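The score normalization and candidate selection can be sketched as below. Several details are assumptions rather than statements about the original analysis: the population standard deviation is used, the p value is taken as the one-sided upper tail of a standard Gaussian, and the cutoff is left as a required argument because its value is not fixed here. The drug names and scores are hypothetical.

```python
import math
import statistics


def z_scores(scores):
    """Standardize scores with the mean and population standard deviation."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


def upper_tail_p(z):
    """One-sided p value under a standard Gaussian: P(Z >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def select_candidates(names, scores, p_cutoff):
    """Keep the drugs whose upper-tail p value is below the cutoff."""
    return [name for name, z in zip(names, z_scores(scores))
            if upper_tail_p(z) < p_cutoff]


# Toy repositioning scores (hypothetical).
names = ["drug_X", "drug_Y", "drug_Z", "drug_W", "drug_V"]
scores = [0.90, 0.10, 0.12, 0.11, 0.09]
candidates = select_candidates(names, scores, p_cutoff=0.05)
```

Only drugs whose scores lie far above the mean of all drug scores pass the cutoff, so lowering `p_cutoff` makes the selection more conservative.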

3. Results and Discussion

We tested our proposed framework to reposition 1309 drugs for breast cancer.

3.1. Finding Disease-Specific Pathways in Breast Cancer

To obtain breast cancer-specific pathways, we used publicly available breast cancer expression profiles (GSE15852 [28], GSE20437 [29], GSE2043 [30], and GSE2990 [31]) from the Gene Expression Omnibus (GEO) [32]. Table 1 shows the detailed characteristics of the expression profiles used in our study. Each dataset was preprocessed using the RMA technique [33], implemented in R with the Bioconductor package, which includes a large number of metadata packages appropriate for different types of microarrays. Supplementary Figure 1, in the Supplementary Material available online at http://dx.doi.org/10.1155/2016/7147039, shows the results of preprocessing. For each dataset, the corresponding annotation databases were downloaded separately, and each probe was mapped to a HUGO [34] gene symbol; a probe was discarded if it did not match any symbol. In addition, if a gene had multiple probes (many-to-one), the gene expression values were averaged over the probes.
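The probe-to-gene step described above (discard unmapped probes, average probes that map to the same gene) can be illustrated with a short Python sketch. The study itself used R and Bioconductor; the function name, probe identifiers, and expression values below are hypothetical.

```python
from collections import defaultdict


def collapse_probes(expression, probe_to_gene):
    """Average probe-level rows into gene-level rows; drop unmapped probes."""
    grouped = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene:  # probes without a gene symbol are discarded
            grouped[gene].append(values)
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Toy data: expression values of four probes in two samples (hypothetical).
expression = {
    "probe_1": [2.0, 4.0],
    "probe_2": [4.0, 8.0],
    "probe_3": [1.0, 1.5],
    "probe_4": [9.0, 9.0],  # has no gene symbol and is dropped
}
probe_to_gene = {"probe_1": "TP53", "probe_2": "TP53", "probe_3": "BRCA1"}
gene_expression = collapse_probes(expression, probe_to_gene)
```

Averaging is done per sample, so each gene keeps one value for every sample in the dataset.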

The human metabolic and signaling pathways were obtained from the Molecular Signatures Database (MSigDB) [35]. As shown in Table 2, we chose the canonical pathways in the curated gene sets, which comprise 1077 pathways collected from KEGG [36], Reactome [37], and BioCarta (http://www.biocarta.com/).

For each dataset, a pathway was defined as enriched in breast cancer by GSEA when its nominal $p$ value was less than 0.01 ($p < 0.01$). The enriched pathways selected in this way were taken as the significant pathways for each expression profile, and their union was defined as the “disease-specific pathways.” Table 3 shows the number of enriched pathways for each dataset and the integrated pathways obtained by taking their union. Table 4 shows an example of the enriched pathways in breast cancer obtained from one experimental dataset (GSE2990). In the Supplementary Material, Tables 1–4 provide the GSEA analysis results for each cancer expression profile and list the identified disease-specific pathways that were used for label initialization on the pathway-drug network.

3.2. Breast Cancer Drug Repositioning Using the Proposed Approach

From the four different breast cancer expression profiles, 143 pathways were identified as significantly enriched. On the pathway-drug network, these pathways were mapped to the $U$ (pathways) set and initially labeled as 1, and the remaining 934 pathways were labeled as 0. In addition, known drugs used for the treatment of breast cancer were obtained from three different public resources: the Mayo Clinic, Cancer Organization, and TTD. Sixty-one drugs approved to treat breast cancer were obtained from the Mayo Clinic, 49 drugs from the Cancer Organization, and 11 drugs from TTD. Of these, only 10 drugs could be mapped to the pathway-drug network. These 10 mapped drugs (tamoxifen, letrozole, doxorubicin, vinblastine, exemestane, aminoglutethimide, methotrexate, paclitaxel, megestrol, and fulvestrant) were labeled as 1 on $V$ (drugs), whereas all remaining drugs (1299) were labeled as 0.

Once the initial labels of the pathway-drug network were chosen, we predicted promising candidates related to breast cancer using semisupervised network propagation, as shown in Figure 5. As a result, we considered the 17 drugs that met the $p$ value criterion, as shown in Table 5, and found that 10 of them are already known breast cancer drugs. The remaining seven drugs were considered promising drug candidates for breast cancer and were used for further validation to examine their association with breast cancer.

3.3. Validation of Promising Candidate Drugs

To validate the predicted drugs, we used two different methods. Drugs that were successfully validated by both methods are considered confirmed for repositioning for breast cancer.

3.3.1. Biological Validation

Biological validation was performed by manually checking the evidence in the biological literature on the promising drug candidates. We manually searched for any possible indication of the repositioned drugs for breast cancer. As shown in Table 6, for each promising drug candidate, several different lines of evidence were found in the literature indicating its possible use for breast cancer. Based on these results, we concluded that six of the seven drugs were confirmed by biological validation for their new usage in breast cancer treatment; phenoxybenzamine was not confirmed.

3.3.2. Computational Evaluation on the Validation Network

In drug repositioning, it is difficult to compare and evaluate the performances of computational methods. To address this issue, several recent studies have focused on curating a comprehensive and public catalog of existing drug indications using a manual process [4].

Therefore, to enable a better computational evaluation, a validation network was constructed using information on three different relationships (drug-drug, drug-gene, and gene-gene) from the STITCH and STRING databases [38]. The drug-drug relationship information was obtained from the STITCH (v4) [39] database, which contains data on the interactions between small molecules; the edge between two chemicals is expressed as a score between 0 and 900 defined from the chemical similarity between the drugs. The drug-gene network was constructed from the STITCH protein-chemical interactions for human with the help of the STRING database, which provides 4,523,609 relationships for humans, with the correlations between proteins and chemicals recorded as scores using information obtained from experimental results, text mining, or predicted correlations. The gene-gene network was constructed from the STRING database, where a PPI network can be described as a complex system of proteins linked by interactions. Two proteins or genes that physically interact are represented as adjacent nodes connected by an edge. Each protein ID (UniProt ID) was converted to the corresponding gene symbol using the annotation databases provided with the STRING protein-protein interaction database. For computational evaluation, we selected a maximum of 40 neighbors of the drugs (17 drugs), subject to an edge-weight criterion, from the validation network derived from STITCH. The constructed validation network is illustrated in Figure 6.

To investigate the node properties in a network, network topology measurements (degree centrality and betweenness) and link analysis (PageRank) are often used. Degree centrality represents the number of interactions (edges) of a node. Biological networks are mostly scale-free networks, in which most nodes have few edges and a small number of nodes (hubs) have a very high degree centrality. Betweenness is measured from the shortest paths between all pairs of nodes in the network, and nodes through which many shortest paths pass are called bottlenecks. These hub and bottleneck nodes are topologically important and are usually functionally essential (genes and drugs that have significant biological roles). Nodes directly connected to hub and bottleneck nodes can also be functionally important. In addition, link analysis is a technique used to evaluate relationships (connection weights). PageRank is a popular link analysis algorithm based on the idea that a node should be significant if other significant nodes link to it.
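To make two of these measures concrete, the following sketch computes normalized degree centrality and a weighted PageRank by power iteration on a small undirected graph. It is an illustration under stated assumptions, not the tooling used in the study: the adjacency structure (a dict of neighbor-to-weight dicts), the node names, and the convergence settings are hypothetical, degree centrality is normalized by n - 1, and betweenness is omitted for brevity.

```python
def degree_centrality(adj):
    """Fraction of the other nodes that each node is connected to."""
    n = len(adj)
    return {node: len(neighbors) / (n - 1) for node, neighbors in adj.items()}


def pagerank(adj, alpha=0.85, tol=1e-10, max_iter=500):
    """Weighted PageRank by power iteration on a symmetric adjacency dict."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    strength = {v: sum(adj[v].values()) for v in nodes}
    for _ in range(max_iter):
        # Rank held by isolated nodes is spread uniformly over all nodes.
        dangling = sum(rank[v] for v in nodes if strength[v] == 0)
        new = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            for u, w in adj[v].items():
                new[u] += alpha * rank[v] * w / strength[v]
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Toy star-shaped network: one hub connected to three leaves (hypothetical).
adj = {
    "hub": {"a": 1.0, "b": 1.0, "c": 1.0},
    "a": {"hub": 1.0},
    "b": {"hub": 1.0},
    "c": {"hub": 1.0},
}
centrality = degree_centrality(adj)
rank = pagerank(adj)
```

In the star network the hub has the maximum degree centrality and the highest PageRank, because every leaf passes its entire rank to it. With unequal edge weights, a node's rank is instead split among its neighbors in proportion to the weights.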

By answering the following biological questions for the promising drug candidates, we identified the most promising drugs among them.

(i) Which candidate drug has an interesting/important relationship (connections) with known drugs?
(ii) Which candidate drug has the hub/bottleneck property on the validation network?
(iii) Which candidate drugs are connected to known breast cancer target genes?

For this purpose, we checked the network properties of the promising drug candidates on the validation network using degree centrality, betweenness, and PageRank. Among these, the network topology measurements (degree centrality and betweenness) are designed to produce a ranking that indicates the most important vertices; they are not designed to measure the influence of neighbor nodes in general. Therefore, for validating promising candidates on the validation network, the PageRank algorithm appears preferable because it evaluates nodes by considering their connection weights to influential neighbor nodes.

From the results shown in Table 7, the popular breast cancer drug “tamoxifen” was identified as the most important hub node, with a degree centrality of 0.661 on the validation network. Among the promising drug candidates, camptothecin showed the hub node property, with a higher degree centrality (0.232) than the other five (MS-275, GW-8510, phenoxybenzamine, tyrphostin_AG-825, and alsterpaullone). Table 8 shows the neighbor nodes of camptothecin on the validation network, where it has a strong chemical similarity with the known drugs doxorubicin, paclitaxel, vinblastine, and methotrexate. A close look at this relationship is shown in Figure 7(a), and this evidence points to the possibility of using camptothecin for breast cancer treatment because structurally similar drugs usually bind the same disease targets. In addition, from Table 8 and Figures 7(a) and 7(b), it can be seen that camptothecin has strong target relations with genes that play an active role in breast cancer, including TOP1, ABCB1, TOP2A, CASP3, and TP53 (neighbors) and EGFR (second-degree neighbor). TOP1 and TOP2A were reported to inhibit the breast cancer resistant proteins [40]. ABCB1 is known as a prognostic factor in breast cancer patients [41]. Loss of CASP3 expression represents an important cell survival mechanism in breast cancer patients [42], and CASP3 inhibits the growth of breast cancer cells. EGFR was one of the first identified important targets in breast cancer, and half of breast cancer cases overexpress EGFR.

The candidate drugs MS-275 and alsterpaullone showed relatively higher degree centrality values than the remaining drugs. Table 9 and Figure 8 show the neighbor node relationships of MS-275 on the validation network, where it has strong target relationships with the genes HDAC1, TP53, CASP3, CCND1, and CYP3A4. Overexpression of HDAC1 is a clinicopathological indicator of disease progression in human breast cancer [43]. CCND1 was reported to be a therapeutic target in breast cancer [44], and it has an indirect relationship with the breast cancer susceptibility gene BRCA1. The betweenness results are summarized in Table 10. Among the promising drug candidates, only camptothecin and MS-275 showed some bottleneck node properties. Tamoxifen was identified as the most important bottleneck drug for breast cancer. Finally, we evaluated the connection weights of the candidate drugs on the validation network using the PageRank algorithm. We set the alpha parameter to 0.85, which is the most commonly used value for this parameter in the original Google PageRank algorithm. As shown in Table 11, camptothecin (0.257), alsterpaullone (0.102), and MS-275 (0.088) exhibited higher ranking scores than the other promising candidate drugs.

From the evidence shown above, we concluded that camptothecin, MS-275, and alsterpaullone exhibited the strongest network property evidence for breast cancer on the validation network. In general, all of the promising candidates passed the computational evaluation on the network.

After performing biological and computational evaluations of the promising candidate drugs, we selected camptothecin as the most promising candidate because it was the most successful in both evaluation processes. For MS-275, GW-8510, AG-825, alsterpaullone, and celastrol, there was strong literature evidence together with reasonable network properties. Thus, as shown in Figure 9, camptothecin, MS-275, alsterpaullone, GW-8510, and AG-825 were validated as repositioned drugs and indicated for further investigation in breast cancer treatment.

4. Summary

We introduced a new systematic framework for disease-specific drug repositioning from integrated gene expression profiles on a pathway-drug network constructed from drug phenotype expression profiles (CMap) using semisupervised learning. The proposed pathway-based drug repositioning process showed encouraging results when using four different disease expression profiles to predict candidate drugs for disease-specific repositioning.

Two different methods were employed to evaluate the repositioned drugs. The drugs that passed both evaluation methods were considered the most promising drugs to target breast cancer. As a result, several drugs, including camptothecin, MS-275, alsterpaullone, GW-8510, AG-825, and celastrol, were identified as possible drugs to be repositioned to treat breast cancer, and these results are supported by multiple lines of evidence in the public literature. Specifically, camptothecin was the most promising drug candidate because it showed strong network properties on the validation network and was supported by evidence in the literature.

Despite the interesting results, our method for drug repositioning was developed and validated using only integrated mRNA gene expression profiles. However, the strategy can easily be extended to include other experimental data types, such as RNA-seq, miRNA, DNA methylation, and single nucleotide polymorphism (SNP) information. Finally, the increasing number of genomic and pharmaceutical databases necessitates the further development of the method to identify new drugs and targets for rare cancer subtypes, develop personalized medicine, and design targeted cancer therapies.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This study was supported by the BK21 Plus project funded by the Ministry of Education, Korea (21A20131600011).

Supplementary Materials

Supplementary file 1 contains Figures 1 and 2, which show the preprocessing results and enrichment heatmaps for each dataset.

Supplementary files 2, 3, 4, and 5 contain the enrichment analysis results for each dataset in tab-delimited format.

  1. Supplementary File 1
  2. Supplementary File 2
  3. Supplementary File 3
  4. Supplementary File 4
  5. Supplementary File 5