Special Issue: Computational Data Mining in Cancer Bioinformatics and Cancer Epidemiology
Integrating Diverse Information to Gain More Insight into Microarray Analysis
Microarray technology provides an opportunity to view transcription at the genomic level under different conditions controlled by an experiment. From an array experiment using a human cancer cell line that is engineered to differ in expression of a tumor antigen, integrin α6β4, a few hundred differentially expressed genes are selected and clustered using one of several standard algorithms. The genes in a cluster are expected to have similar expression patterns, are most likely coregulated, and are thereby expected to have similar function. The highly expressed upregulated genes become candidates for further evaluation as potential biomarkers. Despite these benefits, a microarray experiment by itself does not help us to understand or discover potential pathways or to identify important sets of genes as potential drug targets. In this paper we discuss integrating protein-to-protein interaction information and pathway information with the array expression data set to identify a set of “important” genes and potential signal transduction networks that help to target and reverse the oncogenic phenotype induced by a tumor antigen such as integrin α6β4. We illustrate the proposed method with our recent microarray experiment conducted for identifying transcriptional targets of integrin α6β4 for cancer progression.
A microarray experiment is conducted to study expression profiles of genes in a specimen under different experimental conditions or over several time periods. It serves many purposes, including (1) developing a predictive computational model that can be used to predict biomarkers and targets for cancer therapy, (2) gaining insight into gene regulation when a microarray experiment is conducted at different time points, (3) gaining insight into the genes that may be involved in a situation or disease under investigation, (4) understanding or refining protein-to-protein interaction networks, and (5) annotating uncharacterized genes. In a recent review article on the applications of microarrays, Troyanskaya provides some details on items 2, 4, and 5. Statistical tests are first conducted to filter valid signals, and then a subset of genes called differentially expressed genes is selected based on the relative strength or weakness of their expression levels with respect to their reference expression values. The differentially expressed probes, which roughly correspond to genes, are reduced to a few hundred, while the total number of probes in an experiment is on the order of 20 to 50 thousand.
The highly expressed genes are considered to be candidates for biomarkers in a microarray experiment. It is quite difficult to single out the best biomarkers by viewing expression level alone, partly due to noise or to some association by “guilt.” By integrating microarray expression data with other information pertaining to protein behavior, we can improve the quality of decisions on biomarkers, as has been proposed by Camargo and Azuaje. Similarly, we can gain better insight into gene regulation by associating gene expression with the protein interaction network and with known cancer-related pathways.
A significant volume of work has been done that relates or combines microarray data sets and protein-to-protein interaction networks. Based on the expected outcome, these works may be categorized into (1) annotating uncharacterized genes, (2) refining protein-to-protein interaction networks, (3) predicting protein-to-protein interactions, and (4) refining potential biomarkers from array expression. Integrating protein interaction network information with expression data sets, along with other information pertaining to a gene, has been used [3–7] for annotating uncharacterized proteins. In the recent work of Nariai et al., a probabilistic approach has been used to integrate protein-to-protein interaction, array expression, protein motif, gene knockout phenotype, and protein localization data for predicting the function of uncharacterized genes.
Microarray expression data have also been used for refining protein-to-protein interaction networks. Zhu et al. have used coexpressed genes from a microarray data set to filter the neighbors of a protein in an interaction network to enhance the degree of functional consensus among the neighbors.
Array expression data sets are also used for predicting protein-to-protein interactions [9, 10]. Recently, Soong et al. have used microarray expression to predict protein-to-protein interactions. A pair of proteins is represented by a feature vector consisting of a concatenation of the expression profiles of those proteins along with the Pearson correlation of the two expression profiles. They have demonstrated the predictive power of a support vector machine using protein-to-protein interactions of yeast from DIP and 349 yeast microarray expression data sets from GEO.
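The pair representation described above can be sketched as follows. This is only an illustration of the feature construction (two profiles concatenated, followed by their Pearson correlation), not the implementation of Soong et al.; the function names `pearson` and `pair_features`, and the convention of returning 0.0 for a constant profile, are assumptions made here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumed convention: a constant profile has no defined correlation.
        return 0.0
    return cov / sqrt(var_x * var_y)

def pair_features(profile_a, profile_b):
    """Concatenate both profiles and append their Pearson correlation."""
    return list(profile_a) + list(profile_b) + [pearson(profile_a, profile_b)]
```

Such a vector could then be handed to any binary classifier; the classifier itself is outside this sketch.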
Camargo et al. have integrated array expression data with interaction data for refining potential biomarkers. Their work overlaps somewhat with our current approach in selecting hub nodes from the interaction network and combining them with array expression data sets. Their focus, however, was only on refining the biomarkers derived from array expression, as opposed to providing insight into potential signal transduction pathways or any other intermediate activities that are not revealed in an array expression.
We take a different approach that complements the strengths of interaction data sets and array expression data sets. The array data sets capture the expression levels at different experimental conditions (or time points), while the information on interaction networks represents experimentally determined as well as predicted interactions between pairs of proteins in a two-dimensional space, without paying attention to the context, the temporal relations, or the process. By bringing these two different modalities of information together, we believe we can discover some important genes that may have played important roles in the final observation of the array expression.
Suppose we consider the binary case of studying the expression pattern of a cell line of healthy and sick subjects. Examining the differentially expressed genes provides information on which genes are up- or downregulated, and on their expression levels. This information alone does not provide insight for deciding on an interesting set of genes that are either taking part in the progression or are the cause of the disease under consideration. We will show how to integrate gene expression patterns with protein-to-protein interactions and with known genes in disease pathways to gain insight into a small subset of interesting genes relevant to the disease under investigation.
To illustrate and apply the idea of integrating microarray data with the protein-to-protein interaction network and disease-related pathways, we use our recent microarray study for identifying transcriptional targets of integrin α6β4 for cancer progression. Jun Chung and his associates have used the Affymetrix HG-U133A_2 array to identify transcriptional targets of integrin α6β4. The goal of the study is to identify transcriptional targets important for breast cancer progression. The α6β4 integrin, an epithelial-specific integrin, functions as a receptor for the members of the laminin family of extracellular matrix proteins [13, 14]. While the primary known function of α6β4 is to contribute to tissue integrity through its ability to mediate the formation of hemidesmosomes (HDs), there is growing evidence suggesting that this integrin also plays a pivotal role in functions associated with cancer progression [13, 14]. For example, high expression of this integrin in women with breast cancer has been shown to correlate significantly with mortality and disease states [13, 14]. However, therapeutic targets of breast cancers that overexpress α6β4 are not yet well characterized. For this reason, it is essential to elucidate the mechanism by which α6β4 contributes to breast cancer progression.
2. Materials and Methods
We focus on genes of Homo sapiens and their expression for this experiment. From the Affymetrix site at http://www.affymetrix.com/, we have downloaded the annotations (HG-U133A_2.na22.annot) for the genes that are tested in the microarray experiment.
The gene expression data are from our recent microarray experiment using the Affymetrix HG-U133A_2 array to identify transcriptional targets of integrin α6β4. Our study here describes the gene expression profile obtained from MDA-MB-435 mock transfectants (α6β4-negative human cancer cell line) and MDA-MB-435 β4 integrin transfectants (α6β4-positive human cancer cell line). Out of oligonucleotide probe sets representing approximately 22 277 genes, expression of β4 integrin in MDA-MB-435 cells up regulated 149 genes by twofold or higher, and 193 genes were down regulated by more than twofold. We anticipate that the microarray data will lead not only to the identification of target genes that are important for breast cancer cell growth, survival, and invasion, but also to the discovery of signaling pathways leading to the expression of these genes.
The protein-to-protein interaction databases include MIPS, DIP, BIND [16, 17], GRID, and I2D. Noise is often a factor in protein-to-protein interaction data sets. To minimize the noise and its impact on the final outcome, we apply an ensemble-based method for selecting the interactions. That is, by applying majority voting on interacting pairs from the different databases, we can improve the accuracy and minimize the errors in the interaction information. I2D provides experimentally determined and predicted protein-to-protein interactions with an easy-to-use interface, and thus we have downloaded the I2D data for the Homo sapiens genome.
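A minimal sketch of majority voting over interacting pairs is given below. It assumes each source is simply a list of gene-name pairs and that a source which does not list a pair counts as a vote against it; both are simplifying assumptions made here, and the function name `majority_vote_interactions` is hypothetical.

```python
from collections import Counter

def majority_vote_interactions(sources):
    """Keep the undirected pairs reported by more than half of the sources.

    sources: list of iterables of (gene_a, gene_b) tuples, one per database.
    Returns a set of frozensets, each holding the two interacting genes.
    """
    votes = Counter()
    for pairs in sources:
        # frozenset treats (A, B) and (B, A) as the same interaction;
        # the set comprehension counts each pair once per source.
        unique_pairs = {frozenset(p) for p in pairs if p[0] != p[1]}
        votes.update(unique_pairs)
    threshold = len(sources) / 2
    return {pair for pair, count in votes.items() if count > threshold}
```

With three sources, for example, a pair must appear in at least two of them to be retained.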
2.2. Data Preprocessing
Suppose we are gathering protein-to-protein interactions from different sources, each with its own accuracy. By combining the results of independent tests or sources, each of which has prediction accuracy over 50%, we can obtain prediction accuracy better than that of any one method alone. Suppose we have $n$ independent sources, each with some predefined fixed prediction accuracy, say $p$. Without loss of generality, let us assume $n$ is an odd number. By accepting the decision of the majority of the $n$ predictors, the combined accuracy $P$ is given by the following formula:

$$P = \sum_{k=m}^{n} \binom{n}{k} p^{k} (1-p)^{n-k},$$

where $m = (n+1)/2$.
For example, if nine independent predictors, each with prediction accuracy 0.65, are combined by majority vote, the combined prediction accuracy becomes 0.83.
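The combined accuracy of a majority vote can be checked numerically with a short sketch. It assumes the standard binomial form of the majority-vote accuracy for an odd number of independent predictors with a common accuracy; the names `n`, `p`, and `majority_vote_accuracy` are chosen here for illustration.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that more than half of n independent predictors,
    each correct with probability p, are correct (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority always exists")
    m = (n + 1) // 2  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(m, n + 1))

# Nine predictors with accuracy 0.65 give roughly 0.83.
print(round(majority_vote_accuracy(9, 0.65), 2))
```

The gain depends on the independence assumption; strongly correlated sources would improve less than this calculation suggests.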
I2D  collects and maintains protein to protein interaction from various sources and we have downloaded the interaction information pertaining to Homo Sapiens. By applying the majority votes, we have minimized some plausible noise in the data set.
The microarray experiment was repeated three times, and in each repetition the expression of genes was measured under the following two conditions: (1) integrin α6β4 negative cell line (control), and (2) integrin α6β4 positive cell line. Out of the 22 277 genes, we have selected only the 8512 genes that have a valid signal in all measurements. The average over all repetitions of the log ratio between the integrin-positive and the control expression is taken as the expression of a gene. From the expression values, we could create different expression patterns, such as up regulated fold changes of 2 to 3, 3 to 4, and over 4, and similar groups among the down regulated genes. For simplicity, we have taken only two patterns, namely, up regulated and down regulated genes. The up regulated genes are those whose log ratio is over 1 (a ratio of 2 has a base-2 log of 1), and the down regulated genes are those whose log ratio is less than −1 (a ratio of 0.5 has a base-2 log of −1).
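The averaging and thresholding step can be sketched as below. The sketch assumes base-2 logarithms (so that a twofold change corresponds to a log ratio of 1) and strictly positive signal values; the function names and the "unchanged" label are illustrative choices, not part of the original analysis.

```python
from math import log2

def mean_log_ratio(positive, control):
    """Average base-2 log ratio of integrin-positive to control expression
    across repetitions (one value per repetition in each list)."""
    ratios = [log2(pos / ctl) for pos, ctl in zip(positive, control)]
    return sum(ratios) / len(ratios)

def classify(log_ratio, cutoff=1.0):
    """Label a gene from its mean log ratio; cutoff=1 means twofold."""
    if log_ratio > cutoff:
        return "up"
    if log_ratio < -cutoff:
        return "down"
    return "unchanged"
```

For instance, a gene measured at 8 in each positive repetition and 2 in each control repetition has a mean log ratio of 2 and is labeled "up".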
We have downloaded the human protein-to-protein interaction network from I2D, which has 13 560 genes with connectivity ranging from 1 to 694. The connectivity or degree of a node is defined as the number of edges connected to that node, and we consider each edge as a bidirectional connection. As expected, the interaction network follows the scale-free distribution. For the purpose of integrating the interaction network with the microarray expression data set, we have extracted from the whole network the subnetwork that interacts with the differentially expressed genes from the experiment. The selected subnetwork, which we refer to as G, has 2186 genes, including the 190 differentially expressed genes, and 3130 edges. A view of graph G, as created by Navigator, is shown in Figure 1. The up and down regulated genes are shown in red and green, respectively, and the size of each node corresponds to the degree of interaction of that node in the graph.
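One plausible reading of this extraction step is sketched below: keep every interaction that has at least one differentially expressed gene as an endpoint, then count the degree of each node in the result. The exact rule used to build G is not spelled out in the text, so this rule, the edge-list input format, and the function names are assumptions.

```python
from collections import defaultdict

def extract_subnetwork(edges, seed_genes):
    """Return the undirected edges that touch at least one seed gene."""
    seeds = set(seed_genes)
    return {
        frozenset((u, v))
        for u, v in edges
        if u != v and (u in seeds or v in seeds)
    }

def degrees(edge_set):
    """Degree (number of incident edges) of every node in an edge set."""
    deg = defaultdict(int)
    for edge in edge_set:
        for node in edge:
            deg[node] += 1
    return dict(deg)
```

Representing each edge as a frozenset makes the connection bidirectional and removes duplicate listings of the same pair.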
In a typical microarray analysis, the differentially expressed genes are ranked based on their fold changes, and the first few of them are taken as important. We feel that using expression fold change alone to determine the importance of a gene is quite weak. We take a different approach in this paper for discovering a set of important genes under a given experimental condition. We create the subgraph, say G, of the protein-to-protein interaction network that is associated with the differentially expressed genes from the microarray experiment. It is generally believed that the connectivity of a node in G roughly reflects the importance of the gene in the interaction. We found that even the network G has the property of a typical scale-free network, indicating that only a small fraction of the nodes have large connectivity.
2.3.1. Selecting a Set of Important Genes Based on Topological Structure
In recent work, Jeong et al. and Tew et al. have suggested that essential proteins are overrepresented among proteins having a high degree of connectivity, which can be attributed to their central role in mediating interactions among numerous, less connected proteins. Hub nodes in an interaction network are defined as a set of nodes with a very high degree of interaction with neighbors, and the corresponding threshold for connectivity is defined quite arbitrarily. Vallabhajosyula et al. have studied the issue of selecting hub nodes and the impact on their functional significance, but unfortunately they were unable to provide any prescriptive definition or method for selecting hub nodes. They, however, stated that nodes with a relatively high degree of interaction are likely to have very high functional significance. In the literature, we found that people have applied varying criteria in selecting the threshold for hub nodes; for example, Batada et al. have defined hub nodes as those whose connectivity exceeds that of 90% or 95% of the nodes in the network. Based on the finding that the top few percent of nodes with a high degree of interaction have better functional significance, we selected as hub nodes those that are in the top 3% of the nodes ranked in decreasing order of connectivity.
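The top-3% rule can be sketched as follows. How ties at the cutoff were handled is not stated in the text; this sketch keeps every node whose degree is at least the cutoff, which is an assumption, as are the function name `select_hubs` and the rounding with `ceil`.

```python
import math

def select_hubs(degree, top_fraction=0.03):
    """Return (hub_nodes, cutoff_degree) for the top fraction of nodes
    ranked by decreasing degree. degree maps node -> connectivity."""
    if not degree:
        return set(), None
    ranked = sorted(degree.values(), reverse=True)
    k = max(1, math.ceil(top_fraction * len(ranked)))
    cutoff = ranked[k - 1]  # degree of the last node inside the top fraction
    # Assumed tie rule: every node at or above the cutoff counts as a hub.
    hubs = {node for node, d in degree.items() if d >= cutoff}
    return hubs, cutoff
```

Because of the tie rule, the returned set can be slightly larger than 3% of the nodes when several nodes share the cutoff degree.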
We also believe that important genes must play a role in the stability of the network; that is, the removal of such a node will break the network into disconnected subnetworks. An articulation node in a graph plays the role of connecting or keeping the graph together, and the removal of such a node separates the graph into subgraphs. Thus the hub genes that play an articulation role in an interaction network seem to have more functional significance.
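Articulation nodes can be found with a single depth-first search. The sketch below uses the standard low-link method in an iterative form, so that large networks do not hit Python's recursion limit; the text does not say which algorithm or tool was actually used, and the function name and edge-list input are assumptions.

```python
from collections import defaultdict

def articulation_points(edges):
    """Nodes whose removal disconnects their connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    disc = {}   # discovery order of each node in the DFS
    low = {}    # earliest discovery order reachable from the node's subtree
    result = set()
    counter = 0

    for root in list(adj):
        if root in disc:
            continue
        disc[root] = low[root] = counter
        counter += 1
        root_children = 0
        stack = [(root, None, iter(adj[root]))]
        while stack:
            node, parent, neighbors = stack[-1]
            descended = False
            for nxt in neighbors:
                if nxt == parent:
                    continue
                if nxt in disc:
                    # Back edge: node can reach an earlier-discovered node.
                    low[node] = min(low[node], disc[nxt])
                else:
                    disc[nxt] = low[nxt] = counter
                    counter += 1
                    if node == root:
                        root_children += 1
                    stack.append((nxt, node, iter(adj[nxt])))
                    descended = True
                    break
            if not descended:
                stack.pop()
                if stack:
                    par = stack[-1][0]
                    low[par] = min(low[par], low[node])
                    # A non-root node is an articulation node if some child
                    # subtree cannot reach above it.
                    if par != root and low[node] >= disc[par]:
                        result.add(par)
        # The DFS root is an articulation node only with two or more children.
        if root_children > 1:
            result.add(root)
    return result
```

Intersecting the returned set with the hub set gives the hub genes that also play an articulation role.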
A minimum spanning tree is an acyclic subgraph that connects all the nodes in a network such that the sum of the costs of all its edges is minimal, and it thus eliminates redundant paths among the nodes. A node with a high degree of connectivity in the minimum spanning tree will indeed play an important role. In a protein interaction network the edge cost is taken to be 1, and we construct a minimum spanning tree using Kruskal’s algorithm. We selected the hub nodes from the minimum spanning tree and consider them as important genes too.
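A compact sketch of Kruskal's algorithm with unit edge costs follows, together with the node degrees in the resulting tree. One caveat worth stating: when every edge costs 1, any spanning tree is minimal, so the tree obtained (and hence the degree of each node in it) depends on the order in which edges are considered. The sketch keeps the input order; the order used in the original analysis is not known, and the function names are hypothetical.

```python
def kruskal_spanning_forest(edges, cost=lambda u, v: 1):
    """Kruskal's algorithm with union-find; returns the chosen edges.
    For a disconnected graph this yields one tree per component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # sorted() is stable, so equal-cost edges keep their input order.
    for u, v in sorted(edges, key=lambda e: cost(*e)):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components: no cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

def tree_degrees(tree):
    """Degree of every node within the spanning tree."""
    deg = {}
    for u, v in tree:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg
```

Hub nodes of the tree can then be chosen from `tree_degrees` with the same top-fraction rule used for the full network.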
As described above, three sets of potentially important nodes can be selected using the following methods: (1) hub nodes from the interaction network, (2) hub nodes from the set of articulation nodes, and (3) hub nodes from the minimum spanning tree. The nodes satisfying condition 2 are a subset of those satisfying condition 1, and hence we have only two distinct conditions, namely, 2 and 3. We define the set of important genes as those that satisfy either condition 2 or condition 3.
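In set terms, the rule above is an intersection followed by a union. A minimal sketch, assuming the three node sets have already been computed and using a hypothetical function name:

```python
def important_genes(hubs, articulation_nodes, mst_hubs):
    """Genes that are articulation hubs (condition 2) or hubs of the
    minimum spanning tree (condition 3)."""
    articulation_hubs = set(hubs) & set(articulation_nodes)  # condition 2
    return articulation_hubs | set(mst_hubs)                 # 2 or 3
```

A hub that is neither an articulation node nor a hub of the spanning tree is therefore not selected.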
2.3.2. Important Genes Based on Pathways and Interaction
The Pandey Lab at the Johns Hopkins University and the Institute of Bioinformatics maintain ten experimentally determined cancer signaling pathways for Homo sapiens, namely, EGFR1, TGF-beta Receptor, TNF-alpha/NF-kB, Alpha6 Beta4 Integrin, ID, Hedgehog, Notch, Wnt, AR, and Kit Receptor. We have obtained the genes in each of the ten cancer pathways and extracted from the interaction network a subnetwork of nodes that interact with any gene in the cancer pathways. The important nodes of this pathway-related subnetwork include the ones from the following three methods or sources:
(1) Hub nodes of the pathway-related subnetwork.
(2) Hub nodes among the articulation nodes of the pathway-related subnetwork.
(3) Hub nodes of the minimum spanning tree created from the pathway-related subnetwork.
The nodes satisfying condition 2 are indeed a subset of those satisfying condition 1 and hence we have only two distinct conditions, namely, 2 and 3. The important nodes related to cancer pathway are those that satisfy either condition 2 or 3.
Besides examining the important nodes in each graph, we can examine the cliques or near cliques for similar functional association of genes. Han et al., along with many other researchers, have used cliques or near cliques in an interaction network to find functional groups of genes. A clique is a fully connected subgraph of a graph, and finding the largest clique in a network is computationally intractable in general (the problem is NP-hard). For many practical purposes, near cliques are computed instead.
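For networks of modest size, maximal cliques can still be enumerated directly. The sketch below uses the Bron-Kerbosch algorithm with pivoting, a standard technique that is named here as an illustration; the text does not state how cliques or near cliques were actually computed. The adjacency-dictionary input and the `min_size` parameter are assumptions, and worst-case running time remains exponential.

```python
def maximal_cliques(adj, min_size=3):
    """Enumerate maximal cliques with at least min_size nodes.

    adj: dict mapping each node to the set of its neighbors
    (undirected graph, no self-loops).
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the current set, so it is maximal.
            if len(current) >= min_size:
                cliques.append(current)
            return
        # Pivot on the node covering the most candidates to prune branches.
        pivot = max(candidates | excluded, key=lambda n: len(adj[n] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques
```

For a triangle A, B, C with an extra node D attached to C, the only maximal clique of size three or more is {A, B, C}.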
From the microarray experiment, we have two different expression patterns, namely, up- and downregulated genes. The up regulated genes are those that have a valid signal across the three trials and have an expression level over 2 times that of the reference gene. Similarly, the down regulated genes are those that have a valid signal across the three trials and have an inverse expression level over 2 with respect to the reference gene. We list the first 14 up and down regulated genes of our experiment in Table 1. We combined the gene expression with gene interaction by selecting the subset of the interaction graph that associates with all the differentially expressed genes. The selected subgraph, which we refer to as G, has 2186 genes, including the 190 differentially expressed genes, and 3130 edges. Note that there is not a single hub node among the 14 high ranking up regulated nodes of G. On the other hand, there are 5 hub nodes among the high ranking down regulated nodes. There seems to be no correlation between the hub nodes of the interaction graph and the highly up or down regulated genes.
From the graph G, we select the set of important genes based on topological structure, which involves selecting the hub nodes and following the procedures described in the previous section. The cutoff connectivity for the hub nodes in G is 16, and there are 60 hub genes out of 2186 genes. Out of the 60 hub nodes, 49 are from the differentially expressed genes (12 of them are up regulated and the rest are down regulated). The graph G has 200 articulation genes, out of which 60 satisfy the hub condition (degree 16 or above). The minimum spanning tree of G was constructed assuming an edge cost of 1. The nodes with connectivity 9 or better in the minimum spanning tree satisfy the hub node property. The minimum spanning tree has 77 hub genes, out of which 17 are up regulated and 46 are down regulated. In agreement with conditions 2 and 3 in Section 2, 57 genes are selected as important ones, out of which 12 are up regulated and 35 are down regulated. These genes are listed in Table 3.
To discover the important genes related to cancer, we have extracted a subnetwork from G, which we refer to as the pathway-related network, such that each node in it is directly associated with at least one of the genes in the cancer pathways, which include the EGFR1, TGF-beta Receptor, TNF-alpha/NF-kB, Alpha6 Beta4 Integrin, ID, Hedgehog, Notch, Wnt, AR, and Kit Receptor pathways. The genes in these curated human pathways were downloaded from their web portal. We found that 24 nodes in the network, with connectivity 12 or better, satisfy the hub node property. The pathway-related network has 132 articulation genes, out of which 23 are hub genes. The minimum spanning tree of the pathway-related network was constructed, and its backbone is shown in Figure 2. The minimum spanning tree has 200 genes, and 17 of these genes, with connectivity 4 or better, satisfy the hub node property. By combining these three sets of hub genes using the ensemble method, we have created the set of important genes related to pathways, which is presented in Table 4.
Besides examining the important genes in the cancer pathway related network, we searched for cliques or near cliques in the network to examine functionally related genes. The cliques from the network are shown in Figure 3.
The direct interaction among the genes identified as important nodes due to the known cancer pathways is shown in Figure 5.
4. Summary and Discussion
In this paper we have presented a general method for integrating microarray expression with other complementary information related to gene function so that we can understand and infer information about the set of genes that we are interested in. In particular, we focused on integrating protein interaction information and pathway-related information with microarray expression. We have applied the proposed general methodology to our recent microarray experiment to discover potential drug targets that may lead to novel anticancer therapeutics.
Quite a large body of research has been done on integrating expression data with interaction networks and other data sets. Many of the works fall into one or some combination of the following categories: (1) annotating uncharacterized genes, (2) refining protein-to-protein interaction networks, (3) predicting protein-to-protein interactions, and (4) refining potential biomarkers from array expression. The work presented here has some overlap with the recent work of Camargo et al., which involved integrating expression data with interaction data for refining potential biomarkers of array expression and for annotating uncharacterized genes. They have used hub genes of the interaction network to refine biomarkers of the expression data sets.
The interaction network of Homo sapiens is scale free; that is, there are few nodes that have a very high degree of interaction and that facilitate other nodes in mediating their functions. Even the subnetwork of the interaction network that has direct interaction with the differentially expressed genes is found to have the properties of a scale-free network. Hub nodes in an interaction network are defined as a set of nodes with a very high degree of interaction with neighbors, and the corresponding threshold for connectivity is defined quite arbitrarily. Based on the finding that the top few percent of nodes with a high degree of interaction have better functional significance, we selected as hub nodes those that are in the top 3% of the nodes ranked in decreasing order of connectivity.
From the Homo sapiens interaction network, we have extracted a subnetwork called G that is associated with the differentially expressed genes of our microarray experiment. Hub nodes in an interaction network are important, and we selected the first set of hub nodes from G. A set of articulation nodes, which play a role in the stability of the network, is also important, and we selected a set of articulation nodes from G. We have constructed a minimum spanning tree from G and selected a set of hub nodes from the minimum spanning tree. We created the set of important genes based on the topological structure of the interaction network. The hub nodes alone, in isolation, do not reveal any useful information. Similarly, the highly ranked up or down regulated genes by themselves do not provide any clue to potential signaling pathways either.
On the other hand, when we combine the set of important genes based on the interaction topology from Table 3 and the set of highly expressed genes from Table 1, we start to get some insight into a potential signal transduction pattern, as shown in Figure 4. The highly expressed gene from the experiment, NRCAM (neuronal cell adhesion molecule), interacts directly with another gene, NA (neuroacanthocytosis), which is recognized as an important gene from the topology and which mediates the down regulation of the following set of tumor suppressor genes: CHEK1, XRCC6, SMARCB, and ATM. The gene NA acts as a hub gene among the set of important genes, and it directly interacts with SMARCB and XRCC4, which directly interacts with CHEK1, which in turn directly interacts with ATM. It is notable that down regulation of these tumor suppressor genes by integrin α6β4 has a significant implication in cancer biology. Poor prognosis has been associated with overexpression of integrin α6β4, and our analysis revealed that loss of these tumor suppressor genes could contribute to the malignant phenotype of cancer cells.
The impact of this study lies in the identification and targeting of molecular aberrations specific to cancer cells. Many recent studies targeting a single agent turned out to be disappointing. This could partly be due to the inability to identify the signaling network or loop that is positively or negatively regulated around the single target. To meet this important challenge, a number of recent studies are analyzing cancer cell lines and tissue samples to measure alterations at the gene, RNA, and protein levels to identify markers and targets for therapy. These studies will produce a large amount of data, whose analysis is critical in order to understand cancer at the molecular level. For example, a similar microarray analysis by Chen et al. of MDA-MB-435 cells engineered to differ in integrin α6β4 expression led to the identification of a couple of invasion- and metastasis-related genes such as ENPP2 and S100A4. What distinguishes our study from these works is that we are in a position to identify genes and proteins that are functionally connected to drive malignant properties, rather than focusing on a single gene, because targeting these subnetworks will inhibit cancer cell functions important for progression. For example, we found the potentially important target genes associated with cancer pathways summarized in Table 4. Those genes are associated with the TGF-β, TNF-α, and EGFR1 pathways, whose roles in cancer progression have been well established.
In summary, the integration of the interaction network with the expression data for integrin α6β4 in MDA-MB-435 cancer cells reveals the importance of NRCAM, which we would not have discovered with the expression information alone. Further, the interaction network in Figure 4 helps us to understand how the tumor suppressor genes CHEK1, XRCC6, SMARCB, and ATM were down regulated by integrin α6β4. Finally, we envision that the discovery of interaction networks triggered by a tumor antigen such as integrin α6β4 will lead to the development of novel anticancer therapeutics by targeting the signaling molecules associated with the interaction network.
O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein, “A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae),” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 14, pp. 8348–8353, 2003.
G. R. Lanckriet, M. Deng, N. Cristianini, M. I. Jordan, and W. S. Noble, “Kernel-based data fusion and its application to protein function prediction in yeast,” in Proceedings of the Pacific Symposium on Biocomputing (PSB '04), pp. 300–311, Big Island of Hawaii, Hawaii, USA, January 2004.
I. Xenarios, L. Salwinski, X. J. Duan, P. Higney, S.-M. Kim, and D. Eisenberg, “DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Research, vol. 30, no. 1, pp. 303–305, 2002.
A. M. Mercurio, R. E. Bachelder, I. Rabinovitz, K. L. O'Connor, T. Tani, and L. M. Shaw, “The metastatic odyssey: the integrin connection,” Surgical Oncology Clinics of North America, vol. 10, no. 2, pp. 313–328, 2001.
R. C. Willis and C. W. Hogue, “Searching, viewing, and visualizing data in the Biomolecular Interaction Network Database (BIND),” Current Protocols in Bioinformatics, chapter 8: unit 8.9, 2006.
K. L. Tew, X. L. Li, and S. H. Tan, “Functional centrality: detecting lethality of proteins in protein interaction networks,” Genome Informatics, vol. 19, pp. 166–177, 2007.
N. N. Batada, T. Reguly, A. Breitkreutz et al., “Stratus not altocumulus: a new view of the yeast protein interaction network,” PLoS Biology, vol. 4, no. 10, p. e317, 2006.
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2nd edition, 2001.
C. W. M. Roberts and J. A. Biegel, “The role of SMARCB1/INI1 in development of rhabdoid tumor,” Cancer Biology and Therapy, vol. 8, no. 5, pp. 412–416, 2009.
M. Chen, M. Sinha, B. A. Luxon, A. R. Bresnick, and K. L. O'Connor, “Integrin α6β4 controls the expression of genes associated with cell motility, invasion, and metastasis, including S100A4/metastasin,” The Journal of Biological Chemistry, vol. 284, no. 3, pp. 1484–1494, 2009.