Big Data and Network BiologyView this Special Issue
An Unsupervised Approach to Predict Functional Relations between Genes Based on Expression Data
This work presents a novel approach to predict functional relations between genes using gene expression data. Genes may have various types of relations between them, for example, regulatory relations, or they may be concerned with the same protein complex or metabolic/signaling pathways and obviously gene expression data should contain some clues to such relations. The present approach first digitizes the log-ratio type gene expression data of S. cerevisiae to a matrix consisting of 1, 0, and −1 indicating highly expressed, no major change, and highly suppressed conditions for genes, respectively. For each gene pair, a probability density mass function table is constructed indicating nine joint probabilities. Then gene pairs were selected based on linear and probabilistic relation between their profiles indicated by the sum of probability density masses in selected points. The selected gene pairs share many Gene Ontology terms. Furthermore a network is constructed by selecting a large number of gene pairs based on FDR analysis and the clustering of the network generates many modules rich with similar function genes. Also, the promoters of the gene sets in many modules are rich with binding sites of known transcription factors indicating the effectiveness of the proposed approach in predicting regulatory relations.
The cell works as a system governed by integrated action of the genes indicating that genes are functionally related; for example, they may have regulatory relations between each other or they may be concerned with the same protein complex or metabolic/signaling pathways and so on. Determining functional relations between genes enables development of a genetic network which leads to the prediction of the complex rolls of the genes in different systems in the cell. Nucleotide and/or amino acid sequence similarities have been extensively used to predict functional relation between genes [1, 2]. Affinity purification [3, 4] and yeast two-hybrid assays [5, 6] are employed to determine physical association between proteins which are gene products. Synthetic lethal screens  measure the tendency for genes to compensate the loss of other genes. Scientists have performed numerous studies in an attempt to better understand and classify digenic epistatic relationships . In  a probabilistic functional network of yeast genes was constructed by integrating diverse genomic data. In  an algorithm was proposed for regulatory networks of gene modules that combines information from genome wide location and expression data sets. Constraint-based Bayesian Structure Learning (BSL) techniques, namely, (a) PC Algorithm, (b) Grow-shrink (GS) algorithm, and (c) Incremental Association Markov Blanket (IAMB), were used to model the functional relationships between genes associated with differentiation potential of aged myogenic progenitors in the form of acyclic networks from the clonal expression profiles . Attempts have been made not only to determine functional relationship between individual genes but also to measure functional relationship between gene sets . Many more similar studies can be cited. Microarray gene expression data incorporating with other information have been extensively used for predicting regulatory relation between genes [13–15]. However it is logical to assume that expression data contains information about various types of functional relations between genes. In the present work we propose an approach for estimating integrated linear and probabilistic relations between expression profiles of genes and applied the concept to determine functional relations between yeast genes solely based on gene expression data. The proposed method successfully detected functionally related gene pairs that share many GO terms. The method also shows promise to be utilized in the process of detecting regulatory relations between genes.
2. Materials and Methods
2.1. Data Used in This Work
The data used in this work was previously used in other works [16–19]. The data is a matrix containing some missing values. Each data point produced by a DNA microarray hybridization experiment represents the log of the ratio of expression levels of a particular gene under two different experimental conditions. The result, from an experiment with genes on a single chip, is a series of log-transformed expression-level ratios. Typically, the numerator of each ratio is the expression level of the gene in the varying condition of interest, whereas the denominator is the expression level of the gene in some reference condition. The expression measurement is positive if the gene is induced (turned up) with respect to the reference state and negative if it is repressed (turned down). The data were collected at various time points during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks.
2.2. Missing Value Imputation
In microarray gene expression data missing values often occur due to various reasons, such as insufficient resolution, image corruption, dust, or scratches on the slide. Usually, microarray datasets are estimated to have more than 5% missing values and up to 90% of genes are affected [20, 21]. The gene expression data considered in this work contains 3760 missing values. The missing values were filled based on principal component analysis (PCA) by using the package pcaMethods . Using PCA we can model a matrix by defining two parameter matrices, the scores, , and the loadings, , such that when multiplied with each other they well reconstruct the original matrix as follows: where is the error matrix and denotes the original variable averages. Now if contains missing values but and can be completely estimated, then we can use as an estimate for if is missing.
2.3. Digitization of Gene Expression Matrix
After missing value imputation, let us denote the gene expression data matrix as . For each row of we calculate the average and standard deviation. Let for the th row the average and standard deviations be denoted as and . Now, the digitized matrix is created as follows:
In the above equations “th” is a threshold which should be a real number and in most practical cases it is within 0 to 2. We digitized the data using the values of threshold “th” as 0.5, 1, and 1.5. For each case the distribution of the genes with respect to the count of 1 s in their profiles is shown in Figure 1. In case of , the distribution approaches roughly normal and we observed similar trend in case of −1. Hence in this work we considered for the digitization of the gene expression data.
2.4. Probability Density Mass Function Table
Based on a digitized matrix containing only 1, 0, and −1 a probability density mass function table can be constructed corresponding to each gene pair indicating nine joint probabilities as shown in Table 1.
Any element of the above table (corresponding to two genes say, gene and gene ) where can be calculated by assuming TRUE = 1 and FALSE = 0 in (4) as follows:
Here is the width of matrix .
We assume that the joint probabilities of Table 1 and corresponding conditional probabilities contain important clues to estimate functional relations between genes.
In this work we hypothesize that when gene is positively functionally related to gene , then should be statistically high. Using Bayes rule we can write . Now if is very small, then can be very high and that can sometimes happen because of noisy data. To avoid this problem we can consider as an indicator that gene is positively functionally related to gene . To further strengthen the case we consider that when both and are statistically significant then gene a and gene b are positively functionally related. Considering other joint probability masses might be useful for finding functional relations between some multi function genes. By intuition we can realize that the sum of probabilities actually indicates an integrated measure of both linear and probabilistic relations between the profiles of two genes and this term will be referred to as positive linear and probabilistic relation () in the following. To our knowledge this is the first approach to measure similarity between two multivariate entities based on joint probability density masses in selected points giving emphasis on both linear and probabilistic relations.
3.1. Effectiveness of
The distribution of all gene pairs in the context of is shown in Figure 2(a). The average value of is 0.0819. We calculated for the gene pairs for which is larger than the average value. The distribution of those gene pairs with respect to is shown in Figure 2(b). The average value of is 0.429. Initially we selected the highest 1%, 2%, 3%, 4%, and 5% gene pairs from the distribution of Figure 2(b), that is, gene pairs with higher values, and determined the number of GO terms  shared by both the genes of each pair.
Figure 3(a) shows the percentage of selected gene pairs that share at least 1, 2, and, 3 GO terms and also that of equal number of randomly selected gene pairs. In the context of minimum number of shared GO terms the percentage of selected gene pairs is always much higher compared to that of randomly selected pairs. Figure 3(a) further shows that the higher the lower cutoff value of for a group of gene pairs is, the higher proportion of the gene pairs share common GO terms. To further illustrate the result we show in Figure 3(b) the actual number of shared GO terms for the highest 1% selected gene pairs and the equal number of random gene pairs which implies that the gene pairs selected based on share much more GO terms. Thus is a good measure to determine functional relation between genes.
3.2. FDR Analysis
We conducted FDR (false discovery rate) [24, 25] analysis to statistically assess the false positive rates among the selected gene pairs based on . For each pair of genes for which is above average we did the following.(i)The numbers of 1 s, 0 s, and −1 s in the digital profile of both genes are counted.(ii)Random profiles of both the genes are constructed by randomly imputing the same numbers of 1 s, 0 s, and −1 s. This process is repeated 100 times.(iii)Then, , , and are calculated for both real and random profile pairs. is the total number of profile points for which the expression level of both genes is . In case of random profiles the average values corresponding 100 random profile pairs were considered.(iv)A chi-square value is calculated as follows where is the width of the expression matrix: (v)Based on the chi-square value, a -value for the gene pair is determined using statistical software. Note that is directly proportional to .
Figure 4(a) shows the distribution of the gene pairs with respect to the -values with a -value interval of 0.05. For any given cutoff -value the FDR is calculated as follows:
Figure 4(b) shows the plot of FDR with respect to cutoff -values. As the cutoff -value decreases, FDR decreases rapidly and becomes roughly constant at -value of 0.001. There are 25559 gene pairs for which the -value is less than 0.001.
3.3. Network and Modules of the Selected Gene Pairs
Based on the FDR analysis of the above section, we selected 25559 gene pairs having highest values. Such selected gene pairs make a network consisting of 2131 nodes. We determined high density modules in that network using the network clustering algorithm DPClusO  and found 1154 modules of size 3 or more (see Supplementary File 1 in supplementary material available online at http://dx.doi.org/10.1155/2014/154594).
3.3.1. Richness of Similar Function Genes
To evaluate the richness of similar function genes in the modules we calculated their hypergeometric -values by using the package GOstats  in the context of all three types of GO terms: biological process (BP), cellular compartment (CC), and molecular function (MF). Figures 5(a), 5(b), and 5(c) show the distribution of the modules with respect to −log(-value) which implies that almost all the modules are statistically significant. We selected 10 lowest -value clusters corresponding to different GO terms from each of the three distributions of Figure 5 and their set union resulted in 22 clusters. Some biological information from the SGD database  about those 22 clusters is summarized in Table 2. Column 3 in Table 2 shows the -values and corresponding GO terms determined by GOstats. Column 4 in Table 2 shows other GO terms retrieved from SGD database associated to the clusters covering many genes which implies that almost all the genes of each of the clusters could be associated to important GO terms which confirms the fact that the proposed method is a promising way to establish functional relation between genes based on expression data.
3.3.2. Richness of Similar Binding Sites
Furthermore to verify the presence of similar binding sites in the promoters of the genes included in individual modules we used the tool PRIMA (PRomoter Integration in Microarray Analysis)  from the software package EXPANDER . Total 180 modules were found to have -values less than 10−3 in the context of binding site enrichment of 57 various transcription factors. The enrichment table generated by EXPANDER is in supplementary material (Supplementary Table 1). Table 3 shows information about 10 modules corresponding to lowest -values involving 10 different transcription factors. We downloaded a list of known regulatory relations from the YEASTRACT database  and verified whether the genes in a module have regulatory relation with the associated transcription factor. Column 6 of Table 3 shows that a large number of genes in individual modules are already reported to be regulated by the corresponding transcription factor. Only in case of CID736, though all 3 genes contain in their promoters the binding site of the transcription factor DAL82, no regulatory relation between those genes is reported in the YEASTRACT database presently. However based on our analysis regulatory relations between DAL82 and those three genes may be predicted. Thus the proposed measure can also be integrated to other types of information for developing a method to predict regulatory relations between genes which is one of our future works.
In this work we propose a novel measure to determine functional relation between genes based on gene expression data. The present approach first digitizes the log-ratio type gene expression data to a matrix consisting of 1, 0, and −1 indicating highly expressed, no major change and highly suppressed conditions for genes, respectively. Then a probability density mass function table is constructed indicating nine joint probabilities for each pair of genes. Those pairs of genes were considered as functionally related for which the sum of probability density masses in selected points are statistically significant. We applied the method to a sample gene expression data of S. cerevisiae. It was found that substantial majority of the selected gene pairs share many GO terms. Also the network consisting of the selected gene pairs contains high density modules. It was shown that those modules were rich with similar function genes. Furthermore, it was verified that for many modules many of the genes contain similar binding sites in their promoters corresponding to known transcription factors of yeast and those transcription factors are known regulators of many of the genes in the corresponding module. Above all this work introduces a new approach for simultaneously measuring both linear and probabilistic relations between multivariate entities which is useful for handling multivariate data and big data biology.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research is supported by the National Bioscience Database Center in Japan, the Ministry of Education, Culture, Sports, science, and Technology of Japan (Grant-in-Aid for Scientific Research on Innovation Areas “Biosynthetic Machinery. Deciphering and Regulating the System for Creating Structural Diversity of Bioactivity Metabolites (2007)”).
Supplementary Table 1: The tool PRIMA (PRomoter Integration in Microarray Analysis) from the software package EXPANDER was used to verify the presence of similar binding sites in the promoters of the genes included in individual modules. Total 180 modules were found to have p-values less than 10-3 in the context of binding site enrichment of 57 various transcription factors. One module may be associated to more than one transcription factor. Supplementary Table 1 is the enrichment table generated by Expander.
T. Pupko, R. E. Bell, I. Mayrose, F. Glaser, and N. Ben-Tal, “Rate4Site: an algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues,” Bioinformatics, vol. 18, no. 1, pp. S71–S77, 2002.View at: Google Scholar
M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates, “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 8, pp. 4285–4288, 1999.View at: Publisher Site | Google Scholar
T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, “A comprehensive two-hybrid analysis to explore the yeast protein interactome,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 8, pp. 4569–4574, 2001.View at: Publisher Site | Google Scholar
I. Miko, “Epistasis: gene interaction and phenotype effects,” Nature Education, vol. 1, article 197, 2008.View at: Google Scholar
Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society B, vol. 57, no. 1, pp. 289–300, 1995.View at: Google Scholar
D. Abdulrehman, P. T. Monteiro, M. C. Teixeira et al., “YEASTRACT: providing a programmatic access to curated transcriptional regulatory associations in Saccharomyces cerevisiae through a web services interface,” Nucleic Acids Research, vol. 39, no. 1, pp. D136–D140, 2011.View at: Publisher Site | Google Scholar