Abstract

Detailed and innovative analysis of gene regulatory network structures may reveal novel insights to biological mechanisms. Here we study how gene regulatory network in Saccharomyces cerevisiae can differ under aerobic and anaerobic conditions. To achieve this, we discretized the gene expression profiles and calculated the self-entropy of down- and upregulation of gene expression as well as joint entropy. Based on these quantities the uncertainty coefficient was calculated for each gene triplet, following which, separate gene logic networks were constructed for the aerobic and anaerobic conditions. Four structural parameters such as average degree, average clustering coefficient, average shortest path, and average betweenness were used to compare the structure of the corresponding aerobic and anaerobic logic networks. Five genes were identified to be putative key components of the two energy metabolisms. Furthermore, community analysis using the Newman fast algorithm revealed two significant communities for the aerobic but only one for the anaerobic network. David Gene Functional Classification suggests that, under aerobic conditions, one such community reflects the cell cycle and cell replication, while the other one is linked to the mitochondrial respiratory chain function.

1. Introduction

The difference between aerobic and anaerobic conditions at the molecular level has drawn considerable attention in last twenty years. Man and Pilpel [1] found that the transcription rate of mitochondrial gene under aerobic conditions is significantly higher than that under anaerobic conditions, while glycolytic genes are more active under anaerobic conditions. Hou et al. [2] found that under anaerobic conditions SPT3 and SPT15 are overexpressed, which may not only enhance resistance to ethanol and stress but also upregulate the fermentation transcription factors in Saccharomyces cerevisiae. Jiang et al. [3] found that, under anaerobic condition, mitochondrial function is weakened while fermentation capacity is enhanced in Saccharomyces cerevisiae. This suggests competition between aerobic and anaerobic metabolisms as a result of evolution.

A fundamental goal of genomics is to understand how gene regulatory networks actually give rise to cellular phenotypes. Modeling and reconstruction of gene regulatory networks based on high-throughput data have become one of the most common goals of systems biology. Diverse models of gene regulatory networks have been developed such as Boolean network [46], probabilistic Boolean network (PBN) [7, 8], and Bayesian network [912]. The main objective of these network models is to study the logical interactions of genes based on large-scale microarray data and further get meaningful biological information. Recently using network method to study gene interactions of Saccharomyces cerevisiae has attracted the interest of several researchers [1316]. For example, Zhang et al. [14] examined the integrated gene interaction network of Saccharomyces cerevisiae and found many enriched multicolor network motifs corresponding to different biological themes. They concluded that significantly enriched motifs in the network are often signatures of network themes, higher-order network structures that correspond to biological phenomena. Lee et al. [15] modified a probabilistic functional gene network of the baker’s yeast, Saccharomyces cerevisiae, and experimentally verified the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis. Hu et al. [16] profiled transcriptional responses in Saccharomyces cerevisiae strains with individual deletions of 263 transcription factors. Then they reconstructed a functional transcriptional regulatory network between these transcription factors and analyzed the enrichment of promoter motifs on these transcription factors.

In 2005, Bowers et al. [17] introduced a computational approach called logic analysis of phylogenetic profiles (LAPP), which identified detailed logic relationships among gene triplets on the basis of genomic data. This method may be used for functional annotation of proteins and genes and for designing biochemical experiments to elucidate biological mechanisms. Lately, further progress has been achieved on the theory and application of higher logic. Zhang et al. [18] described a three-way gene interaction model that captures the dynamics of coexpression relationships between two genes. Shoemaker and Panchenko [19] proposed ways to address the defects of the LAPP method, such as high computational complexity, strong dependence on information spectrum, and the uncertainty of homology detection at large genetic distances. Sprinzak et al. [20] detected coordinated regulation of multiple protein complexes using logic analysis of gene expression data and identified protein complexes by mapping specific kinds of gene triples to multicomplexes triplets. Notably, the LAPP method is related to stochastic logic. Modeling approaches using stochastic logic, such as stochastic Boolean network (SBN) and stochastic multiple-valued network (SMN), have been already proposed [2124].

In this paper, we focus on the construction, analysis, and comparisons of structural characteristics of logic networks inferred from the gene expression profiles of Saccharomyces cerevisiae under aerobic and anaerobic conditions. Firstly, gene expression profiles are discretized into multiple values. Secondly, the logical AND, OR, and NOT operators are given by algebraic formulation on multiple values. Down- and upregulation self-entropy as well as joint entropy are used to compute the uncertainty coefficient. Four parameters of the logic network are generalized to more complex networks for our purpose. Putative regulator genes of respiratory mechanisms are identified by contrasting the differences of the four structural parameters. Lastly, the Newman fast algorithm is used to discover the community structures of the logic network. We find that the gene logic network under aerobic condition (aerobic network) has two significant communities while the anaerobic network only has one in this framework. Furthermore, David Gene Functional Classification (http://david.abcc.ncifcrf.gov) [25] reveals the possible biological function of these communities.

2. Materials and Methods

2.1. Expression Data

The gene expression profiles data of DNA chip in this study are taken from the GSE11452 database using GPL90 platform in the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov) (NCBI) and comprise 42 and 52 expression vectors with more than 20,000 genes in yeast under aerobic and anaerobic conditions (Table 1). Such a large number of genes incur an intractable computational complexity, so we select candidate genes based on the Wilcoxon rank sum test [26] as follows. Let ,   denote and -sized random sample of expression of gene under aerobic and anaerobic conditions, respectively. The Wilcoxon rank sum test statistic is derived from the concatenation of and , which results in vector This is done by first sorting all points in by ascending order, obtaining the sets of ranks for the and points, respectively, then calculating the corresponding sums and , and finally defining the test statistic as . Under the null hypothesis the expected value of this statistic is zero; otherwise the alternative hypothesis holds, in which case ; ; then gene is called a candidate gene. Setting the significance level to be , we obtain 73 candidate genes finally.

2.2. Methods
2.2.1. Construction of the Gene Logic Network

The expression profile of candidate gene containing samples is denoted in this work by the vector To discretize this vector, we define , , ,  , where the tuning parameter controls granularity on the interval Then for , we set if , , and similarly if , obtaining the discretized expression vector In order to calculate first-order and second-order logical relations using LAPP method, we define the following quantities for this discretized vector:(1) Downregulation self-entropy is .(2) Upregulation self-entropy is Then self-entropy is , where and are the corresponding frequency of components and , respectively, in vector Considering two vectors and , we define the following joint entropies:(3) ,(4) ,(5) ,(6)And total joint entropy is , where , are the corresponding frequencies of component in vectors and , respectively. For discretized vectors and , the uncertainty coefficient ( value) is defined asThis quantity informs on the probability that regulates Note that for simplicity and denote not only expression vectors but also the corresponding genes. The first-order logical relationship between genes and is determined as where is the proper functions of first-order logic of to . The uncertainty coefficient of is where is one of the proper functions of second-order logic of to Table 2 lists ten types of proper functions and the corresponding algebraic operations. Using these operations on and , we get Table 2 also gives the algebraic representation of three basic operators for multiple values: logical NOT is represented as , logical AND and OR can be represented as , , where , are the th component of ,

We normalize the value for each gene triplet and database by replacing it with , where is the maximum value in the own database. For simplicity, the normalized values are also denoted by The condition is used to filter out all gene triplets. The combination requires gene to be better predicted from genes and together than just gene alone or gene alone. Figure 1 shows an example for a gene triplet, for which the logical AND operation on gene and gene is denoted by As in the LAPP method, all such gene triplets, with the corresponding the values, give rise to the gene logic networks further studied in our present work.

2.2.2. Structural Parameters Definition of Logic Network

In fact, the logic network can be seen as a directed and weighted network without multiple edges and self-loops. Let be a logic network with gene node-set , directed edge-set , and function that assigns each edge and weight In fact, the edge weights can be interpreted as the uncertainty coefficients that express interaction strength between gene triplets. Various structural parameters have used to study network structure in complex network [27]. For the logic network, structural parameters are detected again including average degree, average path length, average clustering coefficient, and average betweenness to capture structural characteristics of the logic network from different angles.

(1) Average Degree (). The degree of a node is the number of nodes adjacent to it. The average degree is the average value of the degrees of all nodes. For the logic network, the definitions of in-degree, out-degree need to be reconsidered based on the principle that the sum of in-degree and that of out-degree of all the nodes in a network are equivalent. Based on the principle, in-degree and out-degree can be defined according to second-order logical relationship; if there are activations of , then in-degree of increases by However, the out-degrees of and are determined by the proportion of their contributions to the second-order logical relationship. Here we can assume that the proportion contribution to the second-order logical relationship from and that from are always equivalent. In other words, the out-degree increment of is the same as that of

(2) Average Clustering Coefficient (). For a certain node with second-order logical relationships, we need to define doublets of second-order logical relationships to measure the clustering coefficient. A doublet of second-order logical relationships is a combination of two second-order logical relationships with at least one common node.

If the common node is , we call this second-order logic doublet centered on Figure 2 shows all possible second-order logic doublets centered on These three types of second-order logic doublets are named according to the different positions of as “both-in,” “both-out,” and “in-out” doublets. If there are two common nodes in a second-order logic doublet, we call the doublet strong connected. Figure 3 shows all possible strong connected second-order logic doublets centered on The number of second-order logic doublets (including both-in, both-out, and in-out doublets) centered on is denoted by , and the number of strong connected second-order logic doublets is denoted by The clustering coefficient of node , denoted by , in a logic network with second-order logical relationships is defined as So the average clustering coefficient of network is defined as

(3) Average Path Length (). Path and its length should be reconsidered according to the different second-order logic types: AND, OR, and XOR. Take a second-order logical relationship , and an arbitrary node in other than (say ), the shortest directed path , and the distance from to are defined as follows.

Case 1 (second-order logic type is AND). Namely, nodes and regulate node cooperatively.(1) is or (say without loss of generality). The shortest paths from to are denoted by , if there is at least one directed path from to The directed path from to arrives at through and then the second-order logical relationship to The distance from to is the total sum of and , where is the length of the second-order logical relationship. In our study, is estimated by the reciprocal of the uncertainty coefficient of this second-order logical relationship. On the other hand, if and are not connected by a directed path, there is no path from to (2) is neither nor . is reachable from at most one of and Then there is no directed path starting from and ending at . is reachable from both and Then there is at least one directed path connecting and The distance from to is The directed path first reaches the nearer one of and followed by the other and finally .

Case 2 (second-order logic type is OR). Either or can regulate independently. The distance from to is estimated by the probability that activation of results from The next step is to distribute the uncertainty coefficient of the second-order logical relationship. Calculate the uncertainty coefficients caused by and , denoted by and , respectively, as follows:On the basis of above obtained uncertainty coefficients, the shortest directed path and the distance from to can be determined as follows.(1) is or (say without loss of generality). There is at least one path starting with and ending at Choose one of the shortest directed paths from to randomly, denoted by . The total sum of and the reciprocal of are denoted by and , respectively. If is less than , the shortest directed path goes directly from to through the second-order logical relationship, and the distance equals Otherwise, first arrives at and then , and is equal to . There is no path from to The path from to is only the second-order logical relationship, and the distance equals the reciprocal of (2) is neither nor . There is at least one path starting from towards or Therefore, the distance from to is . The shortest directed path from to is the corresponding path to the choice of distance. Note that when is unreachable from (or ), (or ) is infinite. Neither nor is reachable from . The distance from to is infinite and no path connects them in this case.

Case 3 (second-order logic type is XOR). Both and can activate cooperatively or independently. Therefore, XOR type of second-order logical relationship is a combination of AND type and OR type. However, when only one of and is reachable from , the condition is the same as OR logic. When both and are reachable from , or is or , it is the same as AND logic.

According to the definition above, all shortest directed paths and all distances can be found in a logic network. And the average path length of a logic network is defined as , where is the set of ordered pairs of nodes and the distance from the first one to the second one is finite; that is, is the number of the elements in

(4) Average Betweenness (). Betweenness centrality [28] is one indicator of a node’s centrality in complex network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. A node with high betweenness centrality has a large influence on the transfer of items through the network, under the assumption that item transfer follows the shortest paths. Let be the set of all the shortest paths (allowing more than one shortest paths between two nodes) in the logic network If node appears at least once in a directed path starting with and ending at , then is referred to as intermediate node in this path, denoted by The standard betweenness of , denoted by , is defined as

The standard betweenness centrality ranges from zero to one. Higher betweenness means larger possibility to appear in the shortest paths. It is more important in the structure and information transfer of a network. Therefore betweenness can help to discover crucial nodes that may have significant impacts on the structural characteristics of the logic network.

2.2.3. Community Structures in the Logic Network

Another important feature of complex network is its community structures, which depict the organization of vertices into clusters, with many edges joining vertices of the same cluster and comparatively few edges joining vertices of different clusters. The community is a good tool to describe network structures and provides better understanding of network functions. Several researchers [2931] proposed algorithms to detect community structures. Newman fast algorithm [32] is a greedy modularity algorithm starting from a set of isolated nodes. The links of the original graph are iteratively added such that the largest possible increase of the modularity at each step is achieved. The fast algorithm is used to find the community structure in the logic network. In addition, the inside-to-outside-of-community ratio is used to evaluate how close of community connections in the network, where is number of edges within a community and is the number of edges cooperating the internal and external node in a community.

3. Results

In order to highlight the characteristics of the logic network structures, we contrast the change curves (Figure 4) of four structural parameters along with threshold from 0.1 to 0.9 between the aerobic network and the anaerobic network with step length 0.1. We find that the average degree, average clustering coefficient, and average betweenness of the aerobic network are greater than those of the anaerobic network for all thresholds. It can be seen that the average path length of the aerobic network is greater than that of the anaerobic network in some threshold ranging from 0.3 to 0.7. The significant changes of parameters mean that the energy metabolism conditions for Saccharomyces cerevisiae in aerobic and anaerobic respirations actually differ on the molecular level.

Each node of the logic network corresponds to a different structural parameter. From this, we obtain the degree, clustering coefficient, betweenness, and path length of each node. By calculating and ranking the difference values of the four parameters for each node, we capture the top five gene nodes. In Table 3, -Difference, -Difference, -Difference, and -Difference denote the difference values of the four structural parameters, respectively. Finally we get the intersection of these gene nodes which includes genes ATP6, YIG1, RGI2, BAG7, and COX1. The structural parameters of these genes change significantly comparing the aerobic network with the anaerobic network. For example, some genes are allocated a higher degree in the aerobic network than in the anaerobic network. That is to say, these genes have great contribution to degree structure changes for the two networks, so we define them as the structural key genes.

If gene connects other genes by a certain second-order logic relationship, then these genes together from one set are denoted by For example, if , then , ; if , then If the gene of has a particular function, then we predict that gene also has the same function. David Gene Functional Classification analysis is carried out on The second column of Table 4 shows that the functional annotation of these genes has been detected in David database while the final column lists the predicted functions of them.

Table 5 shows nonisolated nodes, the numbers, and modularity of community structures. Utilizing Newman fast algorithm, we find that the aerobic network has two obvious community structures. The modularity is 0.3756 including 47 and 15 nonisolated nodes. The corresponding ratio of inside and outside of community is 87.3 and 19.8, respectively. There are three community structures in anaerobic network, including 13, 16, and 5 nodes. The corresponding ratio of inside and outside of community is 1.8, 6.0 and 0.57 (see Table 6).

4. Conclusions

By David Gene Functional Classification, we predict that the genes ATP6 COX1 BAG7 and YIG1 are involved in the yeast mitochondrial respiratory chain while RGI2 is involved in membrane transportation (Table 4). In fact, ATP6 [33] belongs to the family of genes, which is called the mitochondrial respiratory chain complex. It participates in the respiratory chain and provides information for synthesis of a protein encoded by mitochondrial DNA that is essential for normal mitochondrial function. YIG1 [34] encodes protein and is involved in regulating anaerobic glycerol metabolism in Saccharomyces cerevisiae. Deletion or overexpression of YIG1 significantly affects growth yield or glycerol yield in anaerobic batch cultures. This is consistent with the previously proposed low flux control exerted at the Gpp level. BAG7 [35] is involved in the energy metabolism under aerobic conditions; its expression is induced under carbon limitation and suppressed under high glucose; COX1 [36] belongs to the family of cytochrome C-oxidase, catalyzing the reduction of oxygen to water in the respiratory chain. RGI2 [37] is a protein of unknown function associated with metabolism under conditions of aerobic respiration; its expression is induced under carbon limitation and suppressed under high glucose. It is involved in the control of metabolism and significantly contributes to cell fitness, especially under respiratory growth conditions.

Some biological functions of organism are realized by gene interactions of proteins. These genes are closely linked, thus revealing the community phenomenon of the logic network. We produce the spider diagrams in Figures 5 and 6 using the Pajek software (http://mrvar.fdv.uni-lj.si/pajek/). The aerobic network is shown to have two distinct communities while the anaerobic network has an obvious community. These communities may correspond to certain biological functions. So we apply the David Gene Functional Classification analysis to these communities. The result reveals that, in the first community of the aerobic network, seven genes including 4533_at, 5527_at, 8444_at, 8426_at, 10446_s_at, 6116_at, and 6908_at are related to the cell cycle function. Ten genes including 2565_s_at, 2498_s_at, 3215_f_at, 8822_at, 6044_at, 2361_s_at, 6590_at, 10644_at, 10041_at, 8431_at, and 2425_at are associated with the DNA transcription. In the second community of the aerobic network, 12 genes including 3975_at, 3970_s_at, 3974_at, 3959_at, 4008_at, 3988_at, 3987_at, 2623_s_at, 2622_s_at, 2845_g_at, 2793_s_at, and 3966_i_at are involved in the mitochondrial respiratory chain.

5. Discussions

For the nonbinary gene expression profiles, this work gives a novel approach in calculating the self-entropy (including downward self-entropy, upregulation self-entropy) and joint entropy for gene vectors. LAPP method is utilized to find all gene triplets. Furthermore the gene logic networks are constructed. The relationships between the structure and the function of the network could help us to understand metabolisms of Saccharomyces cerevisiae on the molecular level. To analyze the structural difference between two networks, parameters such as average degree, average clustering coefficient, average path length, and average betweenness, which reveal that the second-order logical types have significant differences among the different experimental gene sets, have been generalized to the networks. The differences may provide us with a new idea and some reference to the biologists on their research work. However, how the other analytical methods, such as community structures, can be generalized into the logic networks is still an interesting issue. By applying the Newman fast algorithm to the logic networks, we find that the aerobic network has two significant communities, while the anaerobic network has only one significant community. Furthermore the David Gene Functional Classification shows that one of two communities possibly relates to cell cycle and cell replication functional group; the other possibly correlates mitochondrial respiratory chain functional group in the aerobic network.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research is supported by Grants 61170183, 10971122, and 61272018 from National Natural Science Foundation of China, Scientific Research Foundation of Shandong University of Science and Technology for Recruited Talents (2014RCJJ033). The authors would like to thank Dr. Attila Gulyás-Kovács at the Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, for their help in revising the paper.