Abstract

Modularity in protein interactome networks (PINs) is a central theme involving aspects such as the study of the resolution limit, the comparative assessment of module-finding algorithms, and the role of data integration in systems biology. It is less common to study the relationships between the topological hierarchies embedded within the same network. Such embedding is not unusual, in particular for PINs, which are considered assemblies of various interactions depending on specialized biological processes. The integrated view offered so far by modularity maps generally represents a synthesis of a variety of possible interaction maps, each reflecting a certain biological level of specialization. The driving hypothesis of this work leverages such network components. Subnetworks are therefore generated by fragmentation, a process aimed at isolating parts of a common network source; these parts are here called fragments, from which the acronym fragPIN derives. The characteristics of modularity in each obtained fragPIN are elucidated and compared. Finally, as it was hypothesized that different timescales may underlie the biological processes from which the fragments are computed, the analysis was centered on an example involving the fluctuation dynamics inherent to the signaling process. The aim was to show how timescales can be identified from such dynamics, in particular by assigning the interactions to them based on selected topological properties.

1. Introduction

PINs [1] are almost pervasively studied in genomics, but especially when H. sapiens is considered they present limitations due to sparse coverage and suboptimal accuracy of both experimental (yeast two-hybrid, for instance) and in silico measurements (literature mining, orthology, etc.) [2, 3]. This overall uncertainty is reflected in a pathological presence of false positives and negatives and ultimately complicates data mining and analysis tasks. In order to bypass the complexities induced by such factors, data integration strategies are widely pursued (for instance, the studies in [4, 5] have become quite popular). However, a difficulty comes from the fact that the integrated entities are usually heterogeneous, and thus normalization and rescaling need to be considered. An excellent example of the complexity underlying a sequence of integrative omics tasks is offered by the personal omics profiling work recently published by Chen et al. [6], which soon became a reference for personalized medicine research.

The working hypothesis of this short paper is to adopt an opposite investigation strategy compared to aggregation: instead of integrating the PIN dataset with data from other omics sources, its constituent entities were explored, considering the building blocks that biologically allow for the protein interactions to be observed and measured, at least in part. A PIN map consists of three main types of constituent entities: positive data, that is, the measured physical interactions, which represent the real evidence; negative data, that is, the interactions that are not present, considered as latent variables; and uncertain data, that is, noisy information (false positives) for which partial recovery is possible through data integration. Notably, this mix is usually measured through both transient and persistent PIN dynamics, together with the related degree of uncertainty.

This work aims to elucidate, through the comparative assessment of module-finding algorithms, the relationships between topologies that belong to the same network. In particular, a PIN can be considered to assemble various interactions which depend on specialized biological processes. The integrated view generally offered by modularity maps represents indeed a synthesis of a variety of possible interaction maps [7–9] embedded in the same network. Here, a list of such maps is considered individually, and the term fragPIN is used to indicate the type of network generated by fragmentation, a process that retrieves from the same network source a certain number of biologically differentiated subnetworks. The characteristics of modularity in each obtained fragment were then elucidated, helping to investigate the hypothesis that different timescales may underlie the interactive dynamics related to the biological processes from which the fragments are computed. As an example, an analysis of PIN fluctuation dynamics for signaling was carried out to show how the inherent timescales can be identified and how interactions can be assigned to them based on selected topological properties.

Following the work of Huthmacher et al. [10], previous examples of comparative network biology analysis have been suggested by Durek and Walther [11] in an attempt to elucidate the implications of PINs for the regulation of the underlying reaction networks. A comprehensive analysis of enzyme-enzyme interactions in the metabolic networks of E. coli and S. cerevisiae has thus been performed. The latter involved the analysis of topological properties of these different but related networks and addressed issues such as the efficiency of metabolic processes and how the organization of enzyme interactions correlates with metabolic efficiency.

The methods adopted in the above papers required the study of the global network connectivity properties, various filtering steps to reveal organizational differences between all interaction sets and networks targeted to metabolism, and the analysis of scale-free exponents, average clustering coefficient, degree correlation, distance, and centrality. In this work, priority was assigned to fragPIN modularity: modules were computed according to two popular techniques, and the differential configurations thus obtained were assessed. Modules are characterized by interactions occurring at different timescales and to a degree that depends on the involved biological processes. Unfortunately, technological and experimental sources cannot provide the needed detail of information. Therefore, the timescale decomposition offered by fragPIN and inherent to each particular process must be determined in some other way, for instance in silico through the computational approach described below.

2. Methods

Like all interactome datasets, the S. cerevisiae (yeast) interactome presents its complexities; the work of Reguly et al. [12] is an optimal choice, particularly with regard to the literature-curated (LC) interactions from small-scale experiments (among other disaggregated interactome information presented by the authors). The dataset involves 31793 publications and reports about 11334 nonredundant interactions (from a total of 33311) and 3289 proteins. Given this yeast source, a compilation of PINs was built and studied to compare their modular properties. Each subinteractome was analyzed according to the characterizing biological process. This process was called PIN fragmentation. The natural consequence of fragmentation is that specific PINs are built whose connectivity patterns reflect the dynamics inherent to the separately involved biological process. The list is reported below.

(i) rPIN = Reguly LC interactome: the source interactome.

(ii) mPIN = metabolic PIN. It is obtained by filtering the rPIN such that proteins whose GO terms are not associated with metabolism (source: SGD db, http://www.yeastgenome.org/) are removed. mPIN contains interactions between metabolic proteins.

(iii) ePIN = enzyme PIN. It is obtained by filtering rPIN through known annotated enzymes (source: KEGG db, http://www.genome.jp/kegg/). ePIN contains only interactions between enzymes.

(iv) pPIN = pathways PIN. It is obtained by filtering the rPIN through pathways retrieved from the KEGG db. pPIN contains only interactions between proteins involved in annotated pathways.

(v) cPIN = cell-cycle PIN. It is obtained by filtering rPIN through proteins involved in cell-cycle processes (source: MIPS, mips.helmholtz-muenchen.de/genre/proj/yeast/ and SGD db). cPIN contains interactions between proteins involved in the cell-cycle process.

(vi) tPIN = transcription factor PIN. It is obtained by filtering rPIN through transcription factors (source: YEASTRACT db, http://www.yeastract.com/). tPIN contains interactions between transcription factors.

(vii) ttPIN = transcription factor with targets PIN. It is obtained by filtering rPIN through transcription factors from the YEASTRACT db. It contains interactions between transcription factors and their target proteins.

(viii) sPIN = signalling PIN. It is obtained by filtering rPIN using signalling pathways retrieved from the KEGG db. sPIN contains only interactions between proteins involved in annotated signalling pathways.
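As an illustration of the filtering step shared by most of the fragments above, the following minimal sketch keeps only those interactions whose two endpoints both belong to an annotation set. The function name `fragment`, the toy edge list, and the protein labels are hypothetical and are not taken from the Reguly dataset or from the databases cited above.

```python
def fragment(edges, keep):
    """Return the interactions whose two endpoints both belong to `keep`."""
    keep = set(keep)
    return [(a, b) for a, b in edges if a in keep and b in keep]


# Toy source network with hypothetical protein labels (stand-in for rPIN).
rpin = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "E")]

# Hypothetical annotation set (stand-in for, e.g., a list of enzymes).
enzymes = {"A", "B", "C"}

epin = fragment(rpin, enzymes)
print(epin)  # [('A', 'B'), ('B', 'C')]
```

A fragment such as ttPIN, which links transcription factors to their targets, would need a different rule (one endpoint in each of two sets); the sketch covers only the cases described as containing interactions between proteins of the same annotated class.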

3. Results

As a first check, distributional properties are computed through power laws, that is, $P(k) \sim k^{-\gamma}$, and reported in Figure 1 for each fragPIN together with the corresponding estimated exponents (see [13–17] for a general treatment of the topic). The distributions appear quite different, as expected, and this depends on the structure and size of the fragPIN which is considered.
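The text does not state how the exponents were estimated. As one possible illustration, the sketch below uses the standard continuous maximum-likelihood estimator for a power law $P(k) \sim k^{-\gamma}$, namely $\hat{\gamma} = 1 + n / \sum_i \ln(k_i / k_{\min})$. The choice of estimator, the function name, and the toy degree list are assumptions made here for illustration only.

```python
import math


def powerlaw_exponent(degrees, k_min=1.0):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Only degrees >= k_min enter the estimate (assumed lower cutoff).
    """
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree above k_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sequence of a small network.
degrees = [1, 1, 1, 1, 2, 2, 3, 5, 8]
print(round(powerlaw_exponent(degrees), 3))
```

For integer degree data this continuous formula is only an approximation, and the result is sensitive to the cutoff `k_min`, so any comparison between fragments should use the same cutoff rule.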

3.1. Modularity

Modularity is often naturally computed when networks are employed. Many algorithms have become available, and a couple of them have been selected based on the popularity and consensus achieved. The first of such methods that we employed is MCODE [18], which exploits local graph density to suggest possible associations between protein complexes and locally dense regions of a graph computed from a clustering coefficient, that is, $C_i = 2n / (k_i (k_i - 1))$, where $k_i$ is the node size of the neighborhood of node $i$, and $n$ is the number of edges in the neighborhood. The $k$-core is the structure that one finds in a graph; it is a network of minimal degree $k$ defined as the remaining subgraph after all the nodes with degree $< k$ have been removed successively.

The procedure is as follows: (a) when a node is removed, all its adjacent edges will also be removed; (b) after a node of degree $< k$ is removed, in the remaining graph all the remaining nodes with a new degree $< k$ also need to be removed. In other terms, given a graph $G = (V, E)$, the $k$-core is computed by pruning all the nodes in $V$ (with their edges in $E$) with degree less than $k$ until all nodes in the remaining network have at least degree $k$.

Then, if a node belongs to the $k$-core but not to the $(k+1)$-core of the graph, it has coreness degree $k$. The highest $k$-core of a network is the central most densely connected sub-network. After vertex weighting, complex prediction is conducted where the relevance of each cluster is validated against known complexes or functional modules, and final statistics are computed about clusters size, density, and functional homogeneity.
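The pruning procedure and the notion of coreness described above can be sketched as follows. This is a minimal illustration on a toy undirected graph stored as a dictionary of neighbor sets; it is not the MCODE implementation, and the function names and node labels are hypothetical.

```python
def k_core(adj, k):
    """Prune nodes of degree < k until all remaining nodes have degree >= k."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        low = [v for v, nbrs in adj.items() if len(nbrs) < k]
        for v in low:
            # Removing a node also removes all its adjacent edges.
            for u in adj.pop(v):
                adj[u].discard(v)
            changed = True
    return adj


def coreness(adj):
    """Largest k such that the node belongs to the k-core."""
    core, k, current = {}, 0, adj
    while current:
        for v in current:
            core[v] = k
        k += 1
        current = k_core(current, k)
    return core


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(sorted(k_core(toy, 2)))  # ['A', 'B', 'C']
print(coreness(toy))
```

In the toy graph the pendant node is pruned when computing the 2-core, so it has coreness 1, while the triangle nodes have coreness 2 and form the highest $k$-core.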

The main modules identified for all fragPIN are reported in Figure 2 (table format). To obtain them, parameters for network scoring have been set as follows: degree cutoff = 2; for cluster finding: node score cutoff = 0.2; haircut = true; fluff = false; $k$-core: 2; and maximum depth from seed: 100.

Modularity can then be computed by another very popular community-finding method called maximum modularity (MaxMod). To implement such a method, greedy optimization algorithms have been employed by Clauset et al. [19] to find the best possible modularity structure in networks. In summary, a greedy procedure iteratively merges the pair of modules showing the largest modularity increase, until no further gain is observed.

The optimization function $Q$ [20] is reported below. It is defined as an approximate difference between links observed in a modular network versus those expected in a network of equivalent size where they have been randomly placed. Therefore, a value of zero for $Q$ indicates that the fraction of within-module links is not different from what would be expected from a randomized network of equivalent size. Nonzero values of $Q$ indicate deviation from randomness, and values around 0.3 suggest the presence of modular structure (this result comes from extensive simulations reported in the above references):

$$Q = \sum_i \left( e_{ii} - a_i^2 \right)$$

The formula reports the fractions $e_{ii}$ of links related to nodes within a module $i$ and the fractions $a_i = \sum_j e_{ij}$ of link ends attached to module $i$, including those coming from all other modules. Therefore, a good partition into modules leads $Q$ to approach 1; vice versa, the presence of random links between nodes (i.e., poor modularity) would make the two terms not too different, thus delivering a $Q$ close to 0. Figure 3 shows cores detected in cPIN, while Figure 4 shows a community map for it.
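To make the modularity score concrete, the sketch below evaluates $Q = \sum_i (e_{ii} - a_i^2)$ for a given partition of an undirected, unweighted network, with $e_{ii}$ the fraction of links inside module $i$ and $a_i$ the fraction of link ends attached to module $i$. It only scores a partition and does not perform the greedy merging; the function name and the toy network are hypothetical.

```python
def modularity(edges, module_of):
    """Modularity Q of a partition of an undirected, unweighted network."""
    m = len(edges)
    within = {}  # number of links inside each module (numerator of e_ii)
    ends = {}    # number of link ends attached to each module (numerator of a_i)
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            within[cu] = within.get(cu, 0) + 1
    return sum(within.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Toy network: two triangles joined by a single link.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_modules = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
one_module = {n: "a" for n in range(1, 7)}

print(modularity(edges, two_modules))  # 5/14, about 0.357
print(modularity(edges, one_module))   # 0.0
```

Placing each triangle in its own module gives $Q = 2(3/7 - 1/4) = 5/14$, whereas lumping all nodes into one module gives $Q = 0$, which matches the interpretation of $Q$ given in the text.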

3.2. Timescale Decomposition

Biological processes embed dynamics that respond to different timescales; a major problem is how to measure them, in particular in relation to interactive associations [21]. One way to introduce dynamics at the interactome scale is to integrate gene expression values, ideally obtained through time course measurements. However, when such coupled measurements are not available, the problem of deciphering network dynamics is difficult to solve. In a companion paper [22], a special network decomposition approach elucidating both coarse and fine timescales through wavelets [23–26] was proposed. While the focus in previous work was on some particular pathways, a generalization is put forth here.

The use of wavelets depends on the entities to be measured; those allowing for a suitable timescale decomposition are good candidates. Such entities, in our case, can be identified by topological features that, once measured at each protein (i.e., node), contribute to quantifying a vector-valued signal. The latter can then be decomposed by wavelets. In our application, every entry of the feature vector computed from the PIN and to be decomposed across timescales represents a topological property.
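As a simple illustration of decomposing a node-indexed feature signal across scales, the sketch below applies an orthonormal Haar transform. The text does not specify the wavelet family, so the Haar choice, the function names, and the toy feature values are assumptions made here for illustration only.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length signal."""
    s = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]
    return approx, detail


def haar_decompose(signal):
    """Return detail coefficients (finest scale first) and the final approximation."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details, approx = [], list(signal)
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return details, approx


# Hypothetical feature values (e.g., node degrees) ordered along the nodes.
feature = [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 6.0, 0.0]
details, approx = haar_decompose(feature)
for level, coeffs in enumerate(details, start=1):
    print(level, [round(c, 3) for c in coeffs])
```

Each list in `details` corresponds to one scale, from the finest to the coarsest. Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared feature values, so variability can be compared across scales.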

An example, apart from the usually exploited degree feature, is provided by betweenness [27–29]. This centrality measure is computed at each network node and increases depending on the volume of crossing at the node, that is, the shortest paths (geodesics) going from an origin to a destination through the node relative to the total number of geodesics observed between start and end nodes. For distinct nodes $s$, $t$, and $v$, with $\sigma_{st}$ the number of the shortest paths from $s$ to $t$ and $\sigma_{st}(v)$ the number of those shortest paths passing through $v$, it holds that

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
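Betweenness can be computed directly from its definition on small networks. The sketch below counts shortest paths by breadth-first search and sums, over unordered pairs of end nodes, the fraction of geodesics passing through each node. It is a brute-force illustration rather than an efficient algorithm; counting each pair once is a convention assumed here, and the function names and toy graph are hypothetical.

```python
from collections import deque


def bfs_counts(adj, s):
    """Distances from s and numbers of shortest paths from s to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Sum over unordered pairs (s, t) of the share of geodesics through v."""
    info = {s: bfs_counts(adj, s) for s in adj}
    nodes = sorted(adj)
    score = {v: 0.0 for v in adj}
    for i, s in enumerate(nodes):
        dist_s, sigma_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            for v in nodes:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a geodesic from s to t when distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


# Toy path graph A - B - C - D.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(betweenness(path))
```

In the path graph the two inner nodes each lie on the geodesics of two pairs, so they score 2, while the end nodes score 0.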

Another problem is how to establish significant variation between timescales in the wavelet values. The approach proposed in our previous methodological paper was centered around two steps: (a) denoising [30–33] applied to get rid of disturbances of random nature; (b) clustering [34] aimed to discriminate between significant and nonsignificant values.

The variability in the measures was initially analyzed through the IQR (interquartile range) robust statistic in order to select the most variable fraction of the data (the half that was selected was called coreset), while discarding the residual part (the box values proximal to the median). A second partitioning was then made of the selected data fraction. In order to control the coreset timescale specificity, some clusters were retrieved. However, the remaining scattered values, that is, the values not assigned to clusters, were also evaluated.
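One possible reading of the IQR-based selection is sketched below: values falling outside the interquartile box are kept as the coreset, and values inside the box are set aside. The quartile convention (the default of Python's `statistics.quantiles`), the function name, and the toy values are assumptions made here and may differ from the original implementation.

```python
import statistics


def iqr_split(values):
    """Split values into those outside the interquartile box and those inside."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    coreset = [x for x in values if x < q1 or x > q3]
    central = [x for x in values if q1 <= x <= q3]
    return coreset, central


# Hypothetical wavelet coefficient values at one timescale.
values = [1, 2, 3, 4, 5, 6, 7, 8]
coreset, central = iqr_split(values)
print(coreset)  # [1, 2, 7, 8]
print(central)  # [3, 4, 5, 6]
```

With continuous data and no ties, each part contains roughly half of the values, consistent with the description of the coreset as the selected half.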

A tight clustering technique was adopted, based on a mix of hierarchical and $k$-means approaches integrated by bootstrap to form stable clusters. Overall, the clusters did not yield significant protein modules through which to analyze connectivity or inherent association power of biological relevance. Clusters were also computed over the entire sets of values (without the IQR split into coreset and scattered values), and yet did not deliver biological evidence. Conversely, the analysis of the scattered feature values proved to be more fruitful in terms of reference to timescale specificity, especially for the impact on pathway proximity rather than on network connectivity.

3.3. Transient versus Permanent Interactions

A final aspect is how to measure transiency and permanence of interaction dynamics. The emphasis was placed on the specific interaction dynamics relative to modular connectivity computed within and between timescales, together with pathway proximity. Graphical evidence is reported in Figures 5 and 6. Basically, a scan was first produced through the entire wavelet resolution spectrum for each module under differential conditions, followed by back projection to the PIN of the established associations between particular protein interactions and timescales.

Thus, the cases for which interactive dynamics are simultaneously present at multiple timescales were visualized, together with the links that possibly appear between them. S1 (see S1 in the supplementary material available at http://dx.doi.org/10.1155/2013/307608) reports timescale proximity at pathway level (signaling), which complements the graphical evidence reported at modular network scale. S2 reports the histograms of wavelet-decomposed feature signals (levels and their differences) and diagnostic plots; S3 reports module connectivities detected from each feature across timescales; and S4 reports GO annotation for the identified interactions.

Figure 5 shows timescale-specific interactions computed from feature-dependent modules in sPIN. Note that the different colors identify the different timescales that have been detected by the algorithms. Figure 6 reports instead much denser modules, with reference to ttPIN. In terms of comparative evaluation, Figures 3 and 4 refer to cores and communities, respectively, which are typical modules found in many studies after applying very well-known methodologies; the proposed approach shows the limitations of such methodologies in detecting resolutions or timescales. Therefore, by involving topological properties computed over specialized PINs, and in particular the information coming from the biological processes, the induced connectivity dynamics between proteins can be emphasized and suitably represented. From a biological point of view, this passage might be important for two reasons: (a) the possibility to adopt a differential network analysis based on a comparison of PINs evaluated before and after certain perturbations; (b) the assessment of PIN module configuration changes that might explain phenotypical alterations based on well-characterized protein dynamics.

4. Concluding Remarks

Fragments of PIN offer interesting inference perspectives. The most important aspect is that, in reduced dimensionality and complexity, some specialized module functions could be analyzed and possibly validated with reference to specific aspects related to a target pathway or biological process. The second aspect of potential interest is the development of differential network analysis in response to conditions that may affect network dynamics. Finally, time and space are two dimensions that define network dynamics and are often overlooked; the timescale analysis proposed here is an example of computational analysis that might provide relevant information to build more accurate profiles. Since protein interactome dynamics could not be observed directly from measurements at the experimental level, which would embed the dynamics from their generating timescales, an attempt was made to computationally dissect the interactome and thus to separate the effects induced by the biological processes that were found to be involved. The differences that were detected find justification in a variety of reasons that cannot be inferred from the plain interactome data; however, after examining each separate PIN, a result was that in some cases the timescale dynamics can be revealed through the employed PIN topologies.

Acknowledgment

The author would like to thank his previous collaborators Elisabetta and Antonella.

Supplementary Materials

The Supplementary Material is reported to emphasize graphical evidence obtained with the various methods used in our approach, and also provides annotations for the detected modules.
