Abstract

Motivation. Membrane proteins play essential roles in the cellular processes of organisms. Photosynthesis, transport of ions and small molecules, signal transduction, and light harvesting are examples of processes which are realised by membrane proteins and contribute to a cell's specificity and functionality. The analysis of membrane proteins has proven to be an important part of understanding complex biological processes. Genome-wide investigations of membrane proteins have revealed a large number of short, distinct sequence motifs. Results. The in silico analysis of 32 membrane protein families with domains of unknown function discussed in this study led to a novel approach which describes the separation of motifs by residue-specific distributions. Based on these distributions, the topology state of the majority of motifs in hypothesised membrane proteins with unknown topology can be predicted. Conclusion. We hypothesise that short sequence motifs can be separated into two groups. On the one hand, there are structure-forming motifs, which show high prediction accuracy in all investigated protein families. This points to their general importance in α-helical membrane protein structure formation and interaction mediation. On the other hand, motifs which show high prediction accuracies only in certain families can be classified as functionally important and relevant for family-specific functional characteristics.

1. Introduction

Membrane proteins are essential for many fundamental biological processes within organisms. Active nutrient transport, signal and energy transduction, and ion flow are only a few of the numerous functions enabled by membrane proteins [1]. Membrane proteins obtain their specific functionality by individual folding and interactions with the hydrophobic membrane environment as well as, in many cases, by oligomeric complex formation and protein-protein interactions [1, 2]. The identification of such complexes and interactions is valuable, since, on the one hand, detailed information on the function of an unknown membrane protein can be obtained by analysing its interactions with proteins of known function. On the other hand, biological processes can be comprehended as a dynamically fluctuating system, whereby the biological role of the unknown membrane protein can be defined more precisely [3, 4]. Accordingly, destabilisation of the three-dimensional structure of a membrane protein caused by mutations or ligand interactions is a trigger for numerous diseases, for example, diabetes insipidus, cystic fibrosis, hereditary deafness, and retinitis pigmentosa [5–7].

Although 20%–30% of all open reading frames of a typical genome encode membrane proteins [5, 8, 9] and 60% of all drug targets are membrane proteins [2], membrane proteomics is still an experimentally challenging field due to poor protein solubility, a wide intracellular concentration range, and, thus, inaccessibility to many proteomics methodologies [10]. Hence, the number of known three-dimensional structures is relatively small, with 394 nonredundant membrane protein chains currently available [11–13]. Therefore, there is a need for approaches that allow the prediction of structural and functional features of unknown membrane proteins. A variety of methods have been developed to predict structural features from sequence, such as membrane-spanning α-helices and extra/intracellular domains (e.g., TMHMM [14], PHDhtm [15], MEMSAT3 [16]) as well as membrane-spanning β-strands of transmembrane β-barrel proteins (e.g., BOCTOPUS [17]). Furthermore, in genome-wide membrane protein sequence analyses, numerous short conserved sequence motifs were identified [18]. As an example, the most widely discussed GxxxG motif has been shown to be significantly present in transmembrane α-helices. Both glycines rest on one side of the helix as spatially neighbouring residues and thereby form a smooth helix surface; structural studies confirmed that the GxxxG motif plays an important part in mediating helix-helix interactions [18–22]. In general, short conserved membrane protein motifs are considered to be relevant for membrane protein folding and structural stability as well as being involved in defining a protein’s function. Hence, sequence motif analyses and the resulting insights can support the understanding of protein dynamics. Information can be derived which may contribute to studying the dynamics of mutant proteins and the effects of mutagens [23–25]. Additionally, as addressed in [26], the analysis of sequence motifs in proteins with similar function or structure might help to identify essential functional sites and locations which contribute to structural stability.

In this work, we focused on previous studies and results that have been reported by Liu and colleagues [18]. In that work, various integral membrane protein families with polytopic membrane domains had been obtained from the Pfam database [27]. As part of their studies, locations of the least conserved residues (glycine, proline, and tyrosine) in α-helical transmembrane regions had been investigated. As a result, short motifs consisting of pairs of small residues (glycine, alanine, and serine) surrounding single or multiple variable positions had been identified in conserved sequences and Pfam-classified families. Based on these results, we have developed a prediction approach to assign the topological state of a sequence motif in the protein structure based on sequence information. We have used cross-validation to verify the prediction accuracy. However, prediction accuracy has been found to be variable for certain motifs with regard to the investigated protein families. Accordingly, we hypothesise that short sequence motifs can be separated into two groups. On the one hand, there are structure-forming motifs, which show high prediction accuracy in all investigated protein families. This points to their general importance in α-helical membrane protein structure formation and interaction mediation. On the other hand, motifs which show high prediction accuracies only in certain families can be classified as functionally important and relevant for family-specific functional characteristics.

2. Materials and Methods

2.1. Used Membrane Protein Families

As the first step of our analysis, 32 membrane protein families with domains of unknown functions (DUF) were obtained from the Pfam database [27] using extended keyword searching. All 7051 sequences were retrieved for statistical analysis. The full list of employed membrane protein families is given in Table 1. Subsequently, 50 sequence motifs, identified by Liu and colleagues [18], were localised in the obtained set of families.

2.2. Programs and Tools

To avoid generating misleading statistics by including identical or highly similar sequences, a set of nonredundant sequences was generated. Here, we defined the sequence redundancy threshold at 25% sequence identity. In the first step of sequence processing, CD-HIT [29] was applied for an initial clustering. However, CD-HIT only accepts nonredundancy thresholds of 40% or higher. This limitation is caused by the internal word-length filtering approach and statistical presets. Hence, to ensure clustering sensitivity, a 60% nonredundancy threshold, which corresponds to the tetrapeptide word filtering used by the program, was applied. In the second step, sequence clustering at the 25% redundancy threshold was performed using BLASTClust [30]. The representative sequences of all clusters were extracted, leading to a set of 2511 nonredundant sequences.

Subsequently, membrane and nonmembrane associated sequence regions were determined using the TMHMM Server v. 2.0 [14]. TMHMM predicts intra/extracellular regions and integral membrane helices based on sequence. Additionally, the probability of the prediction is given for each residue. According to the results obtained from TMHMM, a topological state was assigned to each residue. A residue was assigned as “TM” if the posterior prediction probability of this residue being part of a membrane helix was found to be greater than 90%. If the posterior prediction probability of the residue was found to be greater than 90% for extra/intracellular prediction, the residue was assigned as “nTM.”
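The residue-level assignment rule can be illustrated with a minimal Python sketch. It assumes that the per-residue posterior probabilities have already been parsed from the TMHMM output into two lists; the function name, the argument names, and the handling of residues that reach neither threshold (left unassigned as `None`) are assumptions made for illustration and are not specified in the text.

```python
from typing import List, Optional


def assign_residue_states(p_membrane: List[float],
                          p_non_membrane: List[float],
                          threshold: float = 0.9) -> List[Optional[str]]:
    """Label each residue "TM", "nTM", or None from per-residue posteriors."""
    states: List[Optional[str]] = []
    for p_tm, p_ntm in zip(p_membrane, p_non_membrane):
        if p_tm > threshold:
            states.append("TM")
        elif p_ntm > threshold:
            states.append("nTM")
        else:
            # Assumption: residues below both thresholds stay unassigned.
            states.append(None)
    return states


example = assign_residue_states([0.95, 0.50, 0.02], [0.05, 0.50, 0.98])
```

In this toy input the first residue is labelled “TM”, the second stays unassigned, and the third is labelled “nTM.”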

2.3. Used Motifs

The short sequence motifs analysed in our work have been reported in [18]. In that study, Liu et al. analysed consensus sequences of 168 Pfam-A families to identify significant amino acid pair motifs. By comparing their results with earlier published findings (see [20]), a list of 50 significant motifs was derived, which we used in our work (for original data see [18], Table 1, List 3): GG4, GL3, GG7, GL1, AG7, GA7, AG4, PL2, AS4, AL6, LP1, PG9, GA4, FG1, SL1, SG4, PL1, AA7, AG5, LF8, IA1, GV1, AI1, AA2, GL2, AA3, SL2, PG5, PG6, IL4, GS5, VL4, GV2, IG1, PG10, LY6, LF10, SA6, LG5, SA3, PF1, GS4, IV4, LS1, GY8, IG2, LF9, VF8, VG6, GN4.

Intuitively, the reported short sequence motifs can be written in a generalised, regular expression-like form XYn, where X and Y correspond to amino acids separated by highly variable positions and n denotes the sequence distance between X and Y. However, in the process of analysis we found that short motifs with a relatively small number of variable positions (more precisely, if n is found to be less than 3) do not contain enough information to be investigated by our approach. Thus, these motifs have been discarded, which resulted in a final set of 33 sequence motifs. In our nonredundant sequence set, almost 250,000 single motif occurrences were identified. As an example of motifs located in a membrane protein structure, Figure 1 illustrates seven motifs which can be found in the structure of bacteriorhodopsin (PDB-Id: 1brr).
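As an illustration of the XYn notation, the following Python sketch translates a motif name into a regular expression and lists all occurrences in a sequence, including overlapping ones. It follows the convention stated in Section 2.4 that LG5 has four variable positions, that is, n − 1 arbitrary residues between X and Y. The function names and the toy sequence are hypothetical.

```python
import re
from typing import List, Tuple


def motif_regex(motif: str) -> re.Pattern:
    """Translate a motif name such as 'LG5' into a pattern matching L....G."""
    first, second, n = motif[0], motif[1], int(motif[2:])
    # n - 1 variable positions lie between the two anchor residues.
    # The lookahead makes overlapping occurrences visible to finditer().
    return re.compile(f"(?=({first}.{{{n - 1}}}{second}))")


def find_motif(sequence: str, motif: str) -> List[Tuple[int, str]]:
    """Return (start index, matched substring) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in motif_regex(motif).finditer(sequence)]


hits = find_motif("MLAAAAGLLLLLG", "LG5")
```

Wrapping the pattern in a lookahead is a design choice: a plain pattern would skip occurrences that overlap a previous match, which would bias the occurrence counts.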

2.4. Information Extraction and Clustering

In this work, a novel approach is elucidated which predicts the topology state of a short sequence motif in membrane proteins. The following steps were completed to realise this approach.

First, all single motif occurrences were identified in the nonredundant sequence set. Using the TMHMM predictions, each motif occurrence was assigned to a topology state as described in Section 2.2. In addition to the defined topology states “TM” and “nTM,” a further state has been defined for this study: each motif occurrence whose beginning and end are located in the different topology states “TM” and “nTM” has been assigned the “trans” state. Subsequently, all variable positions within each motif occurrence were examined more closely. Finally, for each variable position, the relative occurrence of each amino acid at the specified position of each motif was calculated.
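The two steps described in this paragraph, assigning a topology state to a motif occurrence and counting amino acids at each variable position, might be sketched in Python as follows. The helper names are hypothetical, and the treatment of occurrences that touch unassigned residues (they are skipped) is an assumption, since the text does not state how such cases were handled.

```python
from collections import Counter
from typing import Dict, List, Optional


def motif_state(states: List[Optional[str]], start: int, length: int) -> Optional[str]:
    """Topology state of one occurrence, judged from its first and last residue."""
    first, last = states[start], states[start + length - 1]
    if first is None or last is None:
        return None  # Assumption: occurrences with unassigned ends are skipped.
    return first if first == last else "trans"


def position_frequencies(occurrences: List[str]) -> List[Dict[str, float]]:
    """Relative amino acid occurrence at each variable position."""
    # Strip the two anchor residues; only the variable positions are counted.
    variable_parts = [occ[1:-1] for occ in occurrences]
    frequencies = []
    for column in zip(*variable_parts):
        counts = Counter(column)
        frequencies.append({aa: c / len(column) for aa, c in counts.items()})
    return frequencies


freqs = position_frequencies(["LAAAAG", "LAGAAG"])
```

For the two toy LG5 occurrences, the second variable position yields a relative occurrence of 0.5 for both alanine and glycine.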

To define a separation rule for the investigated motifs, an information-based approach was applied. Formally, a motif $M$, for instance LG5, can be interpreted as a set of variable strings with a length of $l$. Intuitively, in case of LG5, $l$ equals 4. To include the membership information of the three topology states, we separated $M$ into three motif subsets $M_{\mathrm{TM}}$, $M_{\mathrm{nTM}}$, and $M_{\mathrm{trans}}$ according to the topology state in which each single motif occurrence is located. Furthermore, in each motif subset $M_t$, each position $i$ with $1 \le i \le l$ can be investigated concerning its amino acid distribution. To this end, interpreting $M_t$ as a set of strings $m = m_1 \ldots m_l$ (all identified motif occurrences found in topology state $t$) allows formulating the relative probability $p_{i,t}(a)$:

$$p_{i,t}(a) = \frac{1}{|M_t|} \sum_{m \in M_t} \delta(m_i, a) \qquad (1)$$

with

$$\delta(m_i, a) = \begin{cases} 1 & \text{if } m_i = a, \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

where $a$ corresponds to one of the 20 canonical amino acids. To weight the significance of each probability $p_{i,t}(a)$, the probability is applied in a log-odd formula:

$$s_{i,t}(a) = \log \frac{p_{i,t}(a)}{q(a)}, \qquad (3)$$

where $q(a)$ denotes the background probability of amino acid $a$.

The amino acid distribution $q$ used to test the significance of the observed relative probability at each motif position was computed from the NCBI nonredundant protein sequence set [32] (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz).

Using these log-odd values, visualisation, clustering, and information extraction can be performed. To this end, we transformed each position $i$ into a vector consisting of log-odd values which we refer to as log-odd profile (LOP) and which is defined as

$$\mathrm{LOP}_{i,t} = \bigl(s_{i,t}(a_1), \ldots, s_{i,t}(a_{20})\bigr). \qquad (4)$$

Clustering all resulting LOPs was finally ensured by implementing a distance formula (5) based on $\rho$, where $\rho$ corresponds to the Spearman’s rank correlation coefficient of two LOPs. Clustering methods were applied to the LOPs to derive characteristics in motifs which determine the protein’s structural and functional features.

Furthermore, with these values at hand, the algorithm for predicting the topology state based on a single motif occurrence $m$ was implemented. Here, the precalculated LOPs of the corresponding motif are employed as look-up values to compute a straightforward winner-takes-all formula:

$$\hat{t}(m) = \operatorname*{arg\,max}_{t \in \{\mathrm{TM},\, \mathrm{nTM},\, \mathrm{trans}\}} \sum_{i=1}^{l} s_{i,t}(m_i). \qquad (6)$$

The assessment of topology state prediction was performed by means of cross-validation and F-measure calculation.
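The log-odd look-up and the winner-takes-all decision can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions and not the implementation used in this study: the logarithm base (2), the pseudocount that avoids taking the logarithm of zero, the uniform background in the toy example, and all function names are assumptions.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odd_profiles(variable_parts: List[str],
                     background: Dict[str, float],
                     pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """One log-odd profile (20 values) per variable motif position."""
    profiles = []
    for column in zip(*variable_parts):
        # Assumption: add-one smoothing keeps unseen amino acids finite.
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profiles.append({
            aa: math.log2((column.count(aa) + pseudocount) / total / background[aa])
            for aa in AMINO_ACIDS
        })
    return profiles


def predict_state(variable_part: str,
                  profiles_by_state: Dict[str, List[Dict[str, float]]]) -> str:
    """Winner-takes-all: the state with the largest sum of log-odd values."""
    scores = {
        state: sum(profile[aa] for profile, aa in zip(profiles, variable_part))
        for state, profiles in profiles_by_state.items()
    }
    return max(scores, key=scores.get)


# Toy training data; a uniform background stands in for the NCBI nr frequencies.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
trained = {
    "TM": log_odd_profiles(["LLLL", "LLIL", "LVLL"], uniform),
    "nTM": log_odd_profiles(["KDEK", "SDEK", "KDNK"], uniform),
}
```

With these toy profiles, a hydrophobic variable part such as `LLLL` is assigned to “TM” and a charged one such as `KDEK` to “nTM.” In a cross-validation setting, the occurrences being evaluated would be excluded before `log_odd_profiles` is called.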

By utilising clustering methods, differences and similarities of all LOPs can be visualised and analysed in detail.

For dimensionality reduction and, finally, data clustering of the 20-dimensional LOP data, we used the unweighted pair group method with arithmetic mean (UPGMA) [33] and the exploratory observation machine (XOM) [34]. This analysis helps to understand the correspondences between physicochemical properties observed in LOPs and topology states. Furthermore, this analysis reinforces the observed predictability of topology states. We chose UPGMA as a visualisation approach, since it is a widely used bottom-up clustering method that can be understood intuitively.
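A compact, dependency-free Python sketch of rank-correlation-based UPGMA clustering is given below. It assumes the distance between two LOPs to be 1 − ρ, which is one common way to turn Spearman’s rank correlation coefficient into a distance and not necessarily the exact formula used in this work. The function names and the four-dimensional toy vectors are likewise hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed distance 1 - rho; rho is the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Note: constant vectors (zero rank variance) are not handled here.
    return 1.0 - cov / (var_x * var_y) ** 0.5


def upgma(profiles: Dict[str, Sequence[float]]) -> List[Tuple[str, str, float]]:
    """Return the merge order as (cluster_a, cluster_b, average distance)."""
    clusters = {name: [name] for name in profiles}
    merges = []

    def average_distance(a: str, b: str) -> float:
        pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
        return sum(spearman_distance(profiles[p], profiles[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: average_distance(*ab))
        merges.append((a, b, average_distance(a, b)))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges


toy = {"tm1": [3, 2, 1, 0], "tm2": [4, 3, 2, 1], "ntm1": [0, 1, 2, 3]}
tree = upgma(toy)
```

The naive implementation recomputes all pairwise distances in every round, which is adequate for a few hundred profiles; for larger inputs a cached distance matrix would be preferable.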

The XOM algorithm is a relatively new method for dimensionality reduction. A great advantage lies in its visualisation capabilities, since it can transform neighbourhood or distance relations embedded in multidimensional data into human-intelligible spaces, such as a two- or three-dimensional Euclidean space. In the literature, this property is referred to as topology-preserving mapping. However, the degree of topology preservation achieved by the XOM depends on the given problem (mainly influenced by the structure of the data and the applied distance measure), and thus the XOM output can be insufficient for analysis. In application to LOP data, however, it has been shown to perform more than satisfactorily. Further visualisations were obtained by generating heat maps.

3. Results and Discussion

3.1. Identification of Topology-Discriminative Positions

The identification of topology-discriminative positions in motifs is crucial for drawing meaningful correlations between physicochemical properties and structural and functional features. A straightforward approach to address this task is to use a method that determines the residue conservation at each variable motif position. WebLogo [31], for instance, is a widely used method to address such problems. However, WebLogo does not include any amino-acid-specific background information in deriving residue conservation, since natural amino acid frequencies are not taken into account. To circumvent this problem, we used LOPs for visualisation instead, which, as shown in (4), include natural amino-acid-specific background probabilities. Essentially, this approach is quite similar to the methods recently described in [36]. Single LOPs can be visualised as heat maps [37] (see Figure 3), so that amino-acid-specific propensities at each variable position in each motif can be extracted.

3.2. LOP Visualisation and Classification

The LOP heat map depicted in Figure 3 shows, by way of example, the apparent amino-acid-specific propensities according to the three topology states. Here, increasing amino acid propensities defined in (3) are illustrated by increasing red colour content. In comparison to the WebLogos (Figure 2), distinct amino acid propensities become obvious. For instance, glycine is observed more frequently in all LG5 motifs which are located in transmembrane regions. In nontransmembrane regions, the propensity of glycine is found to be distinctly reduced. As a second example, in LG5 motifs found in transmembrane regions, leucine is observed more frequently at the third variable position than at other positions. This sequence constellation results in two spatially adjacent leucine residues that form a bulky helix surface. In general, relations between topology states and the amino-acid-specific propensities can be derived. This emphasises the predictability of topology states based on single motif occurrences. The full LOP heat map generated by this approach consists of 471 motif positions. To visualise LOP-wide correspondences, we applied UPGMA hierarchical clustering as well as the XOM algorithm. The distance between LOPs was measured using (5). Since 471 variable motif positions were investigated, the UPGMA tree generated by the first approach consists of 471 nodes. To ease the analysis of the tree, the nodes were coloured according to the topological state in which the corresponding motif is located. Due to the large number of nodes, we depicted the tree only as a schematic representation which shows the observed general tree topology and identified memberships (see Figure 4). As shown, a distinct clustering according to the topology states, more precisely a formation of three distinct subtrees, is obvious. The cluster arrangement correlates with the physicochemical properties found in membrane and nonmembrane located regions, since greater LOP distances are mainly dictated by the propensities of hydrophobic, hydrophilic, and polar amino acids. The subtree mainly consisting of motifs located in “trans” regions is arranged in between, which points to intermediate physicochemical motif compositions and equally distributed amino acid compositions. Similar to these findings, the XOM output (see Figure 5) shows three main clusters which also correspond to the topology states. Additionally, the cluster arrangement is found to be equal to the arrangement observed in the UPGMA tree, and the causes of cluster formation are analogous as well. The distinct cluster formation observed in the output of both methods points to a good separability of the variable motif positions.

A possible approach to predict the topology state of a motif from the amino acid sequence alone was implemented as described in Section 2.4. In this calculation, for each motif, the three log-odd sums of all variable positions are computed with respect to the three topology states. The highest log-odd sum determines the winning topology state (see (6)). Cross-validation was performed by excluding the evaluation set of motifs from the training motif set, which was used to generate the look-up log-odd values. In the process, each winning topology state has been assessed by F-measure. The corresponding F-measures for each investigated sequence motif are listed in the result Tables 1, 2, and 3. It is apparent from these tables that there are motifs with high and with rather small F-measures. Each F-measure value indicates how well or poorly a motif can be separated and assigned to the respective topology state. For example, the LY6 motif, with an F-measure of 0.8 in all result tables, is well assignable (by (6)) to each topology state.
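For reference, a per-state F-measure can be computed as in the following Python sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall), since the weighting is not specified in the text; the function name and the toy labels are hypothetical.

```python
from typing import Sequence


def f_measure(true_states: Sequence[str], predicted_states: Sequence[str], state: str) -> float:
    """Balanced F-measure of the predictions for one topology state."""
    pairs = list(zip(true_states, predicted_states))
    tp = sum(t == state and p == state for t, p in pairs)
    fp = sum(t != state and p == state for t, p in pairs)
    fn = sum(t == state and p != state for t, p in pairs)
    if tp == 0:
        return 0.0  # Convention chosen here: no true positives gives F = 0.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


reference = ["TM", "TM", "nTM", "trans"]
predicted = ["TM", "nTM", "nTM", "trans"]
f_tm = f_measure(reference, predicted, "TM")
```

In this toy example one of two “TM” occurrences is recovered without false positives, so precision is 1, recall is 0.5, and the F-measure is 2/3.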

3.3. Evaluation of the Prediction Accuracy

To evaluate the prediction accuracy, our new approach has been applied to three datasets. The first dataset (EDS1) consists of the sequence information of the DUF families described in Section 2.1. The second dataset (EDS2) consists of 2254 membrane protein sequences with 55 known structures of the bacteriorhodopsin-like protein family (PF01036). EDS2 was also obtained from the Pfam database [27]. For EDS1 and EDS2, the topology-specific occurrence statistics of each motif were generated from TMHMM information. These statistics are listed under the “TMHMM prediction” table heading, followed to the right by the information predicted by our approach (see (6)). The prediction quality is given by the respective F-measures. Comparing the number of statistically determined motif occurrences with the predicted ones shows how well our approach works for most motifs. For the proteins from the DUF families as well as for the bacteriorhodopsin-like protein family, our approach works well for the majority of motifs. Deviations can be traced back to motifs with different functions. Furthermore, our approach has been transferred to known structures. The third evaluation dataset (EDS3) consists of all known alpha-helical membrane proteins with structures obtained from PDBTM [13]. It is important to note that the results for EDS3 only include PDBTM protein information. That is, each motif occurrence has been annotated with one of the three topology states “H,” “Side1,” and “Side2,” in which “H” stands for alpha-helical structure and both Side states refer to the outside or inside of the membrane. Here, “H” can be equated with “TM” because “H” includes only alpha-helical information referring to the interior of the cell membrane. Both Side states can be equated with “nTM.” The “trans” state is not included here because less membrane information is available. This means that we have separated each motif $M$ into three motif subsets $M_{\mathrm{H}}$, $M_{\mathrm{Side1}}$, and $M_{\mathrm{Side2}}$ according to the topology state in which each single motif occurrence was located. Further calculations based on these motif subsets were carried out as described in Section 2.4. The results in Table 4 show that our approach can be applied to known structures. The topology-specific motif occurrence statistics are listed under the “PDBTM prediction” table heading, followed to the right by our predicted information.

4. Conclusion

In this work, 33 short sequence motifs reported in [18] were investigated in 32 polytopic membrane protein families with domains of unknown functions. Transmembrane and nontransmembrane sequence regions were predicted using the TMHMM method [38], and topology states were annotated to all detected sequence motif occurrences. Amino acid propensities were derived and employed to define log-odd profiles (LOPs) of all variable sequence positions in the investigated motifs. Propensity tendencies according to the topology states were identified using UPGMA and XOM clustering. Both methods pointed to good separability and predictability of the topology state of a motif from its amino acid sequence. An information-based prediction algorithm was implemented and assessed using cross-validation and F-measure evaluation. Motifs showing high F-measures over all or only in certain investigated protein families were identified. From this insight, we postulate that short sequence motifs can be divided into two groups. General, structure-forming elements are present in numerous protein families and are highly specific to their topology location, but they are probably less important for functional properties. In contrast, motifs showing high F-measures only in certain membrane protein families may be important elements in establishing the individual properties which are necessary for the function of an entire protein family.

In addition, information on the spatial structure and the folding of the proteins under investigation can be evaluated by affinities, because the spatial structure of proteins has been more strongly conserved in evolution than the sequence composition of the folded protein chains. There are individual motifs or characteristic sequence parts which determine a certain biochemical function of proteins. Why does nature pursue the principle of separating structure and function? Residues which support a stable domain folding are separated from those that induce a specific function. This is a very efficient evolutionary strategy. Two areas were simultaneously optimised [39]: (i) the stability of the protein backbone in a given folding pattern and (ii) the design of the amino acid sequence according to a specific function. Based on this information, further work will address how evolution has produced motifs in their function as structural building blocks. In addition, motifs that originated through evolution and interact spatially with other motifs should be identified as structure stabilising.

Acknowledgment

The authors would like to thank the Free State of Saxony and the European Social Fund (ESF) for the financial support.