The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein length (L) and protein intrinsic disorder content (ID). The protein distributions based on L and ID are surveyed for representative proteomes and protein sets from the three domains of life. Two-dimensional maps (designated as fingerprints here) are then constructed from the protein distribution densities in the LD space defined by ln(L) and ID. The fingerprints for different organisms and protein sets are found to be distinct from one another, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison or alignment. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.

1. Introduction

Determination of complete genome sequences for a number of organisms has offered an unprecedented opportunity for the biological community and has transformed biology into a discipline that depends significantly on how to classify and interpret large-scale data sets and to extract biological insights from them. The traditional ways of thinking and approaches from the pregenomic era (e.g., sequence comparison/alignment and homology identification) remain of fundamental importance in the postgenomic era. Nevertheless, new approaches based on some global features of omics data sets need to be explored in order to make classification and comparison of large-scale data sets easier. For proteomes, this may be achieved, for instance, through identification of key parameters or attributes of proteins and comparison of protein distributions within complete proteomes of different organisms or protein sets in terms of such parameters or attributes.

In this paper, we adopt this approach and use two parameters of proteins to classify complete proteomes of different organisms (for simplicity, proteomes) and protein sets: the length of the protein amino acid (aa) sequence (protein length L hereafter) and the intrinsic disorder content (protein disorder ID hereafter). It has been proposed that protein sizes, folding rates, and many other physical properties could be associated with or even determined by L [1, 2]. At the level of proteomes, previous studies have suggested that eukaryotic proteomes may have longer proteins on average than prokaryotic proteomes [3, 4], even though further analysis may still be necessary. The importance of intrinsically disordered proteins (IDPs) and protein regions (IDPRs) has been recognized [5–13], and it has been observed that eukaryotic proteins may have higher contents of intrinsic disorder than prokaryotic proteins [14]. Moreover, proteins expressed in two eukaryotic organelles, chloroplasts and mitochondria, which evolved from cyanobacteria and alphaproteobacteria, respectively, seem to have a lower disorder content, on average, compared to nuclear-encoded proteins in their host eukaryotes [15]. Interestingly, it has been demonstrated that intrinsically disordered proteins are associated with a variety of human diseases [16, 17], including cancers [18, 19]. As a result, intrinsically disordered proteins have become important targets for drug design [20–25]. Thus, understanding intrinsically disordered proteins at the proteomic level would be of considerable interest. The observations that the distributions of proteins in terms of ID and L may differ among proteomes and among protein sets suggest that such distributions may be used to classify proteomes of different organisms or protein sets. They may also be used in the future to help understand the properties of proteomes in different disease states, as there seems to be a wide variability of predicted disorder among different diseases [26]. Interestingly, a recent study revealed that the overall disorder fractions are positively correlated with the size of the proteomes (estimated by the total number of aa) and that the disorder fractions of the proteomes of large bacteria (more than 2.5 M aa) are comparable to those of eukaryotes [27].

Here we analyze the protein distributions in terms of L and ID from proteomes of different organisms across the three domains of life, collective data sets of organelles (plasmids, chloroplasts, and mitochondria), and the proteome data of two giant DNA viruses (termed giruses in the literature). We noticed that the eukaryotic proteomes do not always have longer proteins on average than the prokaryotic proteomes. Our observation on protein disorder agrees well with the previous finding that the average disorder contents in eukaryotic proteins are indeed higher than those in prokaryotic proteins. Two-dimensional maps (designated as fingerprints here) were constructed from the protein distribution densities in the LD space defined by ln(L) and ID for the representative proteomes of different organisms and protein sets, and these fingerprints show distinct patterns for different organisms and protein sets. The features and relationships among the fingerprints are analyzed and compared. To test whether the classification of proteomes and protein sets proposed here is meaningful, we generated phylogenetic trees based on the protein distribution densities in the fingerprints of proteomes of different organisms without performing any protein sequence comparison or alignment. The phylogenetic trees generated in this way were found to be meaningful, as they contain important evolutionary information. Thus, the proposed approach may represent a useful and simple way for proteome classification and comparison. In the present study, only the primary protein has been used for each protein-encoding gene locus; therefore, the protein densities (Figure 1 and Figure S1) could be regarded as gene densities. Moreover, using the poplar proteome as an example, it was found that the phylogenies show little difference with or without alternative splicing proteins (Figure S3). The possibility of extending this approach through the introduction of additional attributes is also discussed.

2. Results

2.1. Protein Distributions in Terms of L and ID

Here, we discuss the proteins (811,600 entries in total) from the proteomes of different organisms and protein sets listed in Table 1, with the protein lengths varying over more than three orders of magnitude, from 5 aa (Os06g47230 of rice) to 34,350 aa (titin of human). For the protein length comparison, as pointed out previously [4], the median length is a better measure than the average length because it avoids biases from extremely long proteins. Table 1 lists both the median and average lengths of all the proteomes and proteins from gene sets. It should be pointed out that in the present analysis, only the primary protein at each gene locus is selected. This allows a significant simplification of proteome classification. This approximation seems to be reasonable for the main purpose of this work, as there is little difference in the results for the test cases with or without alternative splicing proteins. Table 1 shows that the eukaryotic proteomes do not always have longer proteins on average than the prokaryotic proteomes, although this has been suggested previously [3, 4]. For instance, the basal flowering plant Amborella trichopoda has a median protein length of 218 aa, shorter than that of all prokaryotes (Archaea and bacteria) surveyed here. In addition, Giardia intestinalis in the Eukaryota domain has an even shorter median protein length of 147 aa. The average L values show the same trend as the median values (Table 1).

Nevertheless, the proteins in a eukaryotic proteome do have a significantly higher intrinsic disorder on average (41.1 ± 6.4%) compared to those in a prokaryotic proteome (15.6 ± 6.5%), consistent with previous studies [14, 28]. This trend also holds for the average disorder contents of all residues from the proteomes (47.5 ± 6.4% for eukaryotes compared to 32.9 ± 1.4% for prokaryotes). Proteomes from the archaeon N. equitans and the bacterium Rickettsiales have the lowest disorder content at the protein level (7.0% for N. equitans and 7.7% for Rickettsiales) among the systems examined. As the smallest known archaeon, N. equitans is an obligate symbiont of another archaeon, I. hospitalis, which is the smallest known free-living archaeon [29]. The free-living alphaproteobacterium Rickettsiales, on the other hand, was suggested to be a living candidate that is close to the ancient endosymbiotic alphaproteobacteria that were merged into an archaeon and eventually transformed into the mitochondria of the first eukaryotic cell [30]. These two symbiotic or presymbiotic organisms have retained more ordered proteins compared to the other free-living bacteria and Archaea surveyed here.

Consistent with previous studies [15], the proteins from the mitochondrion (88,405 proteins from 6119 species) and chloroplast (80,807 proteins from 935 species) sets have relatively low disorder contents compared to the proteins encoded in nuclear genes of eukaryotic organisms. For example, the mitochondrial protein set has a considerably lower disorder content of 8.6% at the protein level. The mitochondria have lost most of their ancestral genes either through transfer to the nucleus or through loss [31]. Here, we show that the mitochondrial proteins have relatively low disorder contents (i.e., they are highly ordered) at both the protein and the residue levels (Table 1). The genes retained in the mitochondrial genomes have been proposed to preferentially encode core proteins involved in electron transfer [32], and a colocalization for redox regulation (CoRR) mechanism was proposed to explain why the mitochondrial and chloroplastic organelles retain their own genes, or proteins [33, 34]. Our analysis indicates that the proteins encoded by chloroplast genes have disorder contents close to those of the free-living prokaryotes, but higher than those of the symbiotic archaeon N. equitans, the alphaproteobacterium Rickettsiales, and the mitochondrial set (Table 1).

The proteomes of two giant DNA viruses (giruses), the Mimivirus and Pandoravirus, were also analyzed. The numbers of proteins encoded in these two giruses are comparable to the prokaryotic proteomes. The disorder content of the proteome of the Mimivirus is larger than that of the prokaryotes, but smaller than that of the eukaryotes surveyed here. However, the Pandoravirus has a proteome with disorder content close to that of the eukaryotes.

Finally, the viral and plasmid gene sets were analyzed. The viral gene set contains 237,463 genes collected from 4942 strains and the plasmid set contains 95,214 genes collected from 985 bacteria. Interestingly, the proteins from these two sets show similar trends in both length and disorder distributions.

2.2. Definition of the LD Space

Consistent with a previous report [3], exponential distributions of the protein lengths (L) have been observed in all proteomes and protein data sets. In this analysis, all proteins of a proteome or protein set have been ranked from the shortest to the longest, and the ranked proteins then distribute approximately linearly on ln(L) (the natural logarithm was used in this study). A similar linear trend is observed for ID, the percentage of residues located in the IDPRs (Figure 2). Therefore, a two-dimensional LD space can be defined with one dimension for the content of protein intrinsic disorder, ID, and the other dimension for the logarithm of the protein length, ln(L). Figure 2 exemplifies the protein distribution in the LD space for the human proteome.
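The ranking argument above can be checked numerically. The following is a minimal sketch, not the procedure used in this work: it sorts ln(L), fits a least-squares line against rank, and reports the coefficient of determination. The function name `rank_linearity` and the toy lengths are hypothetical.

```python
import math

def rank_linearity(lengths):
    """Fit ln(L) against rank (1 = shortest); return (slope, intercept, r_squared)."""
    y = sorted(math.log(length) for length in lengths)
    n = len(y)
    if n < 2:
        raise ValueError("need at least two proteins")
    x = range(1, n + 1)
    mean_x = (n + 1) / 2
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r_squared is 1 when ln(L) is exactly linear in rank
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical toy lengths that double at each rank, so ln(L) is exactly linear.
toy_lengths = [100 * 2 ** k for k in range(6)]
slope, intercept, r2 = rank_linearity(toy_lengths)
```

For real proteomes, an r_squared noticeably below 1 would indicate that the linear approximation is weak for that data set.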

2.3. Dependency of the Two Attributes for the LD Space

We defined a two-dimensional LD space with the two attributes, ln(L) and ID, and these two attributes need to be independent of each other. Therefore, we calculated the correlation coefficients (CCs) between ln(L) and ID of proteins in all proteomes and protein sets (Figure 3). Pearson’s and Spearman’s CCs for all proteins (811,600 entries, Tables 1 and S1) are −0.101 and −0.129, respectively. The slightly negative overall CC (anticorrelation) indicates that shorter proteins may tend to have higher disorder contents on average than longer proteins. However, the anticorrelation does not hold for all species surveyed in this study, and positive CC values were also found, for example, in the animals (human and fruit fly) and the green alga C. reinhardtii (Table S1). The variations in the correlation between ln(L) and ID, therefore, may have been driven by evolutionary processes rather than by a cause-and-effect relationship. As such, the validity of the protein LD space and the related architecture of protein distributions in the LD space (i.e., the “fingerprint”) should be discussed in an evolutionary framework (see below).
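The two correlation coefficients used in this test can be computed as sketched below. This is an illustrative pure-Python version with hypothetical toy values, not the code used in this work. Spearman's coefficient is obtained as Pearson's coefficient of the ranks, with tied values sharing their mean rank.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy proteins: lengths (aa) and disorder fractions.
lengths = [80, 150, 300, 600, 1200]
disorder = [0.70, 0.40, 0.35, 0.20, 0.25]
ln_l = [math.log(v) for v in lengths]
r_pearson = pearson(ln_l, disorder)
r_spearman = spearman(ln_l, disorder)
```

In this toy set the shorter proteins are more disordered, so both coefficients are negative, mirroring the sign (not the magnitude) of the overall values reported above.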

2.4. Architecture of Protein Distribution (Fingerprint) in the LD Space

The most thoroughly annotated animal and plant genomes may be those of human (H. sapiens) and Arabidopsis thaliana, respectively. Using the proteomes from these two representative species, the protein distributions in the LD space were converted to the protein-density contour maps in Figure 1(a) (see Materials and Methods). As we will show below, this approach may be useful in comparative proteomics/genomics.

At first glance, the plant proteome has more proteins of medium length (~5.7 < ln(L) < 6.4 or ~300 < L < 600) and relatively low disorder content (ID < 0.3), whereas the animal proteome contains more long and disordered proteins (e.g., L > 600 and ID > 0.5). This may partly explain the slightly positive correlations between ln(L) and ID in the animal proteomes but negative correlations in the plant proteomes. The protein distribution contour maps of other proteomes and gene sets can be found in Figure S1 in Supplementary Materials and appear in trimmed form in the phylogenetic tree in Figure 4 (see below).

It is straightforward to visualize the differences between these two proteomes using the differential contour map in Figure 1(b). The H. sapiens proteome has 657 short proteins (i.e., L < 100 or ln(L) < 4.6), among which 294 (1.5% of all proteins) are considered disordered (ID > 0.5); in the A. thaliana proteome, 888 (3.2%) out of 2292 short proteins are disordered. On the other hand, in the H. sapiens proteome, 1135 (5.6%) out of 2384 long proteins (i.e., L ≥ 1000 or ln(L) ≥ 6.9) are disordered, whereas in the A. thaliana proteome, 306 (1.1%) out of 1157 long proteins are disordered. Therefore, a significant difference between the animal (H. sapiens) and the plant (A. thaliana) is that the former has more long disordered proteins, whereas the latter has more short disordered proteins. This difference, shown in Figure 1(b), allows us to narrow down the protein/gene distributions related to the architectural differences between the two organisms.
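The tallies quoted above follow from three cutoffs stated in the text: short means L < 100, long means L ≥ 1000, and disordered means ID > 0.5. A minimal sketch of such a tally is given here; the function name `tally_short_long` and the toy proteins are hypothetical and are not taken from either proteome.

```python
def tally_short_long(proteins, short_below=100, long_from=1000, id_cutoff=0.5):
    """Count short and long proteins and how many of each are disordered.

    proteins: iterable of (length_in_aa, disorder_fraction) pairs.
    """
    counts = {"short": 0, "short_disordered": 0, "long": 0, "long_disordered": 0}
    for length, id_fraction in proteins:
        disordered = id_fraction > id_cutoff
        if length < short_below:
            counts["short"] += 1
            counts["short_disordered"] += int(disordered)
        elif length >= long_from:
            counts["long"] += 1
            counts["long_disordered"] += int(disordered)
    return counts

# Hypothetical toy proteome: (length, ID) pairs.
toy = [(50, 0.8), (99, 0.2), (100, 0.9), (1000, 0.6), (2500, 0.5)]
toy_counts = tally_short_long(toy)
```

Note that a protein of exactly 100 aa is neither short nor long under these cutoffs, and an ID of exactly 0.5 is not counted as disordered here.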

A recent report also indicates that the overall disorder content of the A. thaliana proteome is lower than that of the H. sapiens proteome [35], and it was proposed there that more IDP genes functioning in environmental adaptation may have been enriched in plants [35]. Based on our analysis and the apparent abundance of short disordered proteins in A. thaliana compared to H. sapiens (Figure 1(b)), we focus on the 888 short (<100 aa, see above) IDP (sIDP) genes of A. thaliana. Among these genes, the GO annotations of 203 sIDPs could not be identified; that is, they may be considered part of the “dark matter” of the A. thaliana proteome [36]. However, among the 685 annotated sIDPs (occupying 545 GO terms), only 20 (~2% of all sIDPs) with 32 GO annotations were included in the previous analysis showing “enrichment” of 74 GO annotations related to environmental adaptation in A. thaliana compared to H. sapiens [35]. Based on our analysis, this enrichment might not be significant for the sIDPs. We suggest that in animals and other organisms (e.g., the green alga C. reinhardtii), some of the sIDPs may have been lost whereas long IDPs were enriched. Here, GO annotations of the plant genes were adopted from the plant comparative genomics database PLAZA 3.0 [37].

2.5. Phylogeny Reconstructed Based on Protein Distribution Densities in the LD Space

As a first test of whether our classification of proteomes and protein sets is biologically reasonable, we generated phylogenetic trees based on the protein distribution densities in the fingerprints of proteomes without performing any protein sequence comparison or alignment. Here, to quantify the architectural differences among proteomes, the LD space was divided into M × N blocks, and the distance between two species A and B was then calculated using a Euclidean-type formula based on the protein distributions in all blocks (see (3) in Materials and Methods). In this architectural-distance calculation, no rigorous biological function annotations and/or genomic comparisons using BLAST or other protocols are required.

By dividing the LD space with M = N = 10 (Table 2), the distance matrices for all proteomes including those from giruses (Table 1) were calculated and converted to phylogenetic trees as shown in Figure 5. We also tested 5 × 5 and 2 × 2 partitionings; the 10 × 10 partitioning of the LD space seems to yield relatively high accuracy (Table S2 and Figure S2 in Supplementary Materials). Nevertheless, some of the key properties are not very sensitive to the M and N values. Several interesting features have been found in the trees that we reconstructed: (1) the eukaryotes are clearly separated from the prokaryotes, and (2) plants and animals are grouped together, and even the eudicot plants (A. thaliana and P. trichocarpa) and monocot plants (O. sativa and A. comosus) are separated. The tree in Figure 5 correctly puts A. trichopoda before the other plant species and after P. patens. Interestingly, the tree is consistent with our understanding of the plants-fungi-animals phylogenetic relationships [38] and stays within the framework of the natural classification of the three domains of life [39]. Based on the phylogenetic tree, the definition of the protein LD space might be considered meaningful for the proteomes, at least for those chosen in the present work.

3. Discussion

To the best of our knowledge, this is the first classification of proteomes and protein sets based on the protein distribution densities in the LD space (fingerprints), and a detailed comparison with previous work is therefore not straightforward. Nevertheless, the survey of protein distributions in terms of each of the two attributes is consistent with previously published work. We noticed that the eukaryotic proteomes do not always have longer proteins on average than the prokaryotic proteomes. Our observation on protein disorder agrees well with the previous finding that the average disorder contents in eukaryotic proteins are indeed higher than those in prokaryotic proteins. We have also generated phylogenetic trees based on the protein distribution densities in the fingerprints of proteomes, and this allows us to compare the results obtained here with existing knowledge in the field and to examine the consistency and differences with earlier investigations. Such comparison may also provide alternative views generated through this approach.

3.1. Giant DNA Viruses and the Tree of Life

It has been debated over the years whether viruses should be included in the tree of life [40, 41] or whether they are alive at all [42, 43]. The discovery of Mimivirus [44], which belongs to the nucleocytoplasmic large DNA viruses (NCLDV), and the subsequent discoveries of other giant DNA viruses (giruses) [45], for example, the Pandoravirus with a genome size exceeding that of some cellular organisms [46], raised the questions of whether a “fourth domain” should be added to the tree of life [46, 47] and of the potentially important roles that viruses played in eukaryogenesis [48]. Interestingly, we found that Mimivirus is located between the Eukaryota and prokaryote (Archaea + Eubacteria) branches, that is, at the prokaryote-to-eukaryote transition zone. This is consistent with the original phylogenetic analysis inferred from seven universally conserved protein sequences [44]. The Pandoravirus, on the other hand, is located within the Eukaryota branch. The vast majority (>93%) of the Pandoravirus genes exhibit no homology to anything known [46]; however, our approach puts it in the same branch as the parasite Giardia (Figure 5(c)), owing to the abundance of short proteins (in both ordered and disordered states) in these two organisms (Figure S1).

3.2. Organelles

The phylogenetic tree with the viral and organelle (mitochondria, chloroplasts, and plasmids) gene sets is shown in Figure 4 along with the fingerprints in the LD space. In this tree, the viral gene set is located in the same branch as the Pandoravirus. The plasmid gene set is located in between prokaryotic and eukaryotic branches, or more accurately, between Mimivirus and Pandoravirus. These results suggest the importance of horizontal gene transfers in eukaryogenesis carried by the viral and plasmid genes.

In Figure 4, the mitochondrial gene set sits in the same branch as the symbiont N. equitans and the alphaproteobacterium Rickettsiales, because the majority of the proteins in these proteomes and this protein set are highly ordered (Table 1). The chloroplast set is located in the same branch as the viral gene set and Giardia (Figure 4). Using the full set of annotated mitochondrial genomes for 2015 species, a recent report [32] revealed that the proteins retained in the eukaryotic mitochondria are preferentially the structural cores of the electron transport chains. Our survey of the mitochondrial proteins obtained from the NCBI database indicates that the mitochondrial proteins are mainly structurally ordered (Figure 6(a)) and thereby possibly structurally and functionally conserved, too. However, using the model plant species A. thaliana as an example, the mitochondrial protein distribution in the LD space (Figure 6(b)) does not match that from the mitochondrial gene set (Figure 6(a)). This inconsistency may originate from a considerable number of highly disordered proteins retained in the mitochondria. For instance, A. thaliana has 115 mitochondrial genes, 23 of which are IDPs (i.e., ID ≥ 0.5; here, ID refers to the ratio of residues). However, we found that 19 (out of 23) mitochondrial IDPs have unknown functions involved in unknown biological processes (Table S3 in Supplementary Materials), which raises a question about the validity of the results obtained from annotated mitochondrial genomes (Figure 6(a) in the present study and [32]). The protein distribution profile of the A. thaliana chloroplast (Figure 6(d)) resembles that of the collective chloroplast gene set (Figure 6(c)). Only 6 out of 85 A. thaliana chloroplast proteins are IDPs, all of which have been annotated as ribosomal proteins (Table S3).

4. Conclusion

Our two-dimensional contour maps (or proteome fingerprints) based on the protein distribution densities in the LD space show distinct patterns for different organisms and protein sets and may therefore be used for classification of proteomes and protein sets. The phylogenetic trees generated from the protein distribution densities in the fingerprints were found to be meaningful, as they seem to contain important evolutionary information. Thus, the proposed approach and its further extension may represent a useful alternative way for proteome classification and comparison. It should be pointed out that although in the present work we used protein length (L) and protein intrinsic disorder content (ID) as the basic attributes, other attributes (not limited to those of proteins) may be introduced as well. A useful new attribute should be one for which the protein distributions differ among proteomes (protein sets), so that the purpose of classifying proteomes (protein sets) can be achieved.

5. Materials and Methods

5.1. Proteomes and Gene Set

The plant proteomes in this study were downloaded from Phytozome, and the proteomes of bacteria, Archaea, and animals were downloaded from UniProt; the organelle protein sets were obtained from NCBI, at or before December 2016.

Here, we surveyed 12 eukaryotic proteomes from two animal species Homo sapiens [49, 50] and Drosophila melanogaster [51], two monocot plant species Oryza sativa L. ssp. indica [52] and Ananas comosus [53], two dicot plant species Arabidopsis thaliana [54] and Populus trichocarpa [55], the basal angiosperm Amborella trichopoda [56], the moss Physcomitrella patens [57], the fungus Saccharomyces cerevisiae strain S288C [58], the green alga Chlamydomonas reinhardtii [59], the metamonad Giardia (previously known as an Archezoa that lacks a conventional mitochondrion) [60], and Monocercomonoides sp. PA203 that completely lacks mitochondrial or mitochondrial-derived genes [61]. We also analyzed three bacterial species, Escherichia coli K12 MG1655 [62], the cyanobacterium Synechococcus elongatus PCC 7942 [63], and the alphaproteobacterium Rickettsiales bacterium Ac37b [64], and three Archaea species, Ignicoccus hospitalis kin4/i, Nanoarchaeum equitans [29], and Lokiarchaeum sp. GC14_75 [65]. Two giant DNA viruses (giruses) were also analyzed, namely, the Mimivirus [44] and Pandoravirus salinus [46]. In addition, we downloaded several gene collections from the NCBI gene libraries containing the viral set (237,463 genes), plasmid set (95,214 genes), mitochondrial set (88,405 genes), and chloroplast set (80,807 genes). Table 1 gives a summary of the proteomes and gene sets.

The proteomes and gene sets listed above comprise 811,600 proteins, among which 2401 proteins (~0.3%) contain unknown “X” residues and were excluded from the analysis in this work.

It should be pointed out that in the present analysis, only the primary protein at each gene locus is selected. The poplar (P. trichocarpa) proteome [55] was selected to test the potential influence of proteome versions and splicing alternatives. Three versions (v01, v02, and v03) of the proteome are available for the P. trichocarpa genome, of which the v03 proteome has 41,434 primary proteins and 31,579 splicing alternatives (73,013 proteins in total). Using the primary proteins of all three versions and the full proteome of the v03 version as separate entries, a phylogenetic tree was constructed (Figure S3 in Supplementary Materials); there is little difference with or without alternative splicing proteins or among the different proteome versions.

5.2. Intrinsic Disorder (ID) Prediction

The PONDR-VSL2 algorithm [66] was applied to predict the disorder state of every residue in a protein. This program has achieved ~81% accuracy for both short and long proteins. By default, a residue is in an ordered state if its PONDR score is less than 0.5 and in a disordered state if its PONDR score is larger than or equal to 0.5. PONDR scores of 0 and 1 correspond to the fully ordered and fully disordered states, respectively. Here, this criterion was adopted and extended to calculate the ID content of a protein:

$$\mathrm{ID}_{\mathrm{pep}} = \frac{N_D}{L}, \quad (1)$$

where $N_D$ is the number of disordered residues and $L$ is the total number of residues of the protein (i.e., protein length). IDpep is also termed the “rough definition” of the disorder content in [27] and ranges from 0 to 1, with 0 and 1 corresponding to fully ordered and fully disordered proteins, respectively.
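The per-protein disorder content described above (number of disordered residues divided by protein length) can be sketched as follows. This is an illustration only: the per-residue scores are assumed to be already available as a list of floats, the function name `id_pep` is hypothetical, and no predictor is called.

```python
def id_pep(residue_scores, threshold=0.5):
    """Fraction of residues whose disorder score is >= threshold.

    residue_scores: per-residue disorder scores in [0, 1] for one protein
    (assumed to come from a predictor such as PONDR-VSL2).
    """
    if not residue_scores:
        raise ValueError("protein has no residues")
    n_disordered = sum(1 for score in residue_scores if score >= threshold)
    return n_disordered / len(residue_scores)

# Hypothetical scores for a four-residue toy peptide.
toy_scores = [0.1, 0.5, 0.9, 0.3]
toy_id = id_pep(toy_scores)  # two of four residues are at or above 0.5
```

A score of exactly 0.5 counts as disordered, following the criterion stated in the text.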

It has been suggested that the total proteome information content (PIC) could be defined as the total number of amino acids of the primary proteins (longest isoform at each gene locus) that the proteome carries [67]. In accordance with this definition, we also calculated the average intrinsic disorder content per residue as

$$\mathrm{ID}_{\mathrm{res}} = \frac{1}{X}\sum_{i=1}^{X} D_i, \quad (2)$$

where $X$ is the total number of amino acids and $D_i$ is the PONDR score of the $i$th residue of the proteome or protein set. IDres corresponds to the definition adopted in [27]. Both IDpep and IDres are listed in Table 1. Because distributions of genes (or proteins) are used in the present work to discuss the evolutionary dynamics, IDpep (simplified as ID in the main text) was chosen as one of the attributes of the LD space.
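The residue-level average differs from the per-protein content in that it pools the scores of all residues in a proteome, so long proteins weigh more. A minimal sketch, assuming the per-residue scores are supplied as one list per protein (the name `id_res` and the toy values are hypothetical):

```python
def id_res(proteome_scores):
    """Mean per-residue disorder score pooled over all proteins.

    proteome_scores: iterable of per-protein score lists; every residue
    contributes equally, regardless of which protein it belongs to.
    """
    total_score = 0.0
    total_residues = 0
    for scores in proteome_scores:
        total_score += sum(scores)
        total_residues += len(scores)
    if total_residues == 0:
        raise ValueError("proteome has no residues")
    return total_score / total_residues

# Hypothetical toy proteome with a 2-residue and a 4-residue protein.
toy_proteome = [[0.2, 0.8], [0.1, 0.1, 0.1, 0.7]]
toy_id_res = id_res(toy_proteome)  # (1.0 + 1.0) / 6 residues
```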

5.3. Generation of the Fingerprints and Phylogenetic Analysis

To generate the fingerprints, the LD space of species X was first divided into M × N blocks (e.g., Table 2), M for ln(L) and N for ID. This separation is reasonable because both ln(L) and ID exhibit linearity (Figure 2). Then, the protein density in the $ij$th block ($i$ in ln(L) and $j$ in ID%) is calculated as $X_{ij} = n_{ij}/n_X$, where $n_{ij}$ is the number of proteins in the $ij$th block and $n_X$ is the total number of proteins in the proteome of species X. Normalization of the protein density is realized by default since $\sum_{i=1}^{M}\sum_{j=1}^{N} X_{ij} = 1$.
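The block densities can be sketched as a normalized two-dimensional histogram. The block boundaries below are placeholders chosen for illustration (the intervals of Table 2 are not reproduced here), and `density_grid` is a hypothetical name, not code from this work.

```python
import math
from bisect import bisect_right

def density_grid(proteins, lnl_edges, id_edges):
    """Normalized protein counts per block of the LD space.

    proteins: iterable of (length_in_aa, disorder_fraction) pairs.
    lnl_edges, id_edges: sorted inner boundaries between blocks, so that
    M = len(lnl_edges) + 1 and N = len(id_edges) + 1.
    A value equal to a boundary falls into the upper block.
    """
    m, n = len(lnl_edges) + 1, len(id_edges) + 1
    counts = [[0] * n for _ in range(m)]
    total = 0
    for length, id_fraction in proteins:
        i = bisect_right(lnl_edges, math.log(length))
        j = bisect_right(id_edges, id_fraction)
        counts[i][j] += 1
        total += 1
    if total == 0:
        raise ValueError("no proteins supplied")
    return [[c / total for c in row] for row in counts]

# Hypothetical 2 x 2 partition: split at L = 300 aa and at ID = 0.5.
toy_proteins = [(100, 0.2), (100, 0.8), (500, 0.1), (2000, 0.9)]
toy_grid = density_grid(toy_proteins, [math.log(300)], [0.5])
```

Because every protein lands in exactly one block, the densities sum to 1 without a separate normalization step.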

Using the protein densities, the distance between two organisms A and B can be calculated using the Euclidean equation:

$$r_{AB} = \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(A_{ij} - B_{ij}\right)^{2}}, \quad (3)$$

where $r_{AB}$ is the distance between A and B and $X_{ij}$ (X = A or B) is the protein density in the $ij$th block. The calculated distance matrix is converted to a phylogenetic tree using the neighbor-joining method on the T-REX web server [68]. M and N and the detailed block separations may serve as variables to fine-tune the final phylogenetic tree. As a proof of concept, the phylogenetic tree reconstructed using M = N = 10 is shown in Figure 5.
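The Euclidean distance over blocks and the resulting pairwise matrix can be sketched as follows. The grids are hypothetical 2 × 2 densities and the function names are illustrative; tree building itself is not shown.

```python
import math

def ld_distance(grid_a, grid_b):
    """Euclidean distance between two density grids of identical shape."""
    if len(grid_a) != len(grid_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(grid_a, grid_b)
    ):
        raise ValueError("grids must have the same M x N shape")
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_a, row_b in zip(grid_a, grid_b)
            for a, b in zip(row_a, row_b)
        )
    )

def distance_matrix(grids):
    """Return (names, square matrix) for a dict mapping name -> density grid."""
    names = list(grids)
    matrix = [[ld_distance(grids[p], grids[q]) for q in names] for p in names]
    return names, matrix

# Hypothetical densities for three toy "proteomes".
toy_grids = {
    "A": [[0.5, 0.5], [0.0, 0.0]],
    "B": [[0.0, 0.0], [0.5, 0.5]],
    "C": [[0.25, 0.25], [0.25, 0.25]],
}
toy_names, toy_matrix = distance_matrix(toy_grids)
```

In this toy case A and B occupy disjoint blocks and are the most distant pair, while C lies halfway between them.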

The overall workflow of phylogenetic tree reconstruction is as follows: selection of the proteomes and protein sets → calculations and statistics of the intrinsic disorder contents (ID) and protein lengths of primary proteins (logarithm, ln(L)) → calculations of the protein densities in all blocks (Table 2) → calculations of the Euclidean distance between each pair of proteomes or protein sets (3) → reconstruction of the phylogenetic tree based on the distance matrix.
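For the last step of the workflow, a distance matrix has to be handed to a tree-building tool. The sketch below writes a square matrix in the PHYLIP distance layout (taxon count on the first line, then one row per taxon with a 10-character name field). Whether the T-REX server accepts exactly this layout is an assumption to be checked against its documentation; the function name and toy values are hypothetical.

```python
def to_phylip(names, matrix, digits=6):
    """Format a square distance matrix in PHYLIP style.

    Names are truncated or padded to 10 characters, as in the classic
    PHYLIP convention; distances follow, separated by spaces.
    """
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)
        values = " ".join(f"{d:.{digits}f}" for d in row)
        lines.append(f"{label} {values}")
    return "\n".join(lines) + "\n"

# Hypothetical distances between three toy proteomes.
toy_names = ["A_toy", "B_toy", "C_toy"]
toy_matrix = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
phylip_text = to_phylip(toy_names, toy_matrix)
```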


The College of Engineering & Computer Science, SimCenter, University of Tennessee Chattanooga, 701 East M. L. King Blvd., Chattanooga, TN 37403, USA, is the current address of Hao-Bo Guo. Oak Ridge National Laboratory is managed by UT-Battelle LLC for the US DOE under Contract no. DE-AC05-00OR22725. A presentation for a part of this work has been given at the Quantitative Biology 2017 Meeting in Beijing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


This work is supported by the U.S. Department of Energy (DOE), Office of Science, Genomic Science Program, under Award no. DE-SC0008834.

Supplementary Materials

Table S1. Correlation coefficients between ln(L) and ID%. Table S2. Intervals that partition the LD spaces into M × N blocks with M = N = 2 and 5. Table S3. IDPs in the mitochondrion and chloroplast of A. thaliana. Figure S1. Protein-density contour maps (see Figure 1(a) in main text for the scale bar). Figure S2. Phylogenetic trees reconstructed from the protein distributions in the LD space using (A) M = N = 2 and (B) M = N = 5. Eukaryotes are in red, prokaryotes (bacteria and Archaea) in blue, and giruses in pink branches. MEGA5 (1) was used to plot the trees. Compared to that of the M = N = 10 tree (Figure 5), the branch length of the M = N = 10 tree is larger. Figure S3. Phylogenetic tree reconstructed from gene densities on the LD space. Different versions (v01–v03) of the P. trichocarpa proteomes have been used. By default in the present work, only proteins from primary transcripts are chosen for all proteomes. Here, for P. trichocarpa proteome v03, we tested both the primary transcripts (41,434 proteins) and all transcripts (73,013 proteins). We show here that progressive improvements including the splicing variants did not make significant changes in the phylogeny.