Abstract

To facilitate the intuitive analysis of protein sequences, this paper first introduces a novel graphical representation of protein sequences called the ADLD (Alignment Diagonal Line Diagram) and then proposes an ADLD based method for analyzing the similarity/dissimilarity of protein sequences. Compared with existing methods, our ADLD based method is shown to be effective in the similarity/dissimilarity analysis of protein sequences and has the merits of being intuitive, visual, and simple. The examinations of the similarities/dissimilarities for both the 16 different ND5 proteins and the 29 different spike proteins illustrate the utility of our ADLD based approach.

1. Introduction

Homology analysis is one of the most active topics in protein sequence analysis. Many methods have been proposed for the homology analysis of protein sequences [1–3], and among them a useful one is the graphical representation of protein sequences, which has proved to be a powerful tool for the visual comparison of protein sequences.

Graphical representation methods were first introduced for representing DNA sequences in multidimensional space [4–7]. After obtaining sequence invariants from the graphics, one can compare sequences by comparing their invariants. These methods were proposed as an alternative to the direct comparison of DNA sequences, which is computationally intensive (even for sequences of restricted length) [8]. Protein sequences are to some degree similar to DNA sequences in that both are composed of a small set of different units. Thus, graphical representation methods can naturally be extended to describe protein sequences.

Currently, many researchers have proposed different methods for the graphical representation of protein sequences [9–24]. For example, Feng and Zhang [25] suggested Zp-curve based on the hydrophobicity and charged properties of amino acid residues along the primary sequence. Randić et al. [26] introduced a graphical representation of protein sequences based on a graphical representation of triplets of DNA in which the interior of a square or a tetrahedron is utilized to accommodate 64 sites for the 64 codons. Bai and Wang [27] derived a 2D graphical representation of protein sequences based on nucleotide triplet codons. Yao et al. [28] outlined a 2D graphical representation of protein sequences based on two classifications of amino acids. Abo el Maaty et al. [29] proposed a novel unique 3D graphical representation of protein sequences based on three physicochemical properties of amino acid side chains. Abo-Elkhier introduced a 3D graphical representation of protein sequence based on a right cone of a unit base and unit height on protein sequences interfaces [30]. El-Lakkani and El-Sherif [31] proposed a graphical representation of protein sequence to help similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices. Ma et al. [32] introduced a family of Iterated Function Systems (IFS) to outline a 2D graphical representation of protein sequences.

The main drawback of most of these existing methods is that the higher the dimension of the protein sequence graphs, the heavier the computational complexity of the methods or the lower the recognizability of the graphs. For example, in the methods proposed in [26, 28], the lines cross each other, which decreases the visibility of the graphics. In the methods proposed in [29–31], the 3D graphics are more complex and less legible than 2D graphics, and, in addition, obtaining the sequence invariants from the graphics requires constructing complex matrices, which demands much computation and storage.

Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences [33]. Many algorithms have been implemented for sequence alignment [34–37]. These methods are usually effective but complex and time consuming. Compared with alignment methods, existing graphical representation methods can also display the inner structure of protein sequences and make the similarity/dissimilarity more visible through their graphics. In this paper, we propose a novel method for analyzing the similarity/dissimilarity of protein sequences that combines the ideas of sequence alignment and graphical representation and thereby avoids, to some degree, the weaknesses of both.

Principal components analysis (PCA) is a standard tool in multivariate data analysis for reducing the number of dimensions, and it has proved effective in protein sequence analysis [38–40]. Therefore, to overcome the main drawbacks of existing methods, this paper introduces a novel graphical representation of protein sequences called the ADLD (Alignment Diagonal Line Diagram) based on PCA and then proposes an ADLD based method for analyzing the similarity/dissimilarity of protein sequences. In addition, to validate the effectiveness of our ADLD based method, we apply it to analyze the similarity/dissimilarity of both the 16 different ND5 proteins and the 29 different spike proteins, which are widely used as test data [16–26]. The results show that our method is not only visual, intuitive, and effective in the similarity/dissimilarity analysis of protein sequences but also quite simple, since no high dimensional matrices need to be constructed.

2. Materials and Methods

2.1. Procedure of Our Method for Analysis of Protein Sequences

In this section, we first outline the overall procedure of our method for analyzing protein sequences.
(1) Select the same 9 properties for each amino acid and construct a 20 × 9 matrix, covering all 20 amino acids, as the input data of the PCA algorithm.
(2) According to the PCA algorithm, obtain a unique feature for each amino acid.
(3) For each protein sequence in the test data, replace each amino acid in the protein sequence with its corresponding unique feature, which transforms the protein sequence into a numerical sequence.
(4) For any two numerical sequences, draw a graph, named the ADLD, and then extract some of its numerical characteristics, which can be utilized to analyze the similarity/dissimilarity of these two sequences.

Next, in Sections 2.2–2.6, we introduce the details of constructing the ADLDs and obtaining some of their numerical characteristics. In Section 3.1, we give the method for constructing the similarity/dissimilarity matrix of our test sequence groups.

2.2. Amino Acids and Their Properties

Proteins are composed of 20 different amino acids, and these amino acids have many different physicochemical and biological properties, such as the molecular weight, the hydropathy index, the pK value of the terminal carboxyl group (COOH), the pK value of the terminal amino group, the isoelectric point, the solubility, the number of triplet codons, the frequency in human proteins, and the van der Waals radius of the side chain. The names and symbols of the 20 amino acids and the values of their 9 major properties are listed in Table 1.

2.3. Principal Components Analysis

Principal components analysis (PCA) is a common technique for dimensionality reduction and pattern recognition in datasets of high dimension [41]. The main purposes of PCA are the analysis of data to identify patterns and finding patterns to reduce the dimensions of the dataset with minimal loss of information. The general steps of conducting PCA are as follows.

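Step 1. For $n$ samples $x_1, x_2, \ldots, x_n$, suppose that each $x_i$ has $p$ components, let $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ for $i = 1, 2, \ldots, n$, and then construct an $n \times p$ matrix $X$ according to the following formula first:
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}.$$
Next, based on the matrix $X$, construct the corresponding standardized matrix $Z$ according to the following formula:
$$Z = (z_{ij})_{n \times p}, \qquad z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$
where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ and $s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$.

Step 2. Based on the matrix $Z$, construct the correlation matrix $R$ according to the following formula:
$$R = (r_{ij})_{p \times p} = \frac{Z^{T} Z}{n - 1},$$
where we can find that $r_{ii} = 1$ and $r_{ij} = r_{ji}$ for $i, j = 1, 2, \ldots, p$.

Step 3. From the correlation matrix $R$, obtain its eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and the corresponding unit eigenvectors $a_1, a_2, \ldots, a_p$, where $a_i = (a_{i1}, a_{i2}, \ldots, a_{ip})^{T}$. We can then obtain the $p$ principal components $F_i$ for $i = 1, 2, \ldots, p$ as follows:
$$F_i = a_{i1} Z_1 + a_{i2} Z_2 + \cdots + a_{ip} Z_p,$$
where $Z_j$ denotes the $j$th column of $Z$.

Step 4. For each principal component $F_i$, $i = 1, 2, \ldots, p$, obtain its contribution rate $\alpha_i$ and accumulated contribution rate $\beta_i$ according to the following formulas, respectively:
$$\alpha_i = \frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}, \qquad \beta_i = \frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k}.$$
Generally, in order to lower the computational complexity, we can keep only the first $m$ ($m \le p$) principal components $F_1, F_2, \ldots, F_m$, where the accumulated contribution rate of the $m$th principal component shall satisfy $\beta_m \ge 85\%$.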

Step 5. For , let Then, for each , we can obtain the total score of the th sample as follows:
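As an illustration only, the general PCA steps of this section can be sketched in NumPy as below. The function name `pca_total_scores`, the random demo matrix, and in particular the weighting used for the total score (each retained component weighted by its contribution rate) are assumptions made for this sketch; the exact total-score formula of the paper is not reproduced here. Note also that eigenvectors are only defined up to sign, so the sign of the scores may differ between implementations.

```python
import numpy as np


def pca_total_scores(X, threshold=0.85):
    """Correlation-matrix PCA sketch: returns (scores, contrib, accumulated, m)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Step 1: standardize every column (sample standard deviation, ddof=1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: correlation matrix of the p variables.
    R = Z.T @ Z / (n - 1)

    # Step 3: eigen-decomposition; eigh returns ascending eigenvalues,
    # so reorder them (and the eigenvectors) in descending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: contribution rates and accumulated contribution rates;
    # keep the smallest m whose accumulated rate reaches the threshold.
    contrib = eigvals / eigvals.sum()
    accumulated = np.cumsum(contrib)
    m = min(int(np.searchsorted(accumulated, threshold)) + 1, p)

    # Step 5 (ASSUMED weighting): total score of each sample as the
    # contribution-weighted sum of its first m principal components.
    components = Z @ eigvecs[:, :m]
    scores = components @ contrib[:m]
    return scores, contrib, accumulated, m


# Demo on random data standing in for a 20 x 9 property table (not Table 1).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 9))
scores, contrib, accumulated, m = pca_total_scores(X_demo)
print("components kept:", m)
print("accumulated rates:", np.round(accumulated[:m], 4))
```

With a real 20 × 9 property table in place of `X_demo`, `scores` would hold one value per amino acid, which is the kind of per-residue feature the later sections rely on.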

2.4. PCA of the Amino Acids

Regarding the 20 amino acids in Table 1 as 20 samples and the 9 properties of each amino acid as its 9 components, the general steps of PCA given in Section 2.3 yield a 20 × 9 data matrix $X$ and its standardized matrix $Z$, a 9 × 9 correlation matrix $R$, and 9 principal components $F_1, F_2, \ldots, F_9$. As illustrated in Table 2, we thereby obtain the 9 eigenvalues of $R$ together with the contribution rates and the accumulated contribution rates of the 9 principal components.

From Table 2, we can see that the accumulated contribution rate of the first 4 principal components amounts to 0.8588 (= 85.88%), which already exceeds 85%. Therefore, we keep only the first 4 principal components. Let $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ be the 4 eigenvalues corresponding to the first 4 principal components, respectively; then, as illustrated in Table 3, we can obtain the 4 eigenvectors corresponding to these 4 eigenvalues.

Based on Table 3, we can obtain the first 4 principal components as follows:

Observing the above 4 formulas, it is easy to find that there are three big coefficients in the first formula, which are 0.5036 (corresponding to ), 0.4377 (corresponding to ), and 0.4349 (corresponding to ), respectively. Therefore, it means that the three properties such as , , and will have a major role in the first principal component . Similarly, we can also know that the three properties such as , , and will have a major role in the second principal component , the third principal component is mainly determined by and , and the fourth principal component is closely linked with and and so forth. Hence, we can obtain the total scores of the 20 amino acids as illustrated in Table 4 according to formula (9).

2.5. Numerical Sequences of Protein Sequences

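Let $\Omega$ denote the set of the 20 amino acids and suppose that $S = s_1 s_2 \cdots s_n$ represents a protein sequence with $n$ amino acids, where $s_i \in \Omega$ for $i = 1, 2, \ldots, n$; then we can obtain a numerical sequence corresponding to the protein sequence $S$ by replacing each amino acid $s_i$ in $S$ with its corresponding total score (Table 4).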

For example, consider the following 3 abbreviated protein sequences:
Hu = MTMHTTMTTL,
Gor = MTMYATMTTL,
Opo = MKVINISNTM.

According to the above descriptions and Table 4, then we can obtain their corresponding numerical sequences as follows:
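The replacement step can be sketched in Python as follows. The scores in `PLACEHOLDER_SCORE` are hypothetical stand-ins (simply the index of each residue in an alphabetical list), not the total scores of Table 4, and the helper name `to_numerical_sequence` is likewise an assumption of this sketch; the output therefore shows the mechanism only, not the actual numerical sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# HYPOTHETICAL scores: the index of each residue, used only as a placeholder.
# In practice this table would hold the PCA total scores of the 20 amino acids.
PLACEHOLDER_SCORE = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}


def to_numerical_sequence(protein, score=PLACEHOLDER_SCORE):
    """Replace every amino acid of `protein` with its score."""
    unknown = set(protein) - set(score)
    if unknown:
        raise ValueError(f"unknown residues: {sorted(unknown)}")
    return [score[aa] for aa in protein]


hu = to_numerical_sequence("MTMHTTMTTL")
gor = to_numerical_sequence("MTMYATMTTL")
opo = to_numerical_sequence("MKVINISNTM")

# Positions (0-based) at which Hu and Gor carry different residues.
differing = [i for i, (a, b) in enumerate(zip(hu, gor)) if a != b]
print(differing)
```

Because identical residues always map to identical numbers, comparing two numerical sequences position by position recovers exactly the positions at which the underlying residues differ.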

2.6. ASDs and ADLDs of Protein Sequence Pairs

For a given protein sequence pair (), suppose that the protein sequence includes amino acids, includes amino acids, and ; then, in order to measure the similarity/dissimilarity between them, in this section, we will present a new method called Alignment Scatter Diagram (ASD) to plot the two sequences into a scatter diagram first. And, for convenience, we call the points in the ASD the alignment-plots (APs). The ASD of the protein sequence pair () can be obtained through the following steps.

Step 1. According to the method given in Section 2.5, translate the protein sequence pair () into two numerical sequences with the same length as follows:

Step 2. Let be the alignment width (AW) of the protein sequence pair (); that is, let , ; then, for any amino acid in the protein sequence , we will compare it with these amino acids in the protein sequence , and then can be simply defined as follows: where is a given threshold to guarantee that the AW of the protein sequence pair () will not be too small to expose the association of the inner structures of the protein sequence pair (). In actual applications, we suggest that shall be no less than 10.

Step 3. Let be the dissimilarity degree (DD) of two amino acids; that is, if , then it means that the two amino acids are the same; otherwise, it means that the two amino acids are different from each other to some degree, and then the APs in the ASD of the protein sequence pair () can be briefly defined as follows: where , , and is a Heaviside function, which can be defined as follows:
Thereafter, we can obtain an alignment matrix () as follows:

Step 4. For the elements in the alignment matrix , we can plot points on - plane for these elements in the with and . And, for convenience, we call the obtained graph the Alignment Scatter Diagram (ASD) of the protein sequence pair ().
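A minimal sketch of how such a scatter diagram could be computed is given below. It rests on simplifying assumptions that are not taken from the formulas of this section: the dissimilarity degree of two residues is taken to be the absolute difference of their numerical values, an AP is recorded when that difference does not exceed a tolerance `tol` (0 by default, meaning identical residues), and each position `i` of the first sequence is compared only with positions `j` of the second sequence satisfying `|i - j| <= width`. The function name and the toy inputs are hypothetical.

```python
def alignment_points(x, y, width, tol=0.0):
    """Return the set of 0-based (i, j) alignment points inside the band."""
    points = set()
    for i, xi in enumerate(x):
        lo = max(0, i - width)           # first column inside the band
        hi = min(len(y) - 1, i + width)  # last column inside the band
        for j in range(lo, hi + 1):
            # ASSUMED dissimilarity degree: absolute difference of the values.
            if abs(xi - y[j]) <= tol:
                points.add((i, j))
    return points


# Toy numerical sequences (not derived from real proteins).
x_demo = [1.0, 2.0, 1.0, 3.0]
y_demo = [1.0, 2.0, 4.0, 3.0]
points_w1 = alignment_points(x_demo, y_demo, width=1)
points_w2 = alignment_points(x_demo, y_demo, width=2)
print(sorted(points_w1))
print(sorted(points_w2))
```

Widening the band can only add points: in the toy example, the wider band additionally picks up an off-diagonal match between position 2 of the first sequence and position 0 of the second.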

For example, considering the three β-globin protein sequences of chimpanzee [GenBank: AAA16334.1], human [GenBank: CAA26204.1], and gorilla [GenBank: CAA43421.1] obtained from the GenBank, respectively, we illustrate the ASDs of the β-globin protein sequence pair (chimpanzee, human) and the β-globin protein sequence pair (human, gorilla) in Figures 1(a) and 1(b) separately while letting .

From Figure 1, it is easy to see that there are many scattered, disordered points in these ASDs, which markedly lower the visual clarity of the ASDs and prevent us from intuitively distinguishing the similarity/dissimilarity between the protein sequence pairs. Therefore, in order to make the ASD more intuitive, we propose a simplified variant of the ASD, which is called the Alignment Diagonal Line Diagram (ADLD).

For convenience, in an ASD, we call its main diagonal line the artery track (AT) and the lines parallel to its main diagonal line the by-path tracks (BTs). In addition, we define a set of consecutive APs on the AT or on a BT whose size is no less than a given threshold as a CAPS.

For a given CAPS caps1, if there is no CAPS caps2 satisfying caps1 ⊂ caps2, then we call caps1 a maximum CAPS. For convenience, we call the line formed by connecting all of the APs in a maximum CAPS a similar fragment (SF), and we call all of the APs on the AT but not on any SF the free points (FPs).

Clearly, if we keep only the SFs and FPs in an ASD and omit all other APs, we obtain a simplified variant of the ASD, which we call the Alignment Diagonal Line Diagram (ADLD). If the CAPS length threshold equals 1, an ADLD degenerates into an ASD. Therefore, in actual applications, we suggest that the threshold be no less than 2. In particular, to find more accurate SFs in the ADLD of a protein sequence pair, the longer the protein sequences in the pair are, the bigger the threshold should be.
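The reduction from an ASD to an ADLD can be sketched as a search for maximal runs of consecutive points on each diagonal. In the sketch below, the function name `adld`, the parameter name `min_len` for the CAPS length threshold, the 0-based coordinates, and the encoding of a segment as `(offset, start_row, length)` with `offset = j - i` (0 for the AT, nonzero for a BT) are all choices made for illustration.

```python
from collections import defaultdict


def adld(points, min_len=2):
    """Split alignment points into similar fragments (SFs) and free points (FPs).

    Returns (segments, free_points); a segment is (offset, start_row, length).
    """
    diagonals = defaultdict(list)
    for i, j in points:
        diagonals[j - i].append(i)

    segments, free_points = [], []
    for offset, rows in diagonals.items():
        rows.sort()
        # Collect maximal runs of consecutive rows on this diagonal.
        runs, start, prev = [], rows[0], rows[0]
        for i in rows[1:]:
            if i != prev + 1:
                runs.append((start, prev - start + 1))
                start = i
            prev = i
        runs.append((start, prev - start + 1))

        for run_start, length in runs:
            if length >= min_len:
                segments.append((offset, run_start, length))
            elif offset == 0:
                # Short runs on the main diagonal survive as free points.
                free_points.extend((i, i) for i in range(run_start, run_start + length))
            # Short runs on by-path tracks are discarded.
    return sorted(segments), sorted(free_points)


demo_points = {(0, 0), (1, 1), (2, 2), (4, 4), (5, 6), (6, 7), (7, 8), (9, 3)}
segments, free_points = adld(demo_points, min_len=2)
print(segments)
print(free_points)

# With a threshold of 1, every point belongs to some segment.
all_segments, no_free = adld(demo_points, min_len=1)
```

In the demo, the isolated point `(9, 3)` lies on a by-path track and is dropped when `min_len=2`, whereas `(4, 4)` lies on the main diagonal and is kept as a free point. With `min_len=1`, the segments cover every input point, which matches the remark that the ADLD then degenerates into the ASD.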

For convenience of analysis, in an ADLD, suppose that there are different SFs and different FPs on its AT, different BTs locating above its AT, and different BTs locating below its AT; then we get the following.(1)For these different SFs and different FPs on the AT of the ADLD, we will number these SFs and FPs from left to right and utilize and to represent these SFs and FPs separately. And, in addition, we would also call these SFs on the AT of the ADLD the ASFs.(2)For these different BTs locating above the AT, we will number these BTs from down to up and utilize to represent these BTs separately, and, for these different BTs locating below the AT, we will number these BTs from up to down and utilize to represent these BTs separately.(3)For each , where , suppose that there are different SFs on the ; then we will number these SFs from left to right and utilize to represent these SFs separately. And, in addition, we would also call these SFs on the BTs of the ADLD the BSFs.

According to the above assumptions, in Figure 2, we show the two ADLDs corresponding to the ASDs illustrated in Figures 1(a) and 1(b) while letting . And, in addition, to make the ADLDs more visual and intuitional, in Figure 2, we use the red “” to represent the FPs on the AT and the blue lines to represent the SFs on the AT or BTs.

From Figure 2(a), it is easy to see that there are two SFs in the ADLD of the sequence pair (chimpanzee, human); one is , that is, the line segment from the point to the point , and the other is , that is, the line segment from the point to the point . And, in addition, there are totally 6 FPs in the ADLD, which are , , , , , and , respectively.

Observing Figure 2(b), we can easily find that there are also two SFs in the ADLD of the sequence pair (human, gorilla). But, different from that in Figure 2(a), the two SFs in Figure 2(b) are both ASFs; one is , that is, the line segment from the point to the point , and the other is , that is, the line segment from the point to the point . And, in addition, the two ASFs in Figure 2(b) are separated by one gap, and there exist no FPs or BSFs on the AT or BTs.

Through analysis, we can know that, for a given protein sequence pair, if there exist some deletions or insertions of amino acid segments between the two protein sequences, then there will exist some misalignments of SFs in their ADLD; that is, some ASFs on the AT will be transformed into BSFs on some BTs. And, in addition, if there exist some substitutions of the amino acids between the two protein sequences, then, in their ADLD, there will exist some gaps between two neighboring SFs or FPs on the AT. Furthermore, if there exist some insertions, deletions, or substitutions of the amino acid segments at the end of the two protein sequences, then, in their ADLD, there will exist no SFs or FPs on the AT or BTs.

From the above descriptions, it is easy to know that the ADLD of any given protein sequence pair obtained by our above proposed method reflects some inner and specific differences between these two protein sequences in the given protein sequence pair, which may be useful in the similarity/dissimilarity analysis of protein sequence pairs.

3. Results and Discussion

3.1. Method for Similarity/Dissimilarity Analysis of Protein Sequences Based on the ADLDs

According to the above analysis, we have known that the ADLDs may be useful in analyzing the differences of the inner structures of protein sequence pairs. In this section, we will show how to utilize the ADLDs to analyze the similarity/dissimilarity of a group of protein sequences.

Generally, suppose that there are protein sequences ; then while applying the ADLDs to analyze the similarity/dissimilarity of these sequences, the similarity/dissimilarity matrix of these sequences can be obtained through the following steps.

Step 1. According to the method given in Section 2.5, transform these protein sequences into numerical sequences .

Step 2. For a given protein sequence pair , , , we can obtain their ADLD through adopting the method proposed in Section 2.6, and then we can obtain all of the SFs (including ASFs and BSFs) and FPs in the ADLD. Hence, we can obtain the lengths of these ASFs, the lengths of these BSFs, and the number of these FPs, respectively.

Step 3. Suppose that there are totally different ASFs such as , different BSFs such as , and different FPs such as in the ADLD. And, in addition, for each and , let their length be and , respectively, where and ; then we can define the similarity degree () of as follows:
And, therefore, according to these protein sequences , we can obtain an matching matrix () as follows: where

Step 4. Based on the matching matrix and all of its components , where and , then we can obtain an similarity/dissimilarity matrix () of these protein sequences as follows: where

According to the above steps, we present an example through implementing the ADLDs to analyze the similarity/dissimilarity of 16 ND5 proteins (illustrated in Table 5) while letting and illustrate the results of similarity/dissimilarity matrix in Table 6.

From Table 6, we can identify several similar pairs, such as (c-chim, pi-chim) with the distance 0.0510, (human, c-chim) with the distance 0.0814, (human, pi-chim) with the distance 0.0720, (gorilla, c-chim) with the distance 0.0865, (gorilla, pi-chim) with the distance 0.0833, and (fin-whale, blue-whale) with the distance 0.0324. Among the mammals, the opossum appears to be peculiar, since the shortest distance between it and the remaining mammals is as large as 0.4023. This result is consistent with the fact that the opossum is the most remote species from the remaining mammals.

Additionally, gallus appears to be even more peculiar than opossum, since the shortest distance between it and the remaining animals is 0.4423, which is bigger than 0.4023 (the shortest distance between opossum and the remaining mammals). This result is consistent with the fact that gallus is not a mammal.

Therefore, the results illustrated in Table 6 are consistent with the known facts of evolution. That is to say, our ADLDs based method can be utilized as an effective way to analyze the similarities/dissimilarities of protein sequences.

3.2. The Phylogenetic Tree of the Protein Sequences Based on the ADLDs

A phylogenetic tree is a diagram that is used to represent the evolutionary relationships of organisms that are thought to have a common ancestry, and it is a commonly used tool for researchers in some fields to help them analyze the clustering of different species.

Inspecting the similarity/dissimilarity matrix in Table 6 directly is not a very convenient way to distinguish the similarity/dissimilarity of protein sequences. Therefore, to show the similarity/dissimilarity of the protein sequences more vividly and intuitively, we construct the phylogenetic tree of the above 16 ND5 proteins from the similarity/dissimilarity matrix in Table 6 using the software MEGA 6.06 provided by Tamura et al. [41], and the result is illustrated in Figure 3.
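To make the step from a distance matrix to a tree concrete, the sketch below applies UPGMA (average-linkage clustering) in plain Python to the three pairwise distances among human, c-chim, and pi-chim quoted in Section 3.1. UPGMA is used here only as an assumed stand-in: the tree-building method selected in MEGA is not stated in the text, and the function name `upgma` and the nested-tuple encoding of clusters are choices of this sketch.

```python
def upgma(names, dist):
    """Average-linkage clustering; returns the list of merges (a, b, distance)."""
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        closest = min(d, key=d.get)
        a, b = sorted(closest, key=str)  # fixed order for reproducibility
        merged = (a, b)
        merges.append((a, b, d[closest]))

        new_d = {}
        for c in sizes:
            if c in closest:
                continue
            # Size-weighted average of the distances from a and b to c.
            total = sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
            new_d[frozenset((merged, c))] = total / (sizes[a] + sizes[b])
        for pair, value in d.items():
            if not pair & closest:
                new_d[pair] = value

        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        d = new_d
    return merges


# Distances quoted from Table 6 in Section 3.1.
species = ["human", "c-chim", "pi-chim"]
distances = {
    ("human", "c-chim"): 0.0814,
    ("human", "pi-chim"): 0.0720,
    ("c-chim", "pi-chim"): 0.0510,
}
merges = upgma(species, distances)
for a, b, height in merges:
    print(a, "+", b, "at", round(height, 4))
```

The two chimpanzees are joined first at distance 0.0510, and human is then attached to that cluster at the average of its two distances, (0.0814 + 0.0720) / 2 = 0.0767.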

From Figure 3, it is obvious that we can not only find out the evolutionary relationships of these 16 ND5 protein sequences visually and intuitively but also know easily that the constructed phylogenetic tree is consistent with the results of the known fact of evolution to some degree.

To further validate the performance of our ADLDs based method, we applied our method to analyze the similarity/dissimilarity of another group of proteins including 29 spike proteins of coronavirus and compared our method with the method proposed by Wen and Zhang [17] based on the above given 16 ND5 proteins and the following 29 spike proteins, respectively. The basic information of the 29 spike proteins is illustrated in Table 7.

For the 29 spike proteins illustrated in Table 7, we construct the phylogenetic tree in Figure 4. Since the spike protein sequences are very long (with more than 1100 amino acids), therefore, during simulation, we set to avoid the effect of noise points.

Generally, coronaviruses are classified into four classes: Group I, Group II, Group III, and the SARS-CoVs (Severe Acute Respiratory Syndrome Coronaviruses). Among these four classes, Group I includes the Canine coronavirus (CCoV), the Feline coronavirus (FCoV), the Human coronavirus 229E (HCoV-229E), the Porcine epidemic diarrhea virus (PEDV), and the Transmissible gastroenteritis virus (TGEV). Group II includes the Bovine coronavirus (BCoV), the Human coronavirus OC43 (HCoV-OC43), the Murine coronavirus Mouse hepatitis virus (MHV), the Porcine hemagglutinating encephalomyelitis virus (HEV), and the Rat coronavirus (RtCoV). Group III contains the Avian infectious bronchitis virus (IBV) and the Turkey coronavirus (TCoV).

From Figure 4, it is easy to see that the 29 spike proteins of coronavirus can be perfectly classified into the above four classes by our ADLDs based method.

Finally, for the convenience of comparison, we illustrate the phylogenetic trees of the above given 29 spike proteins of coronavirus and 16 ND5 proteins, constructed by adopting the method proposed by Wen and Zhang [17], in Figures 5 and 6, respectively.

Comparing Figure 3 with Figure 6 and Figure 4 with Figure 5, respectively, it is obvious that the phylogenetic trees obtained by the method proposed by Wen and Zhang are quite unreasonable and not consistent with the known facts of evolution at all. But, on the contrary, the phylogenetic trees obtained by our ADLDs based method are not only quite reasonable but also consistent with the known facts of evolution to some degree. Therefore, there is no doubt that the performance of our method is much better than that of the method proposed by Wen and Zhang.

3.3. The Analysis of Intuition and Visuality of the ADLDs

In Section 2.6, we have stated that the ADLDs of protein sequence pairs are intuitional and visual. In this section, we will further discuss the intuition and visuality of the ADLDs in detail.

From Table 6, we can obtain some similar pairs such as (fin-whale, blue-whale), (pi-chim, c-chim), (human, c-chim), (sheep, goat), (human, pi-chim), and (hare, rabbit) and some dissimilar pairs such as (human, opossum) and (human, gallus), among the above given 16 ND5 proteins. As examples, we choose the three pairs (human, gorilla), (human, opossum), and (human, gallus) to further show the intuition and visuality of their ADLDs. The ADLDs of these three similar/dissimilar pairs are illustrated in Figure 7, while letting .

Observing Figure 7, we can clearly see that the total lengths of all of the SFs in these three ADLDs are ordered as Figure 7(a) > Figure 7(b) > Figure 7(c). Therefore, we can intuitively identify that the similarities of these three protein sequence pairs are ordered as (human, gorilla) > (human, opossum) > (human, gallus).

Moreover, from Figure 7, we can also intuitively identify that the two protein sequences in the protein sequence pair (human, gorilla) are very similar to each other, since the total length of all of the SFs in the ADLD of Figure 7(a) looks very long. But, on the contrary, we can intuitively identify that the two protein sequences in either the protein sequence pair (human, opossum) or the protein sequence pair (human, gallus) are apparently dissimilar to each other, since both the total length of all of the SFs in the ADLD of Figure 7(b) and that in the ADLD of Figure 7(c) look very short.

By counting, we find that the actual total lengths of all of the SFs in the ADLDs of these three protein sequence pairs (human, gorilla), (human, opossum), and (human, gallus) are 556, 288, and 248, respectively.

Additionally, observing Figures 2(a) and 2(b), we can hardly distinguish the total length of all of the SFs (including ASFs and BSFs) in the ADLD of Figure 2(a) from that in the ADLD of Figure 2(b), since the total lengths of all of the SFs in these two ADLDs look nearly the same. By counting, we find that the actual total lengths of all of the SFs in the ADLDs of Figures 2(a) and 2(b) are 123 and 120, respectively, which are indeed close to each other. However, comparing Figure 2(a) with Figure 2(b) more carefully, we can further discover that, in addition to the SFs, there are also 6 different FPs in the ADLD of Figure 2(a), while there are no FPs in the ADLD of Figure 2(b); therefore, we can intuitively identify that the two protein sequences in the pair (chimpanzee, human) are more similar to each other than the two protein sequences in the pair (human, gorilla).

Hence, from the above descriptions, we can know that the ADLDs obtained by our newly proposed method are quite visual and intuitional and may be a powerful and effective tool for visual comparison of protein sequences and numerical sequences in other research fields.

4. Conclusions

In this paper, a novel ADLDs based graphical representation of protein sequences is proposed and utilized to analyze the similarity/dissimilarity of protein sequences. To validate the performance of the new method, we select two groups of well-known protein sequences as examples, and, in order to observe the similarity/dissimilarity of protein sequences more intuitively, we construct the phylogenetic trees of these protein sequences. The results show that our ADLDs based method not only performs well in the similarity/dissimilarity analysis of protein sequences but also does not require complex computation, since no high dimensional matrices are required. Therefore, our ADLDs based method can work well in the analysis of protein sequences.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the anonymous referees for suggestions that helped in improving the paper substantially. This work was partly sponsored by the Colleges and Universities Open Innovation Platform Fund of Hunan Province (no. 13K041), the Hunan Provincial Natural Science Foundation of China (no. 14JJ2070), the Construct Program of the Key Discipline in Hunan province, the State Education Ministry Scientific Research Foundation for the Returned Overseas Chinese Scholars, and the Introduced Talent Start-up Fund Project of Xiangtan University (no. 11QDZ45).