BioMed Research International

Research Article | Open Access

Volume 2019 | Article ID 8702968 | https://doi.org/10.1155/2019/8702968

Marwa A. Abd Elwahaab, Mervat M. Abo-Elkhier, Moheb I. Abo el Maaty, "A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector", BioMed Research International, vol. 2019, Article ID 8702968, 9 pages, 2019. https://doi.org/10.1155/2019/8702968

A Statistical Similarity/Dissimilarity Analysis of Protein Sequences Based on a Novel Group Representative Vector

Academic Editor: Paul Harrison
Received 06 Jan 2019
Revised 13 Mar 2019
Accepted 07 Apr 2019
Published 08 May 2019

Abstract

Similarity/dissimilarity analysis is a key way of understanding the biology of an organism by identifying the origin of new genes and sequences. Sequence data are grouped according to biological relationships, and the number of sequences in any group can grow every day. Existing alignment-free methods demonstrate their utility by producing a similarity/dissimilarity matrix. Although this matrix is clear, it measures the degree of similarity among sequences only one pair at a time. In our work, a representative is introduced for each of three groups of protein sequences. A similarity/dissimilarity vector, evaluated against the group representative, replaces the ordinary similarity/dissimilarity matrix. The approach is applied to three selected groups of protein sequences: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. A cross-group comparison is performed to confirm the individuality of each group. A qualitative comparison between our approach, previous articles, and the phylogenetic tree of these protein sequences supports the utility of our approach.

1. Introduction

Sequence comparison is used to study structural and functional conservation and evolutionary relations among sequences. The similarity/dissimilarity of biological sequences matters because of its relationship with structure and function: proteins with similar sequences usually have similar structures. The rate at which new sequences are added to the databases is increasing exponentially [1]. Comparing these new sequences with those of known function is a key way of understanding the biology of an organism. Thus, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. Many tools and techniques provide such sequence comparisons.

Sequence comparison can be classified into alignment-based methods and alignment-free methods [2, 3]. Alignment-based methods assign scores to the different possible alignments and pick the alignment with the highest score. Algorithms exist for both global and local alignment [4–6]. BLAST [7] and FASTA [8] are the most widely used applications. Alignment-based methods become computationally expensive when many sequences must be aligned at the same time. A wide range of scoring systems has been proposed, such as the amino acid substitution scoring matrices PAM and BLOSUM for protein alignment [9].

Alignment-free approaches overcome the limitations of alignment-based methods. Graphical representation approaches are one such family. A graphical representation is usually accompanied by a numerical characterization and then by a descriptor that describes each protein sequence. A similarity/dissimilarity analysis is then carried out on these descriptors by evaluating the Euclidean distance or the correlation angle between them. The smaller the Euclidean distance or correlation angle, the more similar the sequences. Many graphical representations of DNA and protein primary sequences have been proposed. Other approaches, known as nongraphical representation methods, characterize protein sequences numerically without any prior graphical representation [10, 11].

In this article, an alignment-free method is introduced. It is a nongraphical representation method. Three groups of protein sequences are selected to illustrate our approach: beta globin, NADH dehydrogenase subunit 5 (ND5), and spike protein sequences. They are selected because the sequences within each group fall in a similar range of lengths. The most common sequences of each group are selected. The selected sample is assumed to be unbiased, and the population distribution of each group is assumed to be normal. Therefore, the selected sample represents the group, and statistics can be used to estimate the population's parameters. The adjacency vector is introduced as a novel descriptor for protein sequences. It is computed for each sequence in the selected sample of each of the three groups. A reference vector is then computed for each group. This vector acts as a representative of the group. The degree of similarity of each sequence is measured against its group's representative vector. Thus, a similarity/dissimilarity vector is constructed instead of the ordinary similarity/dissimilarity matrix. Our approach is independent of the protein sequence length. It does not require any prior graphical representation. It is a mathematically simple approach.

2. Dataset, Technology, and Tools

The protein sequences used in this article are listed in Tables 1, 2, and 3. The sequences are downloaded from the National Center for Biotechnology Information (NCBI) “https://www.ncbi.nlm.nih.gov/” as FASTA files. These FASTA files are imported into Wolfram Mathematica 8 where all the results and figures are produced. The phylogenetic tree of these protein sequences is also created by the Basic Local Alignment Search Tool (BLAST) “https://blast.ncbi.nlm.nih.gov/Blast.cgi”.
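The article imports its FASTA files into Wolfram Mathematica 8; no code is given. As a rough illustration only, the following Python sketch shows one way to read FASTA-formatted text into a dictionary of sequences. The function name `parse_fasta` and the record headers are hypothetical; the two sequences are the short yeast segments quoted later in the article, split over two lines to show multi-line records.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    Hypothetical helper for illustration; it is not the authors' Mathematica code.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]       # header line without the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence lines may wrap
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical headers; sequences are the two 30-residue segments used in Section 3.
example_text = """>example_1 hypothetical record
WTFESRNDPAKDPVI
LWLNGGPGCSSLTGL
>example_2 hypothetical record
WFFESRNDPANDPII
LWLNGGPGCSSFTGL
"""

sequences = parse_fasta(example_text)
for name, sequence in sequences.items():
    print(name, len(sequence))
```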


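Table 1: Beta globin protein sequences.

| No. | Species | Access No. | Length |
|---|---|---|---|
| 1 | Human | AAA16334 | 147 |
| 2 | Chimpanzee | CAA26204 | 125 |
| 3 | Gorilla | CAA43421 | 121 |
| 4 | Mouse | CAA24101 | 147 |
| 5 | Rat | CAA29887 | 147 |
| 6 | Gallus | CAA23700 | 147 |
| 7 | Opossum | AAA30976 | 147 |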


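Table 2: ND5 protein sequences.

| No. | Species | Access No. | Length |
|---|---|---|---|
| 1 | Human | AP_000649 | 603 |
| 2 | Gorilla | NP_008222 | 603 |
| 3 | Pigmy Chimpanzee | NP_008209 | 603 |
| 4 | Common Chimpanzee | NP_008196 | 603 |
| 5 | Fin Whale | NP_006899 | 606 |
| 6 | Blue Whale | NP_007066 | 606 |
| 7 | Rat | AP_004902 | 610 |
| 8 | Mouse | NP_904338 | 607 |
| 9 | Opossum | NP_007105 | 602 |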


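Table 3: Spike protein sequences.

| No. | Access No. | Class No. | Abbreviation | Length |
|---|---|---|---|---|
| 1 | CAB91145 | I | TGEVG | 1447 |
| 2 | NP058424 | I | TGEV | 1447 |
| 3 | AAK38656 | I | PEDVC | 1383 |
| 4 | NP598310 | I | PEDV | 1383 |
| 5 | NP937950 | II | HCoVOC43 | 1361 |
| 6 | AAK83356 | II | BCoVE | 1363 |
| 7 | AAL57308 | II | BCoVL | 1363 |
| 8 | AAA66399 | II | BCoVM | 1363 |
| 9 | AAL40400 | II | BCoVQ | 1363 |
| 10 | AAB86819 | II | MHVA | 1324 |
| 11 | YP 209233 | II | MHVJHM | 1376 |
| 12 | AAF69334 | II | MHVP | 1321 |
| 13 | AAF69344 | II | MHVM | 1324 |
| 14 | AAP92675 | III | IBVBJ | 1169 |
| 15 | AAS00080 | III | IBVC | 1169 |
| 16 | NP 040831 | III | IBV | 1162 |
| 17 | AAS10463 | SARS_CoVs | GD03T0013 | 1255 |
| 18 | AAU93318 | SARS_CoVs | PC4127 | 1255 |
| 19 | AAV49720 | SARS_CoVs | PC4137 | 1255 |
| 20 | AAU93319 | SARS_CoVs | PC4205 | 1255 |
| 21 | AAU04646 | SARS_CoVs | civet007 | 1255 |
| 22 | AAU04649 | SARS_CoVs | civet010 | 1255 |
| 23 | AAV91631 | SARS_CoVs | A022 | 1255 |
| 24 | AAP51227 | SARS_CoVs | GD01 | 1255 |
| 25 | AAS00003 | SARS_CoVs | GZ02 | 1255 |
| 26 | AAP30030 | SARS_CoVs | BJ01 | 1255 |
| 27 | AAP50485 | SARS_CoVs | FRA | 1255 |
| 28 | AAP41037 | SARS_CoVs | TOR2 | 1255 |
| 29 | AAQ01597 | SARS_CoVs | TaiwanTC1 | 1255 |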

Table 1 shows the sample set of seven beta globin protein sequences from seven species. Their lengths range from 121 to 147. This sample set was used previously in [12]. Table 2 shows the sample set of nine ND5 protein sequences. Their lengths range from 602 to 610. This sample set was used previously in [12–25]. Table 3 shows the sample set of 29 spike protein sequences. Their lengths range from 1162 to 1447. These viruses are coronaviruses. They are classified into four classes. Class I includes the porcine epidemic diarrhea virus (PEDV) and the transmissible gastroenteritis virus (TGEV). Class II includes the bovine coronavirus (BCoV), the human coronavirus OC43 (HCoV-OC43), and the murine hepatitis virus (MHV). Class III contains the infectious bronchitis virus (IBV). The others are severe acute respiratory syndrome coronaviruses (SARS-CoV). This sample set was used previously in [26].

3. The Adjacency Vector

In this approach, a new vector is proposed as a descriptor of a protein sequence. This vector is called the adjacency vector (Axy), where x refers to the species’ protein sequence and y refers to its group. It counts the occurrences of all possible pairwise adjacencies obtained by reading the protein primary sequence from left to right. A protein sequence is composed of the 20 common amino acids “A,” “R,” “N,” “D,” “C,” “Q,” “E,” “G,” “H,” “I,” “L,” “K,” “M,” “F,” “P,” “S,” “T,” “W,” “Y,” and “V,” ordered alphabetically by three-letter code. Therefore, the adjacency vector (Axy) consists of 20 × 20 = 400 elements, 20 for each amino acid. The first 20 elements belong to the pairs that begin with “A,” the second 20 to the pairs that begin with “R,” the third 20 to the pairs that begin with “N,” and so on in the order given above. The idea is borrowed from the 20 × 20 adjacency matrix [27].
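The layout of the 400 elements can be made concrete with a small Python sketch. It assumes 0-based positions and that the block of 20 elements for each amino acid is ordered by the second residue of the pair, which matches the column headers AA, AR, ..., AV shown for the first 20 elements in Table 4. The names `pair_index` and `pair_at` are hypothetical helpers, not part of the article.

```python
# Amino acid order as listed in the text (alphabetical by three-letter code).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_index(pair):
    """Return the 0-based position of a two-letter pair in the 400-element vector."""
    first, second = pair
    return 20 * INDEX[first] + INDEX[second]


def pair_at(position):
    """Inverse of pair_index: the pair stored at a 0-based position."""
    return AMINO_ACIDS[position // 20] + AMINO_ACIDS[position % 20]


# First block (pairs starting with "A") occupies positions 0..19,
# the block for "R" starts at 20, and "VV" is the last element.
print(pair_index("AA"), pair_index("AV"), pair_index("RA"), pair_index("VV"))
print([pair_at(i) for i in range(380, 400)])  # the last 20 elements: VA ... VV
```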

The adjacency vector counts the occurrences of each pair. In other words, it counts the number of times that each pair is repeated along the sequence. If a pair does not occur, its value in the adjacency vector is zero. As an example, consider the adjacency vectors of two short segments of a yeast (Saccharomyces cerevisiae) protein [16, 19, 22–24, 28]:

Protein I: “WTFESRNDPAKDPVILWLNGGPGCSSLTGL”

Protein II: “WFFESRNDPANDPIILWLNGGPGCSSFTGL”

Each of the two protein sequences is composed of 30 amino acids. Reading from left to right, Protein I is converted to the 29 adjacent pairs WT, TF, FE, ES, SR, RN, ND, DP, PA, AK, KD, DP, PV, VI, IL, LW, WL, LN, NG, GG, GP, PG, GC, CS, SS, SL, LT, TG, GL. Protein II is converted to the 29 adjacent pairs WF, FF, FE, ES, SR, RN, ND, DP, PA, AN, ND, DP, PI, II, IL, LW, WL, LN, NG, GG, GP, PG, GC, CS, SS, SF, FT, TG, GL. For example, the “ND” pair has a count of one in Protein I and two in Protein II. The “DP” pair has a count of two in both Protein I and Protein II. The “SL” and “LT” pairs each have a count of one in Protein I and zero in Protein II.
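The worked example above can be checked with a short Python sketch. It is an illustrative implementation under the same indexing assumption as the text (20 elements per first residue, in the listed amino acid order); the function names are hypothetical, and residues outside the 20 standard letters raise a `KeyError` because the article leaves ambiguous and nonstandard residues to future work.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def adjacency_vector(sequence):
    """Count every adjacent residue pair, reading left to right, into 400 slots."""
    vector = [0] * 400
    for first, second in zip(sequence, sequence[1:]):
        vector[20 * INDEX[first] + INDEX[second]] += 1
    return vector


def pair_count(vector, pair):
    """Look up the count stored for a two-letter pair."""
    return vector[20 * INDEX[pair[0]] + INDEX[pair[1]]]


protein_1 = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
protein_2 = "WFFESRNDPANDPIILWLNGGPGCSSFTGL"
a1 = adjacency_vector(protein_1)
a2 = adjacency_vector(protein_2)

# Each 30-residue segment yields 29 adjacent pairs in total.
print(sum(a1), sum(a2))
for pair in ("ND", "DP", "SL", "LT"):
    print(pair, pair_count(a1, pair), pair_count(a2, pair))
```

Because only pair counts are stored, the vector always has 400 elements regardless of sequence length.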

Our approach is applied to three selected groups of protein sequences: beta globin, ND5, and spike protein sequences, as listed in Tables 1, 2, and 3, respectively. The most common protein sequences are selected in each group. The selected sample is assumed to be unbiased, and the population distribution of each group is assumed to be normal. Therefore, the three selected samples can represent the three groups. The samples consist of seven beta globin, nine ND5, and 29 spike protein sequences.

Seven adjacency vectors for the beta globin proteins, nine for the ND5 proteins, and 29 for the spike proteins are evaluated. For example:
(1) The first 20 elements of the adjacency vector (Axy) of the human beta globin sequence are shown in Table 4.
(2) The last 20 elements of the adjacency vector (Axy) of the gorilla ND5 sequence are shown in Table 5.


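Table 4: First 20 elements of the adjacency vector of the human beta globin sequence.

| AA | AR | AN | AD | AC | AQ | AE | AG | AH | AI | AL | AK | AM | AF | AP | AS | AT | AW | AY | AV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 2 |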


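Table 5: Last 20 elements of the adjacency vector of the gorilla ND5 sequence.

| VA | VR | VN | VD | VC | VQ | VE | VG | VH | VI | VL | VK | VM | VF | VP | VS | VT | VW | VY | VV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 2 | 0 | 0 | 1 |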

4. The Group Representative Vector

The adjacency vector describes each protein sequence individually within its group. This article also provides a descriptor for the group itself. The median vector is selected to play the role of the group representative (GRy), where y refers to the group. It acts as a reference vector for each group. The median is chosen as the measure of central tendency because it separates the higher half of the sample’s data from the lower half and, unlike the average, is not sensitive to extreme values.

The proposed group representative vector (GRy) is also composed of 400 elements. Each element is the median of the corresponding elements of all adjacency vectors in the sample that represents the group. The representative vectors of the beta globin, ND5, and spike protein sequences are computed. For example:
(1) 20 elements of the beta globin representative vector (GRbeta globin) are shown in Table 6.
(2) The last 20 elements of the ND5 representative vector (GRND5) are shown in Table 7.
(3) 20 elements of the spike protein representative vector (GRspike) are shown in Table 8.


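Table 6: 20 elements of the beta globin representative vector.

| AA | AR | AN | AD | AC | AQ | AE | AG | AH | AI | AL | AK | AM | AF | AP | AS | AT | AW | AY | AV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |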


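Table 7: Last 20 elements of the ND5 representative vector.

| VA | VR | VN | VD | VC | VQ | VE | VG | VH | VI | VL | VK | VM | VF | VP | VS | VT | VW | VY | VV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.5 | 1 | 0 | 3.5 | 0 | 1 | 3.5 | 0 | 0 | 1 |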


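Table 8: 20 elements of the spike protein representative vector.

| AA | AR | AN | AD | AC | AQ | AE | AG | AH | AI | AL | AK | AM | AF | AP | AS | AT | AW | AY | AV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 1 | 3 | 5 | 2 | 4 | 3 | 6 | 0 | 7 | 6 | 2 | 1 | 3 | 6 | 4 | 7 | 1 | 5 | 3 |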
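The element-wise median that defines the group representative vector can be sketched in Python as follows. This is a minimal illustration rather than the authors' Mathematica code: the toy sequences are hypothetical and are not taken from the article's data sets, and `group_representative` is a hypothetical name.

```python
from statistics import median

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def adjacency_vector(sequence):
    """Count adjacent residue pairs into a 400-element vector."""
    vector = [0] * 400
    for first, second in zip(sequence, sequence[1:]):
        vector[20 * INDEX[first] + INDEX[second]] += 1
    return vector


def group_representative(vectors):
    """Element-wise median of equal-length adjacency vectors."""
    return [median(column) for column in zip(*vectors)]


# Hypothetical toy group (not from Tables 1-3), chosen so the medians are easy to follow.
toy_group = ["AAVAA", "AAVA", "AVAVAA"]
vectors = [adjacency_vector(s) for s in toy_group]
representative = group_representative(vectors)

# Pair counts per sequence: AA = 2, 1, 1; AV = 1, 1, 2; VA = 1, 1, 2.
# Positions: AA -> 0, AV -> 19, VA -> 380.
print(representative[0], representative[19], representative[380])
```

The median is taken column by column, so one sequence with an unusually high count for some pair does not shift the representative the way an average would.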

5. Similarity/Dissimilarity Analysis

A similarity/dissimilarity vector is introduced in place of the usual similarity/dissimilarity matrix [10, 11]. The similarity/dissimilarity matrix is a square symmetric matrix with zeros on its main diagonal. To evaluate this matrix, the degree of similarity between each protein sequence and every other sequence in the same group must be measured. If one row represents human and another represents gorilla, the similarity of all species to human is measured along the human row, then the similarity of all species to gorilla is measured along the gorilla row, and so on. The number of calculations for this matrix therefore grows quadratically with n, where n is the number of compared species.

The similarity/dissimilarity vector is proposed to save time and reduce the number of calculations. The number of its elements equals the number of protein sequences in the selected sample of each group. It measures the degree of similarity between each protein sequence’s adjacency vector and the group representative vector. In other words, it measures the degree of similarity between each protein’s descriptor and the “group representative.” It is simpler than the matrix and is calculated only once for each sequence. The number of calculations for this vector equals n, where n is the number of compared species.
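The difference in workload can be illustrated with a small Python sketch. It counts distance evaluations for a symmetric pairwise matrix, taking each unordered pair once, against the n evaluations needed for the vector. Counting unordered pairs is an assumption of this sketch; the article may count the matrix entries differently. The two-element toy vectors and the chosen representative are hypothetical.

```python
import math
from itertools import combinations


def euclidean_distance(a, b):
    """Magnitude of the difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(vectors):
    """Symmetric matrix with a zero diagonal; each unordered pair is evaluated once."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    evaluations = 0
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = euclidean_distance(vectors[i], vectors[j])
        evaluations += 1
    return matrix, evaluations


def distance_vector(vectors, representative):
    """One evaluation per sequence against the group representative."""
    distances = [euclidean_distance(v, representative) for v in vectors]
    return distances, len(distances)


# Seven hypothetical toy descriptors, mirroring the size of the beta globin sample.
toy_vectors = [[i, i + 1] for i in range(7)]
toy_representative = [3, 4]  # element-wise median of the toy vectors

matrix, matrix_evaluations = distance_matrix(toy_vectors)
vector, vector_evaluations = distance_vector(toy_vectors, toy_representative)
print(matrix_evaluations, vector_evaluations)
```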

To measure the degree of similarity, we suggest two methods:

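(i) The 1st Method. Evaluate the magnitude of the difference between each protein sequence’s adjacency vector (Axy) and the group representative vector (GRy) of its sample, as in

$$D_{xy} = \left\lVert A_{xy} - GR_{y} \right\rVert$$

(ii) The 2nd Method. Compute the angle, in radians, between each sequence’s adjacency vector (Axy) and the group representative vector (GRy) by

$$\theta_{xy} = \cos^{-1}\left(\frac{A_{xy} \cdot GR_{y}}{\lVert A_{xy} \rVert \, \lVert GR_{y} \rVert}\right)$$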

For the beta globin protein sequences, seven species are selected in our sample set: human, chimpanzee, gorilla, mouse, rat, gallus, and opossum, as listed in Table 1. There are seven adjacency vectors corresponding to them. The group representative is evaluated from these seven adjacency vectors. Therefore, the similarity/dissimilarity vector has seven elements. The first element corresponds to human, the second to chimpanzee, and so on, in the same order as in Table 1. In a similar manner, the ND5 and spike similarity/dissimilarity vectors have nine and 29 elements, corresponding to the sequences listed in Tables 2 and 3, respectively.
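The two measures and the resulting similarity/dissimilarity vector can be sketched in Python as below. This is an illustration under stated assumptions: it uses the plain Euclidean distance and the plain angle between unscaled vectors, and any normalization or scaling the authors may have applied before tabulating their values is not reproduced. The three-element toy descriptors and all function names are hypothetical.

```python
import math
from statistics import median


def euclidean_distance(a, g):
    """1st method: magnitude of the difference between descriptor and representative."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, g)))


def angle_between(a, g):
    """2nd method: angle in radians between descriptor and representative."""
    dot = sum(x * y for x, y in zip(a, g))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_g = math.sqrt(sum(y * y for y in g))
    if norm_a == 0 or norm_g == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_g)))
    return math.acos(cosine)


def similarity_vector(vectors):
    """One (distance, angle) entry per sequence, measured against the group median."""
    representative = [median(column) for column in zip(*vectors)]
    return [
        (euclidean_distance(v, representative), angle_between(v, representative))
        for v in vectors
    ]


# Hypothetical three-element descriptors standing in for 400-element adjacency vectors.
toy_vectors = [[2, 0, 1], [1, 1, 1], [0, 2, 1]]
result = similarity_vector(toy_vectors)
for distance, theta in result:
    print(round(distance, 4), round(theta, 4))
```

In this toy group the second descriptor coincides with the median, so both of its measures are zero, while the other two lie at equal distance and equal angle from the representative.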

The similarity/dissimilarity vectors of the beta globin, ND5, and spike protein sequences, computed with the two methods described above, are given in Tables 9, 10, and 11, respectively.


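Table 9: Similarity/dissimilarity vector of the beta globin sequences.

| No. | Species | Dxy | θxy (rad.) |
|---|---|---|---|
| 1 | Human | 0.5568 | 0.3657 |
| 2 | Chimpanzee | 0.5568 | 0.4098 |
| 3 | Gorilla | 0.5568 | 0.4185 |
| 4 | Mouse | 0.8602 | 0.6047 |
| 5 | Rat | 0.9165 | 0.6251 |
| 6 | Gallus | 1.0536 | 0.7480 |
| 7 | Opossum | 1.1136 | 0.7955 |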


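Table 10: Similarity/dissimilarity vector of the ND5 sequences.

| No. | Species | Dxy | θxy (rad.) |
|---|---|---|---|
| 1 | Pigmy chimpanzee | 1.2530 | 0.2218 |
| 2 | Common chimpanzee | 1.3191 | 0.2357 |
| 3 | Human | 1.3856 | 0.2517 |
| 4 | Gorilla | 1.3892 | 0.2547 |
| 5 | Fin Whale | 1.5395 | 0.3006 |
| 6 | Blue Whale | 1.5459 | 0.3003 |
| 7 | Mouse | 2.0372 | 0.3873 |
| 8 | Rat | 2.1517 | 0.4130 |
| 9 | Opossum | 2.3367 | 0.4659 |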


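Table 11: Similarity/dissimilarity vector of the spike protein sequences.

| No. | Abbreviation | Class no. | Dxy | θxy (rad.) |
|---|---|---|---|---|
| 1 | TGEVG | I | 4.5266 | 0.4793 |
| 2 | TGEV | I | 4.5266 | 0.4793 |
| 3 | PEDVC | I | 4.1413 | 0.4473 |
| 4 | PEDV | I | 4.1413 | 0.4473 |
| 5 | HCoVOC43 | II | 3.7537 | 0.4299 |
| 6 | BCoVE | II | 3.7377 | 0.4203 |
| 7 | BCoVL | II | 3.7550 | 0.4233 |
| 8 | BCoVM | II | 3.7216 | 0.4198 |
| 9 | BCoVQ | II | 3.7216 | 0.4203 |
| 10 | MHVA | II | 3.7095 | 0.4395 |
| 11 | MHVJHM | II | 4.1183 | 0.4728 |
| 12 | MHVP | II | 3.5651 | 0.4240 |
| 13 | MHVM | II | 3.7014 | 0.4406 |
| 14 | IBVBJ | III | 3.9699 | 0.5002 |
| 15 | IBVC | III | 3.8936 | 0.4863 |
| 16 | IBV | III | 4.1243 | 0.5188 |
| 17 | GD03T0013 | SARS-CoVs | 1.9824 | 0.2439 |
| 18 | PC4127 | SARS-CoVs | 2.0075 | 0.2473 |
| 19 | PC4137 | SARS-CoVs | 2.0224 | 0.2491 |
| 20 | PC4205 | SARS-CoVs | 2.0099 | 0.2476 |
| 21 | civet007 | SARS-CoVs | 2.0469 | 0.2519 |
| 22 | civet010 | SARS-CoVs | 2.0125 | 0.2478 |
| 23 | A022 | SARS-CoVs | 2.0518 | 0.2526 |
| 24 | GD01 | SARS-CoVs | 1.9824 | 0.2445 |
| 25 | GZ02 | SARS-CoVs | 1.9723 | 0.2433 |
| 26 | BJ01 | SARS-CoVs | 1.9570 | 0.2413 |
| 27 | FRA | SARS-CoVs | 2.0125 | 0.2481 |
| 28 | TOR2 | SARS-CoVs | 1.9949 | 0.2458 |
| 29 | TaiwanTC1 | SARS-CoVs | 1.9875 | 0.2449 |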

The results in Table 9 show that the magnitude (Dxy, where x is the species) cannot measure the degree of similarity/dissimilarity well among all beta globin sequences. Human, chimpanzee, and gorilla have the same value, 0.5568, while the similarity between mouse and rat is measured well. The dissimilarity between opossum and human is also very clear. The angle (θxy) successfully measures the similarity/dissimilarity among all the species, as shown in Figure 1. Closer values of Dxy or of θxy indicate greater similarity.

The results in Table 10 show that both the magnitude (Dxy) and the angle (θxy) measure the degree of similarity/dissimilarity well among the ND5 protein sequences, as shown in Figure 2. It is clear that pigmy chimpanzee, common chimpanzee, human, and gorilla are very similar. The results also show the similarity of the blue whale and fin whale, and of the mouse and rat, as pairs, as well as the dissimilarity between human and opossum. These results are consistent with [13, 14, 16, 18, 19, 21–25].

The results in Table 11 show that both Dxy and θxy classify the three classes of viruses and the SARS-CoVs well, each as a single coherent class, with the sole exception of the “MHVJHM” virus. This virus belongs to Class II, but our approach does not classify it well. The classification of the 29 spike proteins into classes by our approach is illustrated in Figure 3. The MHVJHM virus is the only misclassified sequence and is colored red. Despite the misclassification of the MHVJHM virus, our approach corrects the broken classification of Class I in [26].

According to the results in Tables 9, 10, and 11, the angle is the preferred measure, as shown in Figures 1, 2, and 3.

6. Cross-Group Comparison

The group representative vector (GRy) carries the information of its group. A cross-group comparison is performed to demonstrate the individuality of each group. Tables 9, 10, and 11 are evaluated by comparing each group’s sample set of protein sequences with its own group representative vector. Tables 12, 13, 14, and 15 are evaluated by comparing each group’s sample set with the representative vector of another group. The similarity/dissimilarity analysis of the seven beta globin sequences measured against (GRND5) is given in Table 12 and shown in Figure 4. The analysis of the ND5 sequences measured against (GRbeta globin) is given in Table 13 and shown in Figure 5. The analysis of the beta globin sequences measured against (GRspike) is given in Table 14 and shown in Figure 6. The analysis of the ND5 sequences measured against (GRspike) is given in Table 15 and shown in Figure 7. The results show a large distortion, which confirms the individuality of each group.


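Table 12: Cross-group comparison of the beta globin sequences.

| No. | Species | Dxy | θxy |
|---|---|---|---|
| 1 | Human | 1.38564 | 0.251674 |
| 2 | Chimp | 4.71593 | 1.20638 |
| 3 | Gorilla | 1.38924 | 0.254656 |
| 4 | Mouse | 2.03715 | 0.387323 |
| 5 | Rat | 2.15174 | 0.41301 |
| 6 | Gallus | 4.53211 | 1.08994 |
| 7 | Opossum | 2.33666 | 0.465884 |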


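Table 13: Cross-group comparison of the ND5 sequences.

| No. | Species | Dxy | θxy |
|---|---|---|---|
| 1 | Pigmy chimp | 5.16914 | 1.20525 |
| 2 | Common chimp | 5.14101 | 1.18598 |
| 3 | Human | 5.12348 | 1.19282 |
| 4 | Gorilla | 5.07346 | 1.1745 |
| 5 | Fin whale | 4.82286 | 1.16274 |
| 6 | Blue whale | 4.86621 | 1.17307 |
| 7 | Mouse | 5.12445 | 1.2454 |
| 8 | Rat | 5.07346 | 1.23689 |
| 9 | Opossum | 4.81768 | 1.23466 |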


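Table 14: Beta globin sequences measured against (GRspike).

| No. | Species | Dxy | θxy |
|---|---|---|---|
| 1 | Human | 6.02661 | 0.839369 |
| 2 | Chimp | 7.52463 | 1.06606 |
| 3 | Gorilla | 6.1 | 0.852902 |
| 4 | Mouse | 6.18789 | 0.869323 |
| 5 | Rat | 6.18466 | 0.8689 |
| 6 | Gallus | 7.44849 | 1.04124 |
| 7 | Opossum | 6.32614 | 0.896635 |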


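Table 15: ND5 sequences measured against (GRspike).

| No. | Species | Dxy | θxy |
|---|---|---|---|
| 1 | Pigmy chimp | 6.07207 | 0.847581 |
| 2 | Common chimp | 5.99667 | 0.833859 |
| 3 | Human | 6.02661 | 0.839369 |
| 4 | Gorilla | 6.1 | 0.852902 |
| 5 | Fin whale | 6.00083 | 0.833717 |
| 6 | Blue whale | 5.97244 | 0.828506 |
| 7 | Mouse | 6.18789 | 0.869323 |
| 8 | Rat | 6.18466 | 0.8689 |
| 9 | Opossum | 6.32614 | 0.896635 |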
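The cross-group comparison in this section amounts to scoring each group's descriptors against every group's representative. The Python sketch below illustrates the idea on two hypothetical toy groups of three-element vectors; the group names, the data, and the function names are assumptions made for illustration and do not reproduce the values in Tables 12 to 15.

```python
import math
from statistics import median


def angle_between(a, g):
    """Angle in radians between a descriptor and a representative vector."""
    dot = sum(x * y for x, y in zip(a, g))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_g = math.sqrt(sum(y * y for y in g))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_g)))
    return math.acos(cosine)


def representative(vectors):
    """Element-wise median of a group's descriptors."""
    return [median(column) for column in zip(*vectors)]


def cross_group_angles(groups):
    """Map (sequence group, representative group) to the list of angles."""
    reps = {name: representative(vectors) for name, vectors in groups.items()}
    return {
        (seq_group, rep_group): [angle_between(v, rep) for v in vectors]
        for seq_group, vectors in groups.items()
        for rep_group, rep in reps.items()
    }


# Hypothetical toy groups with clearly different pair-usage profiles.
toy_groups = {
    "group_x": [[5, 1, 0], [4, 2, 0], [6, 1, 0]],
    "group_y": [[0, 1, 5], [0, 2, 4], [1, 1, 6]],
}

angles = cross_group_angles(toy_groups)
for key, values in angles.items():
    print(key, [round(v, 3) for v in values])
```

In this toy setting every sequence lies at a much smaller angle from its own group's representative than from the other group's, which is the kind of separation the cross-group tables are meant to exhibit.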

7. A Qualitative Comparison between Our Results and the Phylogenetic Tree of Protein Sequences

The phylogenetic tree is a branching diagram showing the evolutionary relationships among various biological species based upon similarities and differences in their sequences. A qualitative comparison between our results and the phylogenetic tree of protein sequences is used to prove the utility of our approach. The matching between the results and phylogenetic trees means matching with the naïve measure of sequence similarity (sequence homology).

The Basic Local Alignment Search Tool (BLAST) is used to draw the phylogenetic trees. The phylogenetic trees of the seven beta globin sequences, the nine ND5 sequences, and the 29 spike protein sequences are shown in Figures 8, 9, and 10, respectively.

The qualitative comparison of the results in Tables 9, 10, and 11 with Figures 8, 9, and 10 shows the utility of our work, especially for the angle results.

8. Conclusion

The proposed method is an alignment-independent method. An adjacency vector is proposed as a descriptor of any protein sequence. It does not require any graphical representation. A group representative vector is introduced to represent each group of protein sequences. A similarity/dissimilarity vector is produced instead of the usual similarity/dissimilarity matrix. The similarity/dissimilarity analysis is carried out by two methods. Our approach is applied to three sample sets from three groups of protein sequences. Each sample has a different range of lengths from the others. Our approach does not depend on protein sequence length, and it successfully measured similarity/dissimilarity among sequences of different lengths. It is mathematically very simple. A cross-group comparison is introduced to demonstrate the individuality of each group. The results confirm the utility of our approach when compared with previous articles and with the phylogenetic tree obtained by the BLAST program.

9. Future Work

We hope to extend the method to handle ambiguous amino acid residues and nonstandard amino acids. We also hope to include the analysis of partial or gapped sequences.

Data Availability

All data are described in Section 2 of the manuscript under the title “Dataset.” In that section, the data are listed in three tables: Tables 1, 2, and 3. The first paragraph of the dataset section also states that the data are downloaded from “Gene Bank.” All data files have the extension “.fasta”.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Supplementary Materials

It is a figure which summarizes our approach. It is submitted under the name of Graphical Abstract. (Supplementary Materials)

References

  1. C. Yu, M. Deng, and S. S.-T. Yau, “DNA sequence comparison by a novel probabilistic method,” Information Sciences, vol. 181, no. 8, pp. 1484–1492, 2011.
  2. X. Yang and T. Wang, “Linear regression model of short k-word: a similarity distance suitable for biological sequences with various lengths,” Journal of Theoretical Biology, vol. 337, pp. 61–70, 2013.
  3. Q. Dai, X. Guo, and L. Li, “Sequence comparison via polar coordinates representation and curve tree,” Journal of Theoretical Biology, vol. 292, pp. 78–85, 2012.
  4. S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
  5. T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
  6. O. Gotoh, “An improved algorithm for matching biological sequences,” Journal of Molecular Biology, vol. 162, no. 3, pp. 705–708, 1982.
  7. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
  8. D. J. Lipman and W. R. Pearson, “Rapid and sensitive protein similarity searches,” Science, vol. 227, no. 4693, pp. 1435–1441, 1985.
  9. S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915–10919, 1992.
  10. M. Randic, J. Zupan, A. T. Balaban, and D. V. Topic, “Graphical representation of proteins,” Chemical Reviews, vol. 111, pp. 790–862, 2011.
  11. X. Jin, Q. Jiang, Y. Chen et al., “Similarity/dissimilarity calculation methods of DNA sequences: a survey,” Journal of Molecular Graphics and Modelling, vol. 76, pp. 342–355, 2017.
  12. C. Li, X. Yu, L. Yang, X. Zheng, and Z. Wang, “3-D maps and coupling numbers for protein sequences,” Physica A: Statistical Mechanics and its Applications, vol. 388, no. 9, pp. 1967–1972, 2009.
  13. P.-A. He, X.-F. Li, J.-L. Yang, and J. Wang, “A novel descriptor for protein similarity analysis,” Match: Communications in Mathematical and in Computer Chemistry, vol. 65, pp. 445–458, 2011.
  14. A. El-Lakkani and S. El-Sherif, “Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices,” Chemical Physics Letters, vol. 590, pp. 192–195, 2013.
  15. W. Hou, Q. Panb, Q. Peng, and M. He, “A new method to analyze protein sequence similarity using dynamic time warping,” Genomics, vol. 109, pp. 123–130, 2017.
  16. J. Wen and Y. Zhang, “A 2D graphical representation of protein sequence and its numerical characterization,” Chemical Physics Letters, vol. 476, pp. 281–286, 2009.
  17. H. Hu, Z. Li, H. Dong, and T. Zhou, “Graphical representation and similarity analysis of protein sequences based on fractal interpolation,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 1, pp. 182–192, 2017.
  18. L. Wang, H. Peng, and J. Zheng, “ADLD: a novel graphical representation of protein sequences and its application,” Computational and Mathematical Methods in Medicine, vol. 2014, Article ID 959753, 15 pages, 2014.
  19. H. Wu, Y. Zhang, W. Chen, and Z. Mu, “Comparative analysis of protein primary sequences with graph energy,” Physica A: Statistical Mechanics and its Applications, vol. 437, pp. 249–262, 2015.
  20. Y. Li, Q. Liu, X. Zheng, and P. He, “UC-curve: a highly compact 2D graphical representation of protein sequences,” International Journal of Quantum Chemistry, vol. 114, no. 6, pp. 409–415, 2014.
  21. P. A. He, Y. P. Zhang, Y. H. Yao, Y. F. Tan, and X. Y. Nan, “The graphical representation of protein sequences based on the physicochemical properties and its applications,” Journal of Computational Chemistry, vol. 31, pp. 2136–2142, 2010.
  22. H. Hu, “F-curve, a graphical representation of protein sequences for similarity analysis based on physicochemical properties of amino acids,” MATCH Communications in Mathematical and in Computer Chemistry, vol. 73, no. 3, pp. 749–764, 2015.
  23. D. Sun, C. Xu, and Y. Zhang, “A novel method of 2D graphical representation for proteins and its application,” MATCH - Communications in Mathematical and in Computer Chemistry, vol. 75, no. 2, pp. 431–446, 2016.
  24. M. I. Abo El Maaty, M. M. Abo-Elkhier, and M. A. Abd Elwahaab, “3D graphical representation of protein sequences and their statistical characterization,” Physica A: Statistical Mechanics and its Applications, vol. 389, no. 21, pp. 4668–4676, 2010.
  25. Y.-P. Zhang, Y-J. Sheng, W. Zheng, P.-A. He, and J.-S. Ruan, “Novel numerical characterization of protein sequences based on individual amino acid and its application,” BioMed Research International, vol. 2015, Article ID 909567, 8 pages, 2015.
  26. P. Ping, X. Zhu, and L. Wang, “Similarities/dissimilarities analysis of protein sequences based on PCA-FFT,” Journal of Biological Systems, vol. 25, no. 01, pp. 29–45, 2017.
  27. M. Randić, M. Novič, and M. Vračko, “On novel representation of proteins based on amino acid adjacency matrix,” SAR and QSAR in Environmental Research, vol. 19, no. 3-4, pp. 339–349, 2008.
  28. Y. Yao, F. Kong, Q. Dai, and P. He, “A sequence-segmented method applied to the similarity analysis of long protein sequence,” MATCH: Communications in Mathematical and in Computer Chemistry, vol. 70, pp. 431–450, 2013.

Copyright © 2019 Marwa A. Abd Elwahaab et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

