International Journal of Biomedical Imaging

International Journal of Biomedical Imaging / 2011 / Article
Special Issue

Mathematical Methods for Images and Surfaces 2011

View this Special Issue

Research Article | Open Access

Volume 2011 |Article ID 918978 | 15 pages | https://doi.org/10.1155/2011/918978

Protein Surface Characterization Using an Invariant Descriptor

Academic Editor: Guowei Wei
Received06 Jul 2011
Accepted14 Aug 2011
Published22 Nov 2011

Abstract

Aim. To develop a new invariant descriptor for the characterization of protein surfaces, suitable for various analysis tasks, such as protein functional classification, and search and retrieval of protein surfaces over a large database. Methods. We start with a local descriptor of selected circular patches on the protein surface. The descriptor records the distance distribution between the central residue and the residues within the patch, keeping track of the number of particular pairwise residue cooccurrences in the patch. A global descriptor for the entire protein surface is then constructed by combining information from the local descriptors. Our method is novel in its focus on residue-specific distance distributions, and the use of residue-distance co-occurrences as the basis for the proposed protein surface descriptors. Results. Results are presented for protein classification and for retrieval for three protein families. For the three families, we obtained an area under the curve for precision and recall ranging from 0.6494 (without residue co-occurrences) to 0.6683 (with residue co-occurrences). Large-scale screening using two other protein families placed related family members at the top of the rank, with a number of uncharacterized proteins also retrieved. Comparative results with other proposed methods are included.

1. Introduction

The Protein Data Bank (http://www.pdb.org/pdb/home/home.do) (PDB) currently has more than 3000 protein structures classified as uncharacterized or as proteins of unknown function. This is about 5% of the total structures in PDB. The Pfam database was recently reported to contain over 2200 gene families with unknown function [1]. It has been argued that there are even more local regions on the protein structures that are not completely characterized, and whose functions are not known [2]. Therefore, with the increasing rate at which protein structures are being generated, the problem of protein function annotation has become a major challenge in the postgenomic era [35]. The function of a given protein is largely determined by its three-dimensional structure [6]. The specific shape and orientation of a protein in 3D space are key elements that determine how the protein interacts with its environment, and hence the function of the protein. Although related proteins often have similar functions, it is well known that sequence similarity between proteins does not always lead to functional similarity [7, 8]. Even different functions have been observed for structures with the same fold [9]. Conversely, sequences have been observed with low sequence similarity, but highly structural and functional similarity [10]. The trypsin-like catalytic triad [9] is one example of proteins with different folds, but similar functions. A similar argument can be made between sequence and surface, and between surface and fold. While residues on the protein surface typically make up a small percentage of the total residues in a protein, they often represent the most conserved functional elements of the protein [11]. Therefore, analyzing protein structures using information about their 3D surfaces is essential in the quest for protein function annotation, especially in the study of functional similarities between nonhomologous proteins.

At the core of most activities in the analysis of protein structures and protein function is similarity measurement between structures. Such measurements must deal with different levels of structural similarity, arbitrary mutations, deletions, and insertion of residues, local surface similarities, and so forth. When the problem is similarity measurement between protein surfaces, a major issue becomes how the protein surface is represented, and how the representation can be used for the required similarity measurement. Another problem is that of computation. Structure alignment, the basis for most approaches to protein 3D structure analysis is known to be NP-hard [12]. A major difficulty in comparing protein surfaces locally is the problem of matching 3D structures, since structures need to undergo an exhaustive amount of rotation and translation in order to obtain an adequate structural alignment and to perform an accurate matching [8]. Clearly, a method that avoids the step of local structural alignments can have a significant advantage, especially in screening of similar surfaces over a large database.

In this paper, we introduce an invariant descriptor for the characterization of protein surfaces. We then use this characterization to study the problem of classifying proteins into their functional families based primarily on their surface characteristics. This is a challenging problem, but one that is important in the quest for functional annotation of proteins, using information from potentially nonhomologous proteins. We also show how we can use such a descriptor in various related analysis activities, such as in effective retrieval of similar protein surfaces from very large databases, such as the Protein Data Bank (PDB).

2.1. Protein Sequences, Structure, and Surface

Although proteins could vary significantly in their functions and 3D shapes, they also share a general common structure. Proteins are composed of 20-amino acids that are connected via peptide bonds [13]. Each protein is composed of an ordered sequence of amino acids. The order in which these amino acids are connected is called the protein sequence, or the primary structure of the protein [14]. This primary sequence determines the 3D structure of the protein. All proteins are composed of four common structural types: primary structure, secondary structure, tertiary structure, and quaternary structure. The primary structure is simply the amino acid sequence. The secondary structure is formed by patterns of intermolecular bonding of hydrogen and is determined primarily by the location and the directions of these patterns [14, 15]. This is often described in terms of secondary structural elements (SSEs), such as α-helixes, β-sheets, and turns. The overall 3D shape of the secondary structures determines the tertiary structure of the protein. When two or more chains combine to form a larger molecule, the whole structure is called the quaternary structure. Figure 1 shows an example of some of the common protein structural types (the sequence is not included).

A common method for protein function prediction is by annotation transfer from known homologous proteins [17]. Functions of novel proteins can be determined by sequence comparisons, for instance using sequence alignment. When proteins evolve, the protein structure remains more highly conserved when compared to the sequence. Protein sequences change more easily during evolution due to residue mutations, for instance by substitution, insertion, or deletion. Hence, proteins that belong to the same family (homologous proteins) may not be identified using sequences alone. Orengo et al. [17] reported that proteins related to the same family could share fewer than 15% identical residues. The protein structure retains a significant portion of similarity even between distant homologs. In general, the degree of structural or sequence similarity varies substantially between protein families. Some families can handle more changes than others. This so-called structural plasticity [17] has a considerable impact on the functionality of some proteins, or members of a protein family. A consideration of the protein structure and its variability becomes important in such situations for further analysis of functional similarity between proteins.

A classical approach for deriving the protein function is by first determining its 3D structure, which can then provide some ideas about its function [17]. Protein 3D structures provide information about the binding sites, active sites, and how proteins interact with each other, and thus could provide an insight into the function of the protein [17]. How proteins interact with each other and with other molecules (e.g., ligands) is determined primarily by the amino acids on the protein surface [18]. Therefore, knowledge of the protein surface residues could help in a better understanding of what molecules are binding together, and in some cases, why they bind [18]. The protein surface could also provide significant information about protein functions which cannot be easily detected, even in the presence of sequence or fold similarity. Therefore, the analysis of protein surfaces is important in the study of intermolecular interactions. Clearly, advances in our understanding of protein surfaces could have important implications in various biomedical fields, such as personalized medicine, drug discovery, drug design, and so forth.

2.2. Protein Surface Characterization Methods

Given the foregoing, it is not surprising that different methods have been proposed to characterize the protein surface. Popular examples include those based on surface shape distributions [19], Gauss integral [20], Fourier transform [21], spherical harmonics [22, 23], alpha-shapes [2, 24], and Zernike polynomials [7]. Contact maps between protein surfaces were studied in [25], while similarity networks between surface patches from protein binding sites were studied in [26, 27]. Protein surface similarity using varying resolutions of structural data have also been studied, for instance, using medium-resolution Cryo-EM maps in [26] and low resolution protein structure data in [28]. SHARP [29] provides a mechanism to predict protein-protein interaction by analyzing overlapping protein 3D surface regions. SURFACE [5] is a database of protein surface regions that can be useful for annotation.

Much earlier, Jones and Thornton [30] analyzed protein-protein interaction by using surface patches, where patches are defined based on the Cα atoms that have a predetermined accessible surface area, and adhere to defined constraints on the solvent vectors. Each patch is then described using six parameters, namely, solvation potential, residue interface propensity, hydrophobicity, planarity, protrusion, and accessible surface area. Ferrè et al. [5] analyzed locally similar structures by matching surface patches composed of subsets of amino acids. Each residue on the protein surface is represented using a vector joining its Cα atom and the centroid of its side chain atoms. Surface patches are then compared for similarity by comparing the residue vectors for all possible pairs of residues from the query and target surface patch. Matches are determined based on the root mean square distance, and the residue similarity as determined using a standard substitution matrix. The results of using this method on a nonredundant list of protein chains as recorded in the SURFACE database [5], a collection of protein surface regions that can be useful for annotation. Below, we describe three approaches that are more closely related to our work. See [28, 31] for reviews on surface comparison methods.

Distance Distributions
Distances, geometry, and topology have for long been used in the analysis of general protein 3D structures [32]. Residue distances have been used in standard texture-based analysis of 2D textures (distance matrices) formed by the distances between residues in a protein structure [33]. The use of topological invariants, as captured using Gaus integrals for the automated analysis and representation of general protein 3D structures was described in [20]. Much earlier, Connolly [19] proposed the analysis of protein surfaces using the notion of surface shape distributions. Essentially, surface shapes correspond to different geometric configurations defined on the protein surface. Binkowski and Joachimiak [11] proposed the use of surface shape signatures (SSSs) as a method to describe protein surfaces by exploiting global shape and geometrical properties of the surfaces. Shape signatures are computed based on the distances measured between each unique atom pairs on the surface. Distances are then sorted based on which their distributions are generated. With the distributions, the problem of matching between two surfaces is now reduced to that of comparing their distributions. Comparison between two distributions is performed using the Kolmogorov-Smirnov (KS) test. The use of the shape distribution is fast and relatively resilient to scale, rotation, and mirroring. However, the discrimination ability is still a problem, as the SSS tends to lose important surface details.

Zernike Polynomials
Following earlier work by Canterakis [34] on the use of 3D Zernikes for the analysis of general 3D objects, Sael et al. [7] introduced 3D Zernike to the area of protein structural similarity matching. Here, the protein 3D structure is represented as a series expansion of 3D Zernike functions. The triangulated Connolly surface of the protein is computed, and subsequently the protein is placed into a 3D cubic grid and voxelized. Each voxel has a value of 1 or 0, depending on whether the voxel is on the protein surface or in the interior. The 3D Zernike function is then applied to the voxelized 3D protein shape to obtain the 3D Zernike descriptors. Therefore, the problem of comparison of 3D surfaces is reduced to that of comparing two vectors representing the 3D Zernike descriptors for each protein surface. Several distance measures were tried, such as the Euclidean distance, Manhattan distance, and a correlation-based distance defined as the complement of the correlation coefficient between two Zernike descriptors. Venkatraman et al. [23] studied the use of both spherical harmonics and 3D Zernike descriptors in the retrieval of functionally similar proteins. In a more recent work, Sael and Kihara [28] used the Zernike descriptor to study protein surfaces in low resolution data. Computation of the required Zernike polynomials is, however, known to be a major computational huddle [35]. This problem is even worse for the 3D Zernike polynomials needed for protein surfaces. Thus, the required preprocessing before matching is performed may be a problem for indexing and real-time search of large-scale datasets.

Fingerprints
A recent work [8] used the idea of extracting invariant fingerprints from patches on the protein surface. Patches are obtained by generating the dot surface of the protein and constructing a graph to approximate the protein surface. Afterwards, circular patches are generated as a contiguous surface area from a center point, where the radius of the patch is within a predetermined cutoff. Patches are created for each single point on the surface, after which a fingerprint representation of the patch is computed as a geodesic distance-dependent distribution of directional curvature. Geodesic distances are computed from the central vertex in each patch. Comparisons between fingerprints were performed using the average fingerprint similarity score (AFSS) and the direct fingerprint similarity score (DFSS). Final scores are computed after an alignment procedure based on the AFSS. Clearly, computational complexity will be a major problem here, especially given the computation of the patch representation for each vertex on the surface graph (number of vertices is much more than the number of surface residues). The need for a later stage of alignment for the final computation of matching scores only compounds the computational burden (see [12], for example).

The key difference in our method is the use of the local patch descriptors as defined by the distribution of distances between Cα atoms within each surface patch, conditioned on the specific residue at the center of the patch, and the particular residues found within the patch. Our method computes the residue-specific distance distributions, and residue-distance cooccurrences for the protein surface patches using only the Cα atoms on the protein surface. Residues in the interior of the protein are discarded. Unlike the approach in [8], we avoid the time complexity of generating a graph representation of the surface before the surface can be scanned to generate the patches and then compute the distance distribution. Further, ours does not depend on the time-consuming process of initial surface alignment.

3. Methods

We present an invariant descriptor for characterizing protein surfaces. We start with a local descriptor of selected circular patches on the protein surface. For a given surface patch, the local descriptor is computed based on the residue distances from the center of the patch. The descriptor records the distance distribution between the central residue and the residues within the patch, keeping track of the number of particular pairwise residue cooccurrences in the patch. A global descriptor for the entire protein surface is then constructed from the local descriptors by combining information from local descriptors with similar central residues. The proposed descriptor is invariant to rotations of the surface and mirroring.

Using a fixed patch size, we obtain a descriptor for the protein surface, independent of the size of the protein structure. Thus, the descriptor can facilitate the rapid matching of protein chains, and will eliminate the need for the exhaustive alignment of the protein 3D structures. For a given protein structure or protein chain from a database, such as the PDB, the proposed method can be summarize in the following steps:(1)generate the Connolly surface [36] for the protein chain;(2)generate the surface patches and compute the local invariant descriptor for each patch on the surface;(3)compute the global invariant surface descriptor for the protein chain, by combining information from the local patch descriptors; (4)perform surface matching and comparison using the descriptors; (5)classify the protein into its potential functional family, or perform protein surface retrieval using the invariant descriptors.

Figure 2 shows a schematic diagram of the general approach. The method has been applied on three protein families: uracil-DNA glycosylase, estrogen receptor, and cell division protein kinase 2. These are the same protein families used in a recently published work [8]. We also tested on epidermal growth factor (EGF) and cyclooxygenase-2 (COX-2), two protein families that are known to play a role in cancer. Below, we provide more details on the steps enumerated above.

3.1. Surface Generation

For a given protein, we first generate its Connolly surface [36] at a given atomic radii, using the MSMS program [37], based on which the dot surface is generated. This dot surface is stored in a vertex file. We have used a probe radius of 1.4 Å in all our experiments. Next, MATLAB Bioinformatics Toolbox (Mathworks Inc, Natick, Mass, USA) was used to extract the protein chains and to generate the residue coordinates in each chain. In this step, the chains are extracted while preserving the coordinates of the Cα atoms and their respective residue types by extracting the information from the PDB and the vertex files.

3.2. The Invariant Descriptor
3.2.1. Surface Patches

To capture protein structure similarity and to avoid the computational complexity and the time-consuming problem of aligning 3D protein structures, we propose the use of a global rotational-invariant descriptor to represent overlapping patches on the protein surface. A patch is defined as a circular region with a specified radius, centered on the Cα position of a surface residue. For each residue on the surface (the central residue), we construct a surface patch by recording its residue type, and consider all residues within a certain distance threshold ( ) as part of the patch (see Figure 2). Thus, the proposed surface descriptor is composed of 20 distinct descriptions, one for each protein residue type. For the local descriptor, this is constructed from only information from the patch. For the global descriptor, this is constructed by combining information from patches with the same central residue.

The local invariant descriptor for the patch is created by calculating the distribution of distances between the central residue and all other surface residues within the patch. Additionally, the residue cooccurrences within the patch are also recorded as a part of the local descriptor. Each local descriptor is represented in a matrix of size , where the rows correspond to the 20 distinct protein residue types, plus an extra row to describe the summary distance distribution within the patch. The columns represent the individual bins used to capture the distance distributions (total of bins), plus an extra column to represent the summary of the residue cooccurrences. To reduce the computational time and space requirement, unlike in [8] we define patches only for surface residue positions, rather than for each vertex on the dot surface (the number of vertexes is much more than the number of residues). Therefore, for a given chain, the number of local invariant descriptors will be equal to the number of surface residues. Yet, this number can vary from tens to hundreds and sometimes to thousands of surface residues. Using a huge number of local invariant descriptors for one chain to perform matching will be very time-consuming. To further reduce the computational requirements, for a given chain, we compute a global rotational-invariant descriptor by combining the 20 distinct residue-specific descriptors. For a given residue type, the global descriptor is constructed by taking the average of all patch descriptors with a given residue type as the central residue (see Figure 2). We consider three ways to represent and use the global surface descriptor, as explained below.

3.2.2. Distance Distribution (DD2)

The basic idea of using the distance distribution is that similar functional proteins should have a similar distribution of distances between the residues on their surfaces. The patch descriptor captures the distribution in two forms. The first form is a detailed distance distribution between the central residue in the surface patch and each of the other residues on the patch. To achieve this, a uniform distribution of the distances is assumed and the total number of bins is used to estimate the probability distribution of finding a pair of residues at one of the ranges. The second form is the global distance probability distribution. In this form we estimate the probability of observing any given residue within a patch in a particular distance range from the central residue. In this paper, we study the use of the global distance distribution in identifying similar protein surfaces, and possibly proteins with similar functions. Consequently, the question to be answered is, given a central residue of a specific type, what is the distance distribution for the residues around this central residue? That is, we seek , the probability of observing distance between a central residue of type and any other residue. We expect that the distance distribution should be similar for surface patches from functionally similar proteins.

3.2.3. Residue Cooccurrences (RCs)

Given that surface structures are more conserved than sequence over evolution [15, 17], we expect that functionally similar proteins are likely to have similar surface residues, even though the order of such residues may have changed. This intuition is captured using residue cooccurrences on the protein surface. Using the distance distribution globally provides an idea of how the distances from the central residue are distributed in the protein surface patch. However, there is no constraint on, or indication of, which residues are involved in the formation of these distributions. The co-occurrence of a given residue with the central residue is calculated as the number of times the residue occurs on a patch with the same central residue. Thus, the main problem would be to find the probability of observing residue say, , given a central residue, say . Again, we expect the probability , to be similar for protein surfaces from functionally similar proteins. We note that the surface co-occurrence does not depend on the specific distance between the residues involved, as far as is within the patch.

3.2.4. Distance-Residue Cooccurrences (DRCs)

The above have considered the distance and the co-occurrence separately. The DRC combines the general distance distribution (represented as a row vector, in matrix ) and the residue cooccurrences (represented as a column vector, in matrix ) in describing the protein surface (see Figure 2). The residue-distance co-occurrence vector is defined as follows: , where is the concatenation operator and stands for the transpose of . is used to compute the conditional probability , that is, the probability of observing the distance between residues and given that is the central residue in the patch. We expect that the residue co-occurrence ( , or RC) should carry more distinctive functionally relevant information than the general distance distribution ( , or DD2), since surface residue cooccurrences are likely to be more conserved over evolution. By combining both vectors, we can account for both the geometry of the protein surface and the distribution of specific residues within specific distances on the surface. Using both vectors brings in some biological relevance in the analysis and is likely to lead to improved results in the identification of functionally similar protein surfaces.

3.3. Matching and Classification

Given two proteins, say Protein 1 and Protein 2 we characterize them using their global descriptors, say and respectively. In this work, the global descriptor could be the distance distribution (DD2), residue cooccurrences (RCs), or the distance-residue cooccurrences (DRCs).

Distance Distribution
For matching using the distance distribution we create a vector that is composed of the 20 global distance distributions represented by all vectors from each descriptor. is defined as , where are the distance distributions from each residue type on the surface of Protein 1. Repeat the same process for Protein 2 to create . Then we perform matching using the simple Euclidean distance: .

Residue Cooccurrences
For Protein 1 we create a vector that combines the 20 residues co-occurrence vectors (denoted ), defined as , where represents , , and . Similarly, we compute . Matching is performed using the Euclidean distance between and .

Distance-Residue Cooccurrences
Here, we create a vector that is comprised of all of the distance distributions as well as the residue cooccurrences. For Protein 1, we have . Similarly we obtain DRC2 for Protein 2. Again for simplicity, matching is performed using the Euclidean distance. Clearly, other distance measures could be used.

Classification
Having computed the surface descriptors and the distance between protein surfaces using the descriptors, one may be interested in determining whether a given unknown protein belongs to some known protein family. Using some training data, we can compute surface descriptors for the known family, and based on these perform the required classification. Classification is performed using Weka [38, 39], an open-source software for machine leaning that provides a suite of classification algorithms.

4. Results and Discussion

4.1. Datasets and Environment

We performed experiments to test the performance of the proposed protein surface descriptor in two protein structure analysis tasks, namely, classifying proteins into their most likely functional groups, and ranking and retrieval of protein surfaces. We used two datasets for the experiments. DATASET-A contained information from three protein families: uracil-DNA glycosylase, cell division protein kinase 2, and estrogen receptor. This was created by scanning the PDB and selecting the protein structures with protein chains belonging to one of the three families. We were able to extract 416 chains that belong to 243 proteins in the PDB. The dataset is distributed as follows: 91 chains from 46 distinct proteins for uracil-DNA glycosylase (Group1), 186 chains from 95 distinct proteins for estrogen receptor family (Group2), and 139 chains from 102 distinct proteins from cell division protein kinase 2 (Group3). We used DATASET-A basically to train the system, and perform initial testing. DATASET-B contained protein structures from two families, namely cyclooxygenase-2 (COX-2) (51 proteins, 95 chains) and epidermal growth factor (EGF) (67 proteins, 71 chains). We then extracted protein structures from the PDB that have 10 or less chains and ignored the rest. This resulted in a total of 15,386 protein chains form 6,261 unique proteins. DATASET-B included all structures in DATASET-A. We used DATASET-B for a more comprehensive scan of the PDB, in the quest for potentially novel structures that may be related to the two families. Experiments were performed using a SONY VAIO personal computer, with Intel Core 2 Duo Processor T8100, running at 2.10 GHz, with 2 GB of main memory. Programs were written using Matlab (Mathworks Inc, Natick, Mass, USA) with the Bioinformatics Toolbox. We set probe radius = 1.4 Å and patch distance threshold  Å. For distance distributions, we used a fixed number of bins, . Classification was performed based on algorithms implemented in Weka [38, 39] version 3-6-4.

4.2. Classification Performance

We divide DATASET-A into training and testing sets and apply different classifiers on the different descriptors proposed. In all our experiments, the training sets were kept very separate from the testing sets, with no overlap between the two. Classification performance is measured in terms of classification rate based on the three protein families in the dataset. We tested the method using various classifiers implemented in Weka, such as Naïve Bayes, logistic regression, and simple logistic classifier. We report results mainly for the logistic regression. First, we explore the impact of the size of the testing set and of the training set on the classification performance using the proposed approach. We varied the size of the training set (from 50 to 300), while keeping the size of the testing set fixed. We then checked the performance using fixed testing sets of size 100, 200, and 300. Figure 3 shows the results.

The figure shows that applying the distance distribution (DD2) alone resulted in the lowest performance accuracy as compared to using the residue cooccurrences (RCs) or distance-residue cooccurrences (DRCs). Yet, our definition of the distance distribution shows encouraging results. A steady improvement in performance with increasing training set size can be observed when using DD2 alone, peaking at about 87% with a training size of 200 and testing size of 100. The distinctiveness of our approach is the use of residue cooccurrences on the protein surface. This approach assumes that functionally similar surface proteins have similar residue cooccurrences within a small local surface region. Figure 3 (middle plot) shows that classification using residue cooccurrences (RCs) provided a significant improvement in the classification rate. A similar improvement was observed using other classifiers, such as Naïve Bayes. Using the RC descriptor, we can achieve an accuracy rate of 94% using a small training set (50 samples) and six times larger testing set (300 chains). This shows the robustness of the residue cooccurrences, even when using a few training samples. We observe that the performance using DD2 was not as robust (about 81% using small training set, peaking at about 87% using 200 training samples).

The use of distance-residue co-occurrence presents a steadier improvement in the classification rate. Using the DRC raised the accuracy rate to 99% using the simple logistic classifier on a training set of 150 and testing set of 100 (data not shown). We can observe the significant difference between the results of DD2 (which did not use information on residue cooccurrences) and RC and/or DRC (both of which used residue cooccurrences). Figure 4 shows a corresponding performance measurement with varying size of the testing set, while keeping the training set size fixed. As expected, there is a general slight decrease in performance with increasing size of the test set. The case of DRC using a training set size of 100 seemed to increase slightly with increasing testing set size. The increase is however within a small range (from 0.91 to 0.93). This shows a steady performance over increasing size of the testing set. Overall trends are similar to Figure 3, with RC and DRC performing much better than DD2. Similar trends were also observed using other classification algorithms. The overall classification performance is summarized in Figure 5, which shows the results of the three proposed schemes using -fold cross validation, for different values of .

4.3. Ranking and Retrieval

In this section, we explore the effectiveness of our approach on the problem of search and retrieval of protein surfaces. Given a query protein, we study whether our approach has the robustness to place most of the functionally similar proteins in the top hits of the retrieved surfaces. Here, a query protein from each of the three groups is used to screen the entire DATASET-A (416 samples) and provide a ranking based on the similarity. Thus, each protein structure is ranked against the query, (from 1 to 416), where a lower rank (smaller distance) implies more similarity to the query. After that, we search over the retrieved proteins to find which ranks the functionally similar proteins (i.e., proteins in the same functional group) have attained. Table 1 shows the ranking produced using the proposed descriptor, for three query samples, one for each group. Results are shown only for DRC. RC produced a slightly better ranking (especially for uracil-DNA glycosylase family (Group 1)), while DD2 was worse than both RC and DRC). Overall, for Group 2 and Group 3, the Top 30 ranked proteins belonged to the corresponding family, while Group 1 was more difficult.

(a) DRC on query protein 1UDI chain I (Group 1)

Protein PDB IDChainRankDistance

1UDII10
2ZHXB22.1306
1LQMB32.3509
1LQGC42.4589
2UUGC52.5353
1EUIC72.5920
2ZHXL82.6104
1UGHI102.6349
2UGIA152.6809
1UGIE162.6872
2ZHXH192.6969
2ZHXD212.7006
2ZHXN222.7017
2ZHXJ232.7141
1UGIG252.7261
1EMJA422.7758
2BOOA452.7808
1UGID472.7868
2OWRB502.7952
2J8XD612.8129
1LQGD702.8263
1Q3FA902.8533
2UUGD992.8675
1UGIA1012.8689
2OWRC1102.8749
2ZHXF1162.885
2OWQB1292.8977
1SSPE1412.9115
1UGIC1422.9117
2ZHXA1472.915

(b) DRC on query protein 1QKN chain A (Group 2)

Protein PDB IDChainRankDistance

1QKNA10
2J7XA21.3769
2J7YA31.5753
1QKMA41.6793
1NDEA51.7368
2GIUA61.7371
1L2IA71.7460
1U3RB81.7670
3ERDA91.7683
3OS9A101.7715
2IOGA111.7738
3LTXC121.7742
1YIMA131.7854
3ERTA141.7966
1U3QD151.8009
1YY4A161.8090
2OUZA171.8126
1YINA181.8152
1XP6A191.8260
2AYRA201.8269
3OS8D211.8311
2QH6A221.8312
3OSAA231.8367
1L2JA241.8385
2JJ3A251.8438
1G50A261.8490
3OS8A271.8509
2FSZA281.8518
2QGWA291.8604
1UOMA301.8683

(c) DRC on query protein 1YKR chain A (Group 3)

Protein PDB IDChainRankDistance

1YKRA10
2UZOA21.0396
2R3OA31.0810
3PXYA41.0846
3PY1A51.0940
2WMAA61.1389
2IW6A71.1609
3NS9A81.2043
3IGGA91.2141
2WFYA101.2179
2C5YA111.2270
3DDPA121.2280
2J9MA131.2284
2R3JA141.2374
2R3LA151.2402
3PXRA161.2422
2DUVA171.2534
1W8CA181.2586
3DOGA191.2793
2V22A201.2822
2R3PA211.2883
2V22C221.2963
3IG7A231.3207
2JGZA241.3275
2R64A251.3303
2WHBA261.3381
2VTNA271.3458
3LFNA281.3476
2WIPA291.3514
2BKZA301.3580

We further measured the performance of our approach using the enrichment plot. The enrichment plot essentially measures how well a given ranking or retrieval system performs, when compared with a random selection of the data samples. At a given percentage of database screening, the enrichment factor is computed as the ratio , where of functionally similar proteins observed or retrieved by the system, and of functionally similar proteins expected by random selection. For an effective system, we expect that most of the functionally similar proteins should be observed after a small percentage of screening. That is, the top hits should contain mainly functionally similar proteins, and hence the enrichment factor should be high after a small percentage probe of the database, and gradually decrease towards 1 (which corresponds to random selection). Figure 6(a) shows a plot of the average enrichment factor using 5 queries from Group 3. The enrichment plot shows that our proposed method provides better results as we screen a small percentage of the dataset. In most of the cases, our method retrieved about three times better than the expected random retrieval in the first 10% of screened proteins. As we increase the percent of screening, the retrieval degrades, since we are more likely to have retrieved most, if not all of the similar proteins after a small percentage of the screening. Thus, subsequent retrievals will lead to spurious results.

4.4. Screening Protein Surfaces in PDB

Encouraged by the results in classification and ranking using the proposed descriptors, we now performed a larger scale experiment, by screening the entire protein structures in PDB, using the protein chains in DATASET-B, with members of the COX-2 and EGF families as the query. The main objective was to see how the proposed descriptors will perform on a large scale, and to see if the methods could predict potentially novel functional linkages between any of the families and other proteins in PDB. For this task, we used only PDB files with 10 or less chains, and ignored the rest. This resulted in a total of 15,386 protein chains from 6,261 unique proteins. Table 2(a) shows the ranking results produced by screening the PDB files based on the proposed descriptors, using a member of the EGF family as a query. Table 2(b) shows corresponding results using a member of COX-2 family. Results are shown only for the DRC descriptor. Generally, similar results were obtained using RC. We can notice that some of the unknown proteins (annotated as “uncharacterized”) were placed in the Top-50 rankings, implying a possible relationship with the respective families.

(a) Top 50 hits using DRC for a query protein structure from the EGF family on DATASET-B. Annotations in bold correspond to members of the EGF family, predicted proteins, or uncharacterized proteins

ProteinChainDistanceProtein name annotationRank

2a2qL0.0000COAGULATION FACTOR VII1
2firL1.4925COAGULATION FACTOR VII LIGHT CHAIN2
2zp0L1.5545FACTOR VII LIGHT CHAIN3
1wtgL1.5628COAGULATION FACTOR VII4
1wunL1.5844COAGULATION FACTOR VII5
2b8oL1.6305COAGULATION FACTOR VII LIGHT CHAIN6
2zwlL1.6379FACTOR VII LIGHT CHAIN7
2zzuL1.6536FACTOR VII LIGHT CHAIN8
1wqvL1.6816COAGULATION FACTOR VII9
2ec9L1.6832COAGULATION FACTOR VII10
1danL1.7655BLOOD COAGULATION FACTOR VIIA11
1wssL1.7659COAGULATION FACTOR VII12
2puqL1.7692COAGULATION FACTOR VII13
1fakL1.7934PROTEIN (BLOOD COAGULATION FACTOR VIIA)14
2b7dL1.8024COAGULATION FACTOR VII15
6acnA1.8061ACONITASE16
2aerL1.8120COAGULATION FACTOR VII17
2aeiL1.8164COAGULATION FACTOR VII18
2flrL1.8196COAGULATION FACTOR VII19
3elaL1.8668COAGULATION FACTOR VII LIGHT CHAIN20
1z6jL1.8859COAGULATION FACTOR VII21
2f9bL1.9027COAGULATION FACTOR VII22
3phsA1.9169CELL WALL SURFACE ANCHOR FAMILY PROTEIN23
3n54B1.9263SPORE GERMINATION PROTEIN B324
3qbpB1.9264FUMARASE FUM25
3ma9L1.9367TRANSMEMBRANE GLYCOPROTEIN26
3mt0A1.9921UNCHARACTERIZED PROTEIN PA178927
3lguA2.0004PROTEASE DEGS28
3m7iA2.0071TRANSKETOLASE29
1qfkL2.0096PROTEIN (COAGULATION FACTOR VIIA (LIGHT CHAIN))30
3n9tA2.0169PNPC31
3pxzA2.0235CELL DIVISION PROTEIN KINASE 232
2flbL2.0257COAGULATION FACTOR VII33
3lh1A2.0289PROTEASE DEGS34
3nlcA2.0300UNCHARACTERIZED PROTEIN VP095635
3no5C2.0302UNCHARACTERIZED PROTEIN36
3msqC2.0347PUTATIVE UBIQUINONE BIOSYNTHESIS PROTEIN37
3rykA2.0389DTDP-4-DEHYDRORHSMNOSE 3,5-EPIMERASE38
3m4aA2.0444PUTATIVE TYPE I TOPOISOMERASE39
3n3nB2.0444CATALASE-PEROXIDASE40
2R3GA2.0456CELL DIVISION PROTEIN KINASE 241
2R3IA2.0465CELL DIVISION PROTEIN KINASE 242
3o0rL2.0473ANTIBODY FAB FRAGMENT LIGHT CHAIN43
3n3pB2.0490CATALASE-PEROXIDASE44
3nfhA2.0492DNA-DIRECTED RNA POLYMERASE I SUBUNIT RPA4945
3qfkA2.0501UNCHARACTERIZED PROTEIN46
3n5hF2.0517FARNESYL PYROPHOSPHATE SYNTHASE47
3n3oB2.0529CATALASE-PEROXIDASE48
3o78B2.0548CHIMERA PROTEIN OF PEPTIDE OF MYOSIN LIGHT CHAIN SMOOTH MUSCLE, GREEN FLUORESCENT PROTEIN, GREEN FLUORESCENT CALMODULIN49
3luyA2.0570PROBABLE CHORISMATE MUTASE50

(b) Top 50 hits using DRC for a query protein structure from the COX-2 family on DATASET-B. Annotations in bold correspond to members of the COX-2 family, predicted proteins, or uncharacterized proteins

ProteinChainDistanceProtein name annotationRank

2zxwB0.0000CYTOCHROME C OXIDASE SUBUNIT 21
2eilB1.3914CYTOCHROME C OXIDASE SUBUNIT 22
2eijB1.5853CYTOCHROME C OXIDASE SUBUNIT 23
3ag4B1.6147CYTOCHROME C OXIDASE SUBUNIT 24
2dysB1.6991CYTOCHROME C OXIDASE SUBUNIT 25
3ag1B1.8159CYTOCHROME C OXIDASE SUBUNIT 26
2eimB1.8824CYTOCHROME C OXIDASE SUBUNIT 27
3ag2B2.0173CYTOCHROME C OXIDASE SUBUNIT 28
2occB2.0631CYTOCHROME C OXIDASE9
3ablB2.1404CYTOCHROME C OXIDASE SUBUNIT 210
2eikB2.1413CYTOCHROME C OXIDASE SUBUNIT 211
3abmB2.1454CYTOCHROME C OXIDASE SUBUNIT 212
1v55B2.1489CYTOCHROME C OXIDASE POLYPEPTIDE II13
2dyrB2.2044CYTOCHROME C OXIDASE SUBUNIT 214
2einB2.2628CYTOCHROME C OXIDASE SUBUNIT 215
1v54B2.2976CYTOCHROME C OXIDASE POLYPEPTIDE II16
3n56B2.4226INSULIN-DEGRADING ENZYME17
3abkB2.4502CYTOCHROME C OXIDASE SUBUNIT 218
3p42A2.4785PREDICTED PROTEIN19
3r2uB2.4846METALLO-BETA-LACTAMASE FAMILY PROTEIN20
3msuB2.4920CITRATE SYNTHASE21
3ag3B2.4950CYTOCHROME C OXIDASE SUBUNIT 222
3ntdB2.5052FAD-DEPENDENT PYRIDINE NUCLEOTIDE-DISULPHIDE OXIDOREDUCTASE23
3ngiA2.5055DNA POLYMERASE24
7ximB2.5198D-XYLOSE ISOMERASE25
3mjyA2.5206DIHYDROOROTATE DEHYDROGENASE26
3nvaB2.5391CTP SYNTHASE27
3lm3A2.5433UNCHARACTERIZED PROTEIN28
3ppnB2.5511GLYCINE BETAINE/CARNITINE/CHOLINE-BINDING PROTEIN29
3o98B2.5557BIFUNCTIONAL GLUTATHIONYLSPERMIDINE SYNTHETASE/AM30
3pomB2.5558RETINOBLASTOMA-ASSOCIATED PROTEIN31
3nt6B2.5656FAD-DEPENDENT PYRIDINE NUCLEOTIDE-DISULPHIDE OXIDOREDUCTASE32
5lymB2.5690LYSOZYME33
3n1yB2.5728TOLUENE O-XYLENE MONOOXYGENASE COMPONENT34
1occB2.5772CYTOCHROME C OXIDASE35
3lxtD2.5860GLUTATHIONE S TRANSFERASE36
2q70B2.5922ESTROGEN RECEPTOR37
3l49B2.5930ABC SUGAR (RIBOSE) TRANSPORTER, PERIPLASMIC SUBSTRATE-BINDING SUBUNIT38
3pvqA2.5944DIPEPTIDYL-PEPTIDASE VI39
3pufB2.6032RIBONUCLEASE H2 SUBUNIT B40
3mveB2.6077UPF0255 PROTEIN VV1_032841
3ld2B2.6143PUTATIVE ACETYLTRANSFERASE42
3ne6A2.6160DNA POLYMERASE43
3qaeA2.61793-HYDROXY-3-METHYLGLUTARYL-COENZYME A REDUCTASE44
3qh8A2.6197BETA-LACTAMASE-LIKE45
3m3rA2.6215ALPHA-HEMOLYSIN46
3nrbB2.6234FORMYLTETRAHYDROFOLATE DEFORMYLASE47
3n05B2.6240NH(3)-DEPENDENT NAD(+) SYNTHETASE48
3m2lA2.6324ALPHA-HEMOLYSIN49
3pnsB2.6357URIDINE PHOSPHORYLASE50

4.5. Comparison with Related Methods

The use of distance distributions for protein surface analysis was studied by Binkowski et al. [11]. As earlier discussed, they did not consider the specific residues in constructing the distributions. Their distance distribution (labeled as DD1 in this work) is obtained by removing the reference to the specific residue at the center of the patch (see Figure 2). Our use of surface residue cooccurrences and combining these with the residue-specific distance distributions are novel methods introduced in this paper. Tables 3(a) and 3(b) compare the overall classification performance using DD1 with those obtained with the proposed descriptors.

(a) Overall classification rate using different classifiers (300 training samples, 100 testing samples from DATASET-A)

ClassifierDescriptor
DD1DD2RCDRC

Naïve bayes58%86%94%91%
Logistic58%85%99%97%
Simple logistic58%89%98%91%

(b) Overall classification rate using different classifiers (100 training samples, 300 testing samples from DATASET-A)

ClassifierDescriptor
DD1DD2RCDRC

Naïve bayes55%74%94%93%
Logistic62%88%89%94%
Simple logistic63%85%91%90%

Figure 6 also shows the comparative performance using both the enrichment plots, and precision and recall. We define precision and recall at a given distance threshold as follows: precision = (number of correct retrievals at the threshold)/(number of total retrievals at the threshold). Recall = (number of correct retrievals at the threshold)/(number of total true matches expected at the threshold). Here, using the ranked results, for a given query and a given rank, the number of expected true matches will be min{rank, query group size}. This is similar to the definition used in [28]. We performed queries on DATASET-A using query proteins from each of the three groups and computed the average precision and recall for each descriptor. We then computed the area under the curve (AUC) for the average precision-recall plots. The results were as follows: DD1 (0.501052), DD2 (0.649412), RC (0.668303), and DRC (0.66759). Although the databases used are different, these results compare well with the results reported by Sael and Kihara [28], where they evaluated the retrieval performance of four surface characterization methods, based on the Zernike representation. The maximum AUC reported using standard resolution surfaces was 0.608 (without length filtering) and 0.628 (with length filtering). Yin et al. [8] proposed a fingerprint-based method, using surface alignment on selected surface patches. Their method constructs an initial patch on every vertex on the dot surface, and requires computation of geodesic distances on the surface, two very time-consuming processes. Our method neither requires surface alignment, nor expensive computations on the surface, beyond the surface generation process. Patches are generated only on positions of the surface residues, rather than over all the vertices on the generated protein surface.

4.6. Computation Time

The most time consuming part was for preprocessing, as needed to construct the protein surfaces and extract the protein chains. The construction of the protein surface from the original PDB files required about 4.065 seconds, running on Cygwin (a version of Linux for Microsoft Windows). Extraction of the protein chains and the Cα atoms was performed using Matlab Bioinformatics Toolbox (Mathworks Inc., Natick, Mass, USA), and required 32 seconds per PDB file. Construction of the descriptors after the above steps took an average of 0.7 seconds per PDB file. Querying DATASET-B (15,386 chains, 6,261 unique PDB files) using the DRC descriptor required an average time of 0.28 seconds for each query PDB file.

5. Conclusion

We have introduced a novel approach to the description and characterization of protein surfaces. The proposed approach captures the surface structure of the protein by utilizing local patches defined only on the positions of surface residues, rather than over all surface vertices, or over all the surface atoms. We make residue cooccurrences on the surface a central part of the descriptor. The novelty of this approach can be observed by the ease of handling both local and global variation on the surface (using local and global descriptors). Moving from local to global not only reduces the computational problem of matching 3D structures, but also facilitates direct comparison between protein structures of different sizes. By avoiding the construction of the complete 3D surface and retaining only the surface Cα to do the analysis, the need for surface alignment of the 3D structure is eliminated. Further, we do not need to perform any geometrical transformation to insure reliable matching. This is very important for rapid analysis over a large database, such as the PDB.

We showed results on the performance of the proposed methods in functional classification of proteins into their putative families, based on the surface information. We further compared the results using enrichment plots, and the standard measures of precision and recall. For the three protein families used, we obtained an area under the curve for precision and recall of 0.6494 (DD2), 0.6683 (RC), and 0.6676 (DRC). A screening of the PDB using COX-2 and EGF family members showed that the proposed methods ranked related family members in the Top-20 hits, with a number of uncharacterized proteins also retrieved. It will be interesting to perform further biological lab experiments to verify if any of the retrieved uncharacterized proteins are truly related to the respective families to which they share similar surfaces (as determined by our surface descriptors).

Acknowledgment

The first author was supported in part by the Ministry of Higher Education of Saudi Arabia.

References

  1. L. Jaroszewski, Z. Li, S. S. Krishna et al., “Exploration of uncharted regions of the protein universe,” PLoS Biology, vol. 7, no. 9, Article ID e1000205, 2009. View at: Publisher Site | Google Scholar
  2. Y. T. Yan and W. H. Li, “Identification of protein functional surfaces by the concept of a split pocket,” Proteins, vol. 76, no. 4, pp. 959–976, 2009. View at: Publisher Site | Google Scholar
  3. Y. Loewenstein, D. Raimondo, O. C. Redfern et al., “Protein function annotation by homology-based inference,” Genome Biology, vol. 10, no. 2, pp. 207.1–207.8, 2009. View at: Google Scholar
  4. S. Sivashankari and P. Shanmughavel, “Functional annotation of hypothetical proteins—a review,” Bioinformation, vol. 1, no. 8, pp. 335–338, 2006. View at: Google Scholar
  5. F. Ferrè, G. Ausiello, A. Zanzoni, and M. Helmer-Citterich, “SURFACE: a database of protein surface regions for functional annotation,” Nucleic Acids Research, vol. 32, pp. D240–D244, 2004. View at: Google Scholar
  6. D. Gront, A. Kolinski, and U. H. E. Hansmann, “Protein structure prediction by tempering spatial constraints,” Journal of Computer-Aided Molecular Design, vol. 19, no. 8, pp. 603–608, 2005. View at: Publisher Site | Google Scholar
  7. L. Sael, B. Li, D. La et al., “Fast protein tertiary structure retrieval based on global surface shape similarity,” Proteins: Structure, Function, and Bioinformatics, vol. 72, no. 4, pp. 1259–1273, 2008. View at: Publisher Site | Google Scholar
  8. S. Yin, E. A. Proctor, A. A. Lugovskoy, and N. V. Dokholyan, “Fast screening of protein surfaces using geometric invariant fingerprints,” Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 39, pp. 16622–16626, 2009. View at: Publisher Site | Google Scholar
  9. N. Nagano, C. A. Orengo, and J. M. Thornton, “One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions,” Journal of Molecular Biology, vol. 321, no. 5, pp. 741–765, 2002. View at: Publisher Site | Google Scholar
  10. H. H. Gan, R. A. Perlow, S. Roy et al., “Analysis of protein sequence/structure similarity relationships,” Biophysical Journal, vol. 83, no. 5, pp. 2781–2791, 2002. View at: Google Scholar
  11. T. A. Binkowski and A. Joachimiak, “Protein functional surfaces: global shape matching and local spatial alignments of ligand binding sites,” BMC Structural Biology, vol. 8, article 45, 2008. View at: Publisher Site | Google Scholar
  12. A. Godzik, “The structural alignment between two proteins: is there a unique answer?” Protein Science, vol. 5, no. 7, pp. 1325–1338, 1996. View at: Google Scholar
  13. L. K. Buehler and H. H. Rashidi, Bioinformatics Basics: Applications in Biological Science and Medicine, CRC Press, 2005.
  14. D. E. Krane and M. L. Raymer, Fundamental Concepts of Bioinformatics, Pearson Education, 2003.
  15. D. Whitford, Proteins: Structure and Function, John Wiley & Sons, West Sussex, UK, 2005.
  16. M. F. Sanner, A. J. Olson, and J. C. Spehner, “Reduced surface: an efficient way to compute molecular surfaces,” Biopolymers, vol. 38, no. 3, pp. 305–320, 1996. View at: Google Scholar
  17. C. Orengo, D. Jones, and J. Thornton, Bioinformatics Genes, Proteins, and Computers, BIOS Scientific, New York, NY, USA, 2003.
  18. C. Gibas and P. Jambeck, Developing Bioinformatics Computer Skills, O'Reilly Media, Sebastopol, Calif, USA, 2001.
  19. M. L. Connolly, “Shape distributions of protein topography,” Biopolymers, vol. 32, no. 9, pp. 1215–1236, 1992. View at: Google Scholar
  20. P. Rogen and B. Faint, “Automatic classification of protein structure by using Gauss integrals,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 1, pp. 119–124, 2003. View at: Publisher Site | Google Scholar
  21. M. J. Bayley, E. J. Gardiner, P. Willett, and P. J. Artymiuk, “A fourier fingerprint-based method for protein surface representation,” Journal of Chemical Information and Modeling, vol. 45, no. 3, pp. 696–707, 2005. View at: Publisher Site | Google Scholar
  22. R. J. Morris, R. J. Najmanovich, A. Kahraman, and J. M. Thornton, “Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons,” Bioinformatics, vol. 21, no. 10, pp. 2347–2355, 2005. View at: Publisher Site | Google Scholar
  23. V. Venkatraman, L. Sael, and D. Kihara, “Potential for protein surface shape analysis using spherical harmonics and 3d zernike descriptors,” Cell Biochemistry and Biophysics, vol. 54, no. 1–3, pp. 23–32, 2009. View at: Publisher Site | Google Scholar
  24. L. P. Albou, B. Schwarz, O. Poch, J. M. Wurtz, and D. Moras, “Defining and characterizing protein surface using alpha shapes,” Proteins, vol. 76, no. 1, pp. 1–12, 2009. View at: Publisher Site | Google Scholar
  25. J. Gramm, “A polynomial-time algorithm for the matching of crossing contact-map patterns,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 4, pp. 171–180, 2004. View at: Publisher Site | Google Scholar
  26. K. Lasker, O. Dror, M. Shatsky, R. Nussinov, and H. J. Wolfson, “EMatch: discovery of high resolution structural homologues of protein domains in intermediate resolution Cryo-EM maps,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 1, pp. 28–39, 2007. View at: Publisher Site | Google Scholar
  27. K. Park and D. Kim, “Binding similarity network of ligand,” Proteins, vol. 71, no. 2, pp. 960–971, 2007. View at: Google Scholar
  28. L. Sael and D. Kihara, “Improved protein surface comparison and application to low-resolution protein structure data,” BMC Bioinformatics, vol. 11, no. 11, article S2, 2010. View at: Publisher Site | Google Scholar
  29. Y. Murakami and S. Jones, “SHARP2: protein-protein interaction predictions using patch analysis,” Bioinformatics, vol. 22, no. 14, pp. 1794–1795, 2006. View at: Publisher Site | Google Scholar
  30. S. Jones and J. M. Thornton, “Analysis of protein-protein interaction sites using surface patches,” Journal of Molecular Biology, vol. 272, no. 1, pp. 121–132, 1997. View at: Publisher Site | Google Scholar
  31. A. Via, F. Ferrè, B. Brannetti, and M. Helmer-Citterich, “Protein surface similarities: a survey of methods to describe and compare protein surfaces,” Cellular and Molecular Life Sciences, vol. 57, no. 13-14, pp. 1970–1977, 2000. View at: Google Scholar
  32. W. R. Taylor, A. C. W. May, N. P. Brown, and A. Aszódi, “Protein structure: geometry, topology and classification,” Reports on Progress in Physics, vol. 64, no. 4, pp. 517–590, 2001. View at: Publisher Site | Google Scholar
  33. I. G. Choi, J. Kwon, and S. H. Kim, “Local feature frequency profile: a method to measure structural similarity in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 11, pp. 3797–3802, 2004. View at: Publisher Site | Google Scholar
  34. N. Canterakis, “3D zernike moments and zernike affine invariants for 3D image analysis and recognition,” in Proceedings of the 11th Scandinavian Conference on Image Analysis, pp. 85–93, 1999. View at: Google Scholar
  35. R. Mukundan and K. R. Ramakrishnan, “Fast computation of Legendre and Zernike moments,” Pattern Recognition, vol. 28, no. 9, pp. 1433–1442, 1995. View at: Publisher Site | Google Scholar
  36. M. L. Connolly, “Solvent-accessible surfaces of proteins and nucleic acids,” American Association for the Advancement of Science, vol. 221, no. 4612, pp. 709–713, 1983. View at: Google Scholar
  37. J. Smith, “Computing a triangulated surface with MSMS. In Vanderbilt University Center for Structural Biology,” 2011, http://structbio.vanderbilt.edu/comp/soft/msms/tutorial.php. View at: Google Scholar
  38. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explorations, vol. 11, no. 1, 2009. View at: Google Scholar
  39. R. R. Bouckaert, E. Frank, M. A. Hall et al., “WEKA—experiences with a Java open-source project,” Journal of Machine Learning Research, vol. 11, pp. 2533–2541, 2010. View at: Google Scholar

Copyright © 2011 Zainab Abu Deeb et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1414 Views | 625 Downloads | 6 Citations
 PDF  Download Citation  Citation
 Download other formatsMore
 Order printed copiesOrder