Abstract

Residue-residue contact prediction has become an increasingly important tool for modeling the three-dimensional structure of a protein when no homologous structure is available. The ultradeep residual neural network (ResNet) has become the most popular method for contact prediction because it captures contextual information between residues. In this paper, we propose a novel deep neural network framework for contact prediction that combines ResNet and DenseNet. The framework uses a 1D ResNet to process sequential features; besides PSSM, SS3, and solvent accessibility, we introduce a new input feature, the position-specific frequency matrix (PSFM). ResNet’s residual module and identity mapping process the sequential features effectively, after which an outer concatenation function combines the sequential and pairwise features. A final processing step using the dense connections of DenseNet further improves prediction accuracy. Results on protein contact map prediction show that our method is more effective than other popular methods, owing to the new network architecture and the added feature input.

1. Introduction

Proteins perform a wide range of cellular functions and, in most instances, their function is related to their structure. Experimental determination of protein structure is time-consuming and expensive; therefore, accurate protein structure prediction can play a vital role in understanding protein function. If a protein of interest is homologous to one whose structure has already been determined, it is possible to model the structure using the homologous protein’s structure as a template. For many proteins, however, no suitable template is available, and it is therefore necessary to develop methods that predict protein structure from the amino acid sequence alone. It has been shown that one of the most effective ways to obtain a structure in this setting is to first determine whether each pair of residues in the protein sequence is in contact.

The current prediction methods used to construct protein contact maps are divided into direct coupling analysis (DCA) methods and machine learning methods. DCA methods use multiple sequence alignments (MSAs) to determine the correlation between residue pairs. While this approach achieves good results when the target protein sequence has many homologous sequences in the protein database, the evolutionary coupling information generates “noise signals.” DCA generally uses graphical lasso and pseudo-likelihood maximization methods to solve this problem [1]. Graphical lasso estimates the graph structure from the covariance matrix using a likelihood estimation of the precision matrix with L1 regularization. Pseudo-likelihood maximization is an approximate method for probabilistic models that estimates the strength of interactions between residues. Popular DCA methods include CCMPred [2], FreeContact [3], GREMLIN [4], and PSICOV [5]. While these methods are useful for constructing protein contact maps when a large number of sequence homologs are available, their accuracy is poor when the number of homologs is low.

Machine learning methods have been widely used to make various protein predictions and can perform better than DCA methods when fewer homologous sequences are available because they can learn sequence feature relationships from a labelled dataset. The first machine learning methods used support vector machines (SVM) [6] and related methods such as SVMCon [7] and R2C [8], owing to their capacity to construct classification models. With the development of artificial neural networks, deep learning methods (including various forms of recurrent neural networks [9] and deep belief networks [10]) have become mainstream frameworks for biological prediction programs, including Betacon [11], CMAPPro [12], DeepConPred [13], NNCon [14], and MetaPSICOV [15]. RaptorX-Contact [16] and DNCON2 [17] are recently released methods that attempt to use the entire protein image as context for prediction. It should be noted that all of these methods, except for DEEPCOV [18], use one or several DCA methods as their inputs.

RaptorX-Contact [16] is one of the state-of-the-art contact predictors. Its high accuracy, as confirmed by CASP [19, 20], demonstrates the benefit of using the whole protein as context for constructing a contact map. It applies the ResNet structure for prediction, which can solve the problems of gradient disappearance and explosion owing to its identity and residual mapping characteristics, but its number of parameters is proportional to its depth. DNCON2 [17] divides its predictor into two parts. The first uses a series of intermediate convolutional neural networks to predict the contact map at five distance thresholds (6~10 Å), and the second feeds these separate predictions into another convolutional neural network to provide a final contact map at 8 Å. The PconsC4 method [21] is one of the latest contact map prediction methods and is composed of ResNet [22] and U-net [23] network structures. It can capture 1D and 2D protein features to predict the contact map. However, the feature map size in the U-net network differs between input and output and is therefore inconsistent before and after downsampling. This means that upsampling cannot entirely restore the data to its state before downsampling, which may have a negative impact on prediction accuracy.

In this paper, we present an integrated deep learning network-based approach to predict a protein contact map. We trained the network framework on a training dataset of proteins with known structures and then tested it on public datasets, including the CASP [19, 20], CAMEO [24], and membrane protein [25] datasets. In our method, sequential and pairwise features are used as input for the network structure. For the sequential features, besides the sequence profile (PSSM), predicted protein secondary structure (SS3), and solvent accessibility (ACC), we introduce a new position-specific frequency matrix (PSFM) feature, which we have found complements the PSSM features. The pairwise features include direct coevolution information generated by CCMPred, mutual information from the MSA, and pairwise potential [26]. Fusing these features effectively represents the properties of the protein sequence needed for protein contact prediction. The deep learning framework used in our method is composed of ResNet and DenseNet. This network structure integrates the identity mapping and residual mapping of ResNet with the dense connection of DenseNet. As a result, the network does not need to be very deep, and it can effectively reduce gradient disappearance, enhance feature transmission, and, to a certain extent, reduce the number of parameters. DenseNet’s input and output feature map formats remain the same, which allows for greater feature retention and thus improves the accuracy of the predicted protein contact map. Our experimental results show that our method yields better accuracy than other popular methods.

2. Models and Methods

2.1. Datasets

The dataset used is a subset of PDB25, extracted from the PDB database (http://www.rcsb.org/pdb/home/home.do), in which the sequence identity of any two proteins is less than 25%. Proteins satisfying any one of the following conditions were excluded: (1) the sequence length is less than 26 or greater than 700 amino acids, (2) the resolution is less than 2.5 Å, or (3) the protein has multiple chains. To eliminate redundancy in the dataset, we also excluded any proteins with >25% sequence identity, which left 6767 proteins. We randomly chose 6000 proteins to train the model and used the remaining proteins to validate it. To evaluate our method, we used the widely used, publicly available hard target datasets CAMEO [24], Mems400 [25], and CASP12 and CASP13 [19, 20]. In these test sets, the sequence identity between any two protein sequences was less than 25%.

2.2. Contact Map Definition

Two residues are defined to be in contact in the protein contact map [12] if the Euclidean distance between their Cβ atoms (Cα for glycine) is less than 8 Å. Contacts are divided into three categories based on the sequence separation between the two residues: (1) long-range contacts, when the separation is 24 residues or more; (2) medium-range contacts, when the separation is between 12 and 23 residues; and (3) short-range contacts, when the separation is between 6 and 11 residues.
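As an illustrative sketch of this definition (not the code used in this work), a contact map and the range category of a residue pair could be computed as follows. The function names and the coordinate layout (one Cβ position per residue, Cα for glycine) are assumptions made for the example, as is the treatment of a separation of exactly 24 residues as long-range.

```python
import math

CONTACT_THRESHOLD = 8.0  # angstroms, the cutoff used in the definition above


def contact_map(coords, threshold=CONTACT_THRESHOLD):
    """Return an L x L 0/1 matrix from one (x, y, z) position per residue.

    coords is assumed to hold the C-beta atom of each residue
    (C-alpha for glycine). A residue is not counted as its own contact.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


def contact_range(i, j):
    """Classify a residue pair by its sequence separation |i - j|."""
    sep = abs(i - j)
    if sep >= 24:   # assumption: exactly 24 counts as long-range
        return "long"
    if sep >= 12:
        return "medium"
    if sep >= 6:
        return "short"
    return None     # separations below 6 are not evaluated
```

For example, `contact_range(3, 40)` falls in the long-range category, while residue pairs fewer than 6 positions apart are left unclassified.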

A protein contact map example is shown in Figure 1. It illustrates the probability of contact between the two residues. The horizontal and vertical coordinates represent the protein sequence, with colored dots indicating the probability of contact between the two residues (range between 0 and 1, the redder the color, the higher the possibility of contact).

2.3. Feature Extraction

In this method, two types of protein features are used to predict the protein contact map: one-dimensional (sequential) features and two-dimensional (pairwise) features. The one-dimensional features include the sequence profile (position-specific scoring matrix, PSSM), 3-state protein secondary structure (SS3), and 3-state solvent accessibility (ACC). The 20-dimensional position-specific scoring matrix (PSSM) [27] is obtained by searching the uniprot_sprot database (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete) with HHblits [28] to generate sequence profiles, using three iterations and an $E$-value set to 0.001. The 3-state protein secondary structure is taken from Bi-LSTM [29], and the 3-state solvent accessibility is taken from DSPRED [30]. The two-dimensional features include direct coevolution information generated by CCMPred, mutual information from the multiple sequence alignment (MSA), and pairwise potential [26].

Because the position-specific frequency matrix (PSFM) [31] contains the frequency of amino acids in the protein sequence, we add it to the input features to complement the position-specific scoring matrix. Here, we ran the HHblits [28] program, with three iterations and an $E$-value set to 0.001, to search the uniprot_sprot database and generate the MSA, and then calculated the PSSM and PSFM from the HHblits results. With $L$ denoting the protein sequence length, the PSSM is represented by a two-dimensional matrix of size $L \times 20$, the secondary structure by a two-dimensional matrix of size $L \times 3$, the solvent accessibility by a two-dimensional matrix of size $L \times 3$, and the PSFM by a two-dimensional matrix of size $L \times 20$ (Figure 2). The one-dimensional feature of our method is then expressed by a two-dimensional matrix of size $L \times 46$. The two-dimensional feature is represented by a three-dimensional matrix of size $L \times L \times 3$. Each element of the PSFM is a target frequency, which represents how often one amino acid occurs at a specific position in the protein sequence over the course of evolution. The frequencies in each row sum to 1.
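To make the row-normalized structure of the PSFM concrete, the following minimal sketch computes raw per-column amino acid frequencies from a toy alignment. It is an assumption-laden illustration: HHblits target frequencies involve sequence weighting and pseudocounts that are not reproduced here, and the function name `psfm_from_msa` and the uniform fallback for all-gap columns are hypothetical choices.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def psfm_from_msa(msa):
    """Return an L x 20 matrix of per-column amino acid frequencies.

    msa is a list of equal-length aligned sequences. Gaps and
    non-standard letters are ignored when counting.
    """
    length = len(msa[0])
    matrix = []
    for pos in range(length):
        column = [seq[pos] for seq in msa if seq[pos] in AMINO_ACIDS]
        total = len(column)
        if total == 0:
            # Assumption: a column with no standard residues gets a uniform row.
            matrix.append([1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS))
            continue
        matrix.append([column.count(aa) / total for aa in AMINO_ACIDS])
    return matrix
```

Each row of the returned matrix sums to 1, matching the property stated above, and the matrix has one row per alignment column and one column per amino acid type.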

2.4. Prediction Model

We propose a new integrated deep learning method for predicting protein contact maps. Its neural network framework is composed of a residual neural network (ResNet) [22] and densely connected convolutional networks (DenseNet) [32].

2.4.1. Residual Neural Network

The residual neural network (ResNet) consists of a residual learning model (Figure 3), which can be defined as

$$y = F(x, \{W_i\}) + x,$$

where $x$ and $y$ are the input and output vectors of the layers under consideration, $W_i$ is the weight in the weight matrix, and $F(x, \{W_i\})$ represents the residual mapping to be learned. For an example with two layers (Figure 3), the residual mapping function is

$$F = W_2\,\sigma(W_1 x + b_1) + b_2,$$

where $\sigma$ denotes the Rectified Linear Unit activation function, and $W_1$, $b_1$, $W_2$, and $b_2$ are the weights and biases of the first layer and the second layer, respectively.
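The residual learning model can be sketched in a few lines. The example below is a simplified illustration that uses fully connected layers on plain Python lists instead of the convolution layers of the actual network; the helper names `relu`, `affine`, and `residual_block` are hypothetical.

```python
def relu(v):
    """Rectified Linear Unit applied elementwise."""
    return [max(0.0, x) for x in v]


def affine(weights, bias, v):
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """Two-layer residual unit: output = F(x) + x.

    F(x) = W2 * relu(W1 x + b1) + b2 is the residual mapping, and the
    '+ x' term is the identity shortcut.
    """
    hidden = relu(affine(w1, b1, x))
    residual = affine(w2, b2, hidden)
    return [r + xi for r, xi in zip(residual, x)]
```

With all weights and biases set to zero, the block returns its input unchanged. This is the property that lets gradients pass through the identity shortcut even when the residual branch contributes little.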

2.4.2. Densely Connected Convolutional Networks

Compared with the convolutional neural network and other deep learning methods, the residual neural network can, to a certain extent, solve the problems of gradient explosion and disappearance. However, the number of residual neural network parameters is proportional to its depth: because each layer has independent weights, the number of parameters grows as the number of layers increases. This problem can be effectively alleviated by densely connected convolutional networks (DenseNet). DenseNet consists of a dense block, a transition layer, and a bottleneck layer. The dense block (Figure 4) is composed of a multilayer network and a composite function $H_l(\cdot)$. The composite function consists of a normalization function, a linear rectification unit, and a convolution function. The $l$th layer of the network has $l$ inputs; that is, the $l$th layer receives the feature maps output by all preceding layers. Its construction formula is

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]),$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps from layer 0 to layer $l-1$.
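The dense connection pattern can be illustrated with a minimal sketch in which each feature map is a flat list of values and each composite function is an arbitrary callable. The function name `dense_block` and the toy composite function in the usage note are assumptions for illustration; in the actual network, each composite function is the normalization, rectification, and convolution sequence described above.

```python
def dense_block(x0, layer_fns):
    """Apply dense connectivity: layer l sees the outputs of layers 0..l-1.

    x0 is the input feature map (a flat list of values), and layer_fns is
    a list of callables, each mapping a concatenated list to a new list.
    Returns the concatenation of the input and every layer output.
    """
    features = [x0]
    for h in layer_fns:
        # Concatenate all earlier feature maps before applying this layer.
        concatenated = [value for fmap in features for value in fmap]
        features.append(h(concatenated))
    return [value for fmap in features for value in fmap]
```

As a usage example, with a toy composite function `lambda v: [sum(v)]` applied twice to the input `[1.0, 2.0]`, the first layer sees `[1.0, 2.0]` and the second sees `[1.0, 2.0, 3.0]`, so earlier features are reused rather than recomputed.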

The transition layer is composed of convolution and pooling layers. The bottleneck layer consists of a $1 \times 1$ convolution, which is used to reduce the number of feature maps and improve calculation efficiency.

2.4.3. Integrated Model

Our framework is based on an integrated deep neural network (Figure 5) and is composed of one-dimensional residual and densely connected convolutional networks. ResNet can, to a certain extent, solve the problems of gradient disappearance and explosion due to its identity and residual mapping characteristics and can train the deep network structure, but the number of ResNet parameters is proportional to its depth. DenseNet can effectively reduce the problem of gradient disappearance due to its dense connection characteristics and it can, to a certain extent, reuse features, thus strengthening feature transfer and reducing parameter numbers. DenseNet retains the input and output feature map formats, so it can maintain features as much as possible.

In summary, we have integrated two kinds of network structures (ResNet (Figure 3) and DenseNet (Figure 4)) and made use of each one’s advantages to improve the accuracy of the predicted protein contact map. In the data preparation stage of our framework, the sequential (one-dimensional) features are represented by a matrix of size $L \times 46$, where $L$ is the protein sequence length, and are sent into the ResNet network. The pairwise (two-dimensional) features are then combined with the output of the one-dimensional residual network, and the result is sent into the DenseNet network.

In ResNet, there are several residual learning models, each of which is composed of two convolution layers with a convolution kernel size of 3. Each convolution layer is followed by a Rectified Linear Unit activation function [33]. An outer concatenation function [16] is then applied to convert the output from two-dimensional to three-dimensional. Namely, let $v = \{v_1, v_2, \ldots, v_L\}$ be the final output of the first module, where $L$ is the protein sequence length and $v_i$ is a feature vector storing the output information for residue $i$. For a pair of residues $i$ and $j$, we concatenate $v_i$, $v_{(i+j)/2}$, and $v_j$ into a single vector and use it as one input feature of this residue pair. We then combine these vectors with the pairwise features to form the input for the second part of the network. To prevent the network from overfitting, we use dropout with an 80% dropout ratio to randomly discard neurons during training. We trained the network with an effective stochastic gradient descent optimization method and set the learning rate to 0.01. The model parameters are trained by maximum likelihood, so the loss function is defined as the negative log-likelihood function, namely, the cross-entropy function:

$$\mathrm{Loss} = -\sum_{i}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right],$$

where $y_i$ is the label and $\hat{y}_i$ is the predicted result.
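The outer concatenation step and the loss can be sketched as follows. This is a hedged illustration on plain Python lists, not the implementation used in this work: the function names are hypothetical, the midpoint index is assumed to be rounded down with integer division, and the loss is assumed to be the standard binary cross-entropy summed over residue pairs.

```python
import math


def outer_concatenation(v):
    """Turn L per-residue vectors into an L x L grid of pair vectors.

    The entry for pair (i, j) is the concatenation of the vectors for
    residue i, the midpoint residue, and residue j. The midpoint index
    (i + j) // 2 uses integer division, which is an assumption.
    """
    length = len(v)
    return [[v[i] + v[(i + j) // 2] + v[j] for j in range(length)]
            for i in range(length)]


def cross_entropy(labels, probs, eps=1e-12):
    """Binary cross-entropy (negative log-likelihood) summed over pairs.

    labels are 0/1 contact labels and probs are predicted contact
    probabilities; eps clips probabilities away from 0 and 1 so that
    the logarithm stays finite.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

If each per-residue vector has $n$ entries, every pair vector has $3n$ entries, so the sequential output becomes an $L \times L \times 3n$ array that can be stacked with the pairwise features along the channel dimension.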

2.5. Performance Evaluation

The results can be divided into four categories: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). TP refers to positive samples that are correctly predicted to be positive, FN refers to positive samples that are incorrectly predicted to be negative, FP refers to negative samples that are incorrectly predicted to be positive, and TN refers to negative samples that are correctly predicted to be negative. Based on these counts, we use the following evaluation criteria to assess the performance of our method and compare it with other methods.

Precision refers to the proportion of samples predicted to be positive by the classifier that are actually positive:

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall refers to the proportion of actual positive samples that are correctly predicted to be positive:

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

The F1 score is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
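These three criteria can be computed directly from the TP, FP, and FN counts. The short sketch below uses a hypothetical function name, and returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive counts.

    Zero denominators yield 0.0, a convention assumed for this sketch.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, 8 true positives, 2 false positives, and 8 false negatives give a precision of 0.8 and a recall of 0.5; the F1 score lies between the two and closer to the smaller value, as expected of a harmonic mean.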

3. Experimental Results

In our experiment, we use the top $L/k$ ($k = 10, 5, 2, 1$) predicted long-range contacts to evaluate the prediction accuracy of the protein contact map, where $L$ is the length of the sequence; prediction accuracy rates are also given for the three kinds of contact. To verify our model’s validity, we tested our prediction accuracy on the PDB25, CAMEO, and Mems400 datasets (http://raptorx.uchicago.edu/contactmap/), and on easy and hard targets from CASP12 and CASP13 [19, 20]. We chose typical prediction methods based on DCA and machine learning for the comparison. These state-of-the-art methods are CCMPred [2] (a DCA method), RaptorX-Contact [16] (based on double ResNet), and PconsC4 [21] (a combination of ResNet and U-net). It should be noted that the amino acid sequences in the test sets have no similarity with the training set (at the 25% identity level), which prevents any overestimation of our predictor’s performance, and that we used the same datasets for all four models.
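The top $L/k$ evaluation can be sketched as follows: rank all long-range residue pairs by predicted probability, keep the highest-scoring $L/k$ pairs, and report the fraction that are true contacts. The function name, the minimum separation of 24 for long-range pairs, and the rounding of $L/k$ down to an integer are assumptions made for this illustration.

```python
def top_l_over_k_accuracy(probs, truth, k, min_sep=24):
    """Fraction of the top L/k predicted long-range pairs that are contacts.

    probs and truth are L x L matrices (predicted probabilities and 0/1
    labels). Only pairs with j - i >= min_sep are ranked.
    """
    length = len(probs)
    pairs = [(probs[i][j], i, j)
             for i in range(length)
             for j in range(i + min_sep, length)]
    # Highest predicted probability first.
    pairs.sort(key=lambda item: item[0], reverse=True)
    top = pairs[:max(1, length // k)]
    if not top:
        return 0.0  # sequence too short to have any long-range pair
    return sum(truth[i][j] for _, i, j in top) / len(top)
```

Calling the function with `k` set to 10, 5, 2, and 1 gives the four accuracy values reported for each method; passing a different `min_sep` and an upper bound on separation would be needed to cover the medium- and short-range categories.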

To verify the effectiveness of the proposed neural network structure, we constructed three alternative network structures from different combinations of ResNet and DenseNet and compared them with our framework. The results are shown in Table 1. The prediction accuracy of our network structure is higher than that of the other three network structures.

To verify the validity of the proposed feature input, two other feature combinations (our feature combination without the PSFM feature or without the PSSM feature) were designed for experimental comparison. The results are shown in Table 2. The feature combination in the proposed method obtains better accuracy than the other two feature combinations.

The accuracy of the long-range contact predictions on the PDB25 dataset is illustrated in Figure 6, and the detailed prediction accuracies of the long-range contacts for top $L/k$ ($k = 10, 5, 2, 1$) are shown in Table 3. Compared with RaptorX-Contact, our method shows increases of 1.9%, 0.4%, and 1.8% at three of the four cutoffs; compared with PconsC4, increases of 5.5%, 4%, 5.9%, and 3.7%; and compared with CCMPred, increases of 14.5%, 12.3%, 13.8%, and 15.4%. Our prediction accuracy is therefore better than that of PconsC4 and CCMPred for long-range contacts; relative to RaptorX-Contact, it is similar at one cutoff and higher at the others. The prediction comparison for medium- and short-range contacts for top $L/k$ is also shown in Table 3.

The prediction accuracy of long-range contacts on the test set of 76 hard CAMEO targets is illustrated in Figure 7, and the detailed prediction results for the different methods on this dataset are shown in Table 4. Compared with PconsC4, our method shows increases of 4.6%, 2.5%, 2%, and 0.9% for top $L/k$ ($k = 10, 5, 2, 1$); compared with RaptorX-Contact, increases of 2%, 2%, and 1.5% at three of the four cutoffs; and compared with CCMPred, a significant increase in long-range contact accuracy at every cutoff. The prediction comparison for medium- and short-range contacts for top $L/k$ is also shown in Table 4.

For the Mems400 dataset, the long-range contact prediction accuracy is shown in Figure 8, and the detailed prediction results of the different methods are shown in Table 5. Our model’s prediction accuracy for long-range contacts for top $L/k$ ($k = 10, 5, 2, 1$) is 80.1%, 75.2%, 64.3%, and 47.1%, respectively. Compared with PconsC4, this is an increase of 4.5%, 4.4%, 4.7%, and 2.4%; compared with RaptorX-Contact, an increase of 2.1%, 2.1%, 2%, and 0.1%; and compared with CCMPred, there is also a significant increase in long-range contact accuracy for top $L/k$.

For the CASP12 dataset, we separate the long-, medium-, and short-range contact results on the hard and easy CASP12 targets, which are shown in Table 6. Performance on the easy CASP12 targets is slightly better than on the hard CASP12 targets.

The accuracy comparison of long-range contact prediction by the different methods is illustrated in Figure 9, and Table 7 shows the detailed prediction results for long-, medium-, and short-range contacts. For long-range contact prediction, our accuracy for top $L/k$ ($k = 10, 5, 2, 1$) is 64.9%, 60.1%, 51.4%, and 40.3%, respectively. Compared with PconsC4, this is an increase of 2.6%, 5.4%, 2.8%, and 0.6%; compared with RaptorX-Contact, an increase of about 1.0%, 1.2%, 1.2%, and 0.1%; and compared with CCMPred, there is also a significant increase in long-range contact accuracy for top $L/k$. For medium- and short-range contacts, most of our accuracy results for top $L/k$ are better than those of PconsC4, RaptorX-Contact, and CCMPred.

For the CASP13 dataset, we divide the CASP13 targets into hard and easy targets, which are shown in Tables 8 and 9. In addition, we separate the long-, medium-, and short-range contact results on the hard and easy CASP13 targets, which are shown in Table 10.

The accuracy of the long-range contact predictions on the CASP13 dataset is illustrated in Figure 10, and the detailed prediction accuracies of the long-range contacts for top $L/k$ ($k = 10, 5, 2, 1$) are shown in Table 11. Compared with RaptorX-Contact, our method shows an increase of 0.7%, 0.6%, 0.6%, and 0.1% for top $L/k$; compared with PconsC4 and CCMPred, there is a significant increase in long-range contact accuracy for top $L/k$. The prediction comparison for medium- and short-range contacts for top $L/k$ is also shown in Table 11. Our prediction accuracy is better than that of the RaptorX-Contact, PconsC4, and CCMPred methods.

To further analyze the performance of our network framework, we compared the predicted contacts with the real contacts for protein sequences from the test sets. Figures 11-13 compare the predicted contact maps with the true contact maps, where red (green) dots indicate correct (wrong) predictions and silver dots indicate true contacts. 5eo9B is a 206-residue protein composed of α-helices, β-sheets, and random coils from the CAMEO dataset, released on 2016-01-06; its correct (wrong) predicted contacts and true contacts are shown in Figure 11. From this figure, we can see that the great majority of predicted contacts are correct.

We analyzed the contact prediction accuracy for other proteins in the same way. 1qd6C is a 240-residue protein composed of β-sheets and random coils from the Mems400 dataset, released on 1999-10-25. Figure 12(a) shows its correct (wrong) predicted contacts and true contacts. T0944 is a 220-residue protein composed of α-helices, β-sheets, and random coils from the CASP12 dataset. Figure 12(b) shows its correct (wrong) predicted contacts and true contacts. We also included all-alpha and all-beta proteins to show the contact prediction accuracy of the proposed method. 2porA is a 301-residue all-β protein from the Mems400 dataset, and 4xmqB is a 254-residue all-α protein from the CAMEO dataset. Their correct (wrong) predicted contacts and true contacts are shown in Figures 13(a) and 13(b). It can be seen that the proposed method is suitable for the contact map prediction of all-alpha and all-beta proteins. From these examples, we find that our method correctly predicted most contacts, and these improved contact maps are useful for contact-assisted structure prediction of proteins with various structures.

4. Conclusion and Future Work

In this paper, we have presented a prediction method for constructing protein contact maps using an integrated framework with ResNet and DenseNet. This method combines ResNet’s identity and residual mapping with DenseNet’s dense connection to reduce the gradient disappearance problem, improve feature reuse, reduce the number of parameters, and capture the complex sequence-contact relationship and the correlation between features. For the input features, we have added a new position-specific frequency matrix (PSFM) feature alongside the position-specific scoring matrix (PSSM), secondary structure (SS3), and 3-state solvent accessibility (ACC). Together, these measures effectively process sequential and pairwise features to predict the contact probability between residues and improve prediction accuracy. The experimental results show that our proposed method is superior to other well-known methods. For easy implementation, all data used in this work and the source code for feature computation are available at https://github.com/lnyile/Protein-Contact-Map-Rse_Dense.

While our model’s accuracy was better than that of existing methods at most top $L/k$ cutoffs, its top $L$ accuracy was not always significantly better. Combining more effective features as inputs and constructing a new deep learning neural network framework may further improve precision. Directions for future work include using the graphical representation and structural similarity of protein sequences to construct feature vectors as input for the deep learning framework to improve our model’s predictions.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant Nos. 11671009 and 61762035 and the Zhejiang Provincial Natural Science Foundation of China under Grant No. LZ19A010002.