BioMed Research International

Volume 2015, Article ID 902198, 10 pages

http://dx.doi.org/10.1155/2015/902198

## Using Weighted Sparse Representation Model Combined with Discrete Cosine Transformation to Predict Protein-Protein Interactions from Protein Sequence

^{1}College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China

^{2}School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China

^{3}Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Suzhou, Jiangsu 215163, China

^{4}School of Electronic and Information Engineering, Soochow University, Suzhou, Jiangsu 215123, China

Received 13 August 2015; Accepted 4 October 2015

Academic Editor: Alok Sharma

Copyright © 2015 Yu-An Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Increasing demand for knowledge about protein-protein interactions (PPIs) is promoting the development of methods for predicting protein interaction networks. Although high-throughput technologies have generated considerable PPI data for various organisms, they have inevitable drawbacks such as high cost, long experimental time, and an inherently high false positive rate. For this reason, computational methods for predicting PPIs are drawing more and more attention. In this study, we report a computational method for predicting PPIs using the information in protein sequences. The main improvements come from adopting a novel protein sequence representation, which applies the discrete cosine transform (DCT) to the substitution matrix representation (SMR), and from using a weighted sparse representation based classifier (WSRC). When applied to the PPI datasets of *Yeast*, *Human*, and *H. pylori*, the method achieved average accuracies of 96.28%, 96.30%, and 86.74%, respectively, significantly better than previous methods. These promising results indicate that the proposed method is feasible, robust, and powerful. To further evaluate the proposed method, we compared it with the state-of-the-art support vector machine (SVM) classifier. Extensive experiments were also performed in which we used *Yeast* PPI samples as the training set to predict the PPIs of five other species datasets.

#### 1. Introduction

Proteins are the molecules that participate in virtually every aspect of cellular function within an organism and are responsible for the majority of the activities of living cells. Proteins rarely carry out their functions alone. For example, structural proteins need to work in pairs to shape organelles and the whole cell, and the same is true for ribosomes, RNA polymerases, and multisubunit channels in membranes. Detecting protein-protein interactions (PPIs) can provide great insight into the molecular mechanisms of biological processes and promote practical medical applications based on those mechanisms. Much effort has been devoted to identifying protein interactions using high-throughput technologies such as yeast two-hybrid (Y2H) screens [1, 2], tandem affinity purification (TAP) [3], and mass spectrometric protein complex identification (MS-PCI) [4]. However, these experimental methods are still time-consuming and expensive. In addition, they yield many false positives and can only identify a small fraction of the whole protein interaction network. For this reason, predicting unknown PPIs is now considered difficult to solve by experimental methods alone.

To make use of the experimentally obtained PPI data that are available, it is important to develop computational methods for predicting protein-protein interactions. A number of experiments depicting the PPI networks of living organisms have been completed, and a number of databases such as MINT [5], BIND [6], and DIP [7] have been built to store protein interaction data. However, the available quantities of different kinds of protein data, such as protein sequences, secondary structures, and tertiary structures, differ greatly, and protein sequence data hold a great advantage in quantity. With the exponential growth of newly discovered protein sequence data, it is increasingly important to develop computational methods using the information of amino acid sequences. Sequence-based computational approaches usually contain two steps: feature extraction and sample classification [8–13].

Feature extraction from protein sequences aims to mine the most representative attributes from the samples and to normalize protein sequences of different lengths to vectors of the same size. Efficient feature descriptors can improve the performance of the classification model [10, 14]. Until now, a number of feature extraction methods based on protein sequences have been proposed. Most of these methods are based on Chou’s pseudo amino acid composition (PseAAC) [15, 16]. PseAAC extends the simple amino acid composition (AAC) by considering and retaining sequence-order information. Different kinds of feature descriptors based on PseAAC have proved to be powerful and have become popular in protein feature extraction. Other feature extraction methods, however, take a different route based on kernels. Jaakkola et al. [17] first proposed the Fisher kernel for homology detection. Similarly, the mismatch string kernel proposed by Leslie et al. [18–20] measures sequence similarity by counting the shared occurrences of subsequences at a lower computational cost. Unlike PseAAC-based feature extraction methods, which extract features directly from protein sequences, kernel-based methods retain some kinds of prior information and therefore extract more comprehensive feature descriptors.

In this work, we employ a novel kernel-based feature extraction method using the substitution matrix representation (SMR). In the process of evolution, the protein sequences gradually alter with the action of DNA mutations from one generation to the next. Thus, in the process of extracting protein sequence features, it is reasonable to consider the influence of the rate at which one character in a protein sequence changes to others over time. We adopt SMR based on BLOSUM62, which is the default matrix for protein BLAST and is considered to be powerful for detecting most weak protein similarities.

In the second step, we apply the weighted sparse representation based classifier (WSRC), a variant of traditional SRC, to classify the interacting and noninteracting protein pairs based on their feature representation. Sparse representation, which originated in the signal processing area, has recently become a popular technique. It addresses pattern classification problems in a novel way and has proved sufficiently robust against illumination variations, occlusions, and random noise. In addition, unlike traditional classifiers such as the support vector machine [21, 22] and the neural network [23], which need much effort to tune their parameters, SRC needs little manual intervention in sample classification. WSRC, which integrates both the sparsity and the locality structure of data, can further improve the classification performance of SRC. For this reason, we use the weighted sparse representation based classifier to build a computational classification system for predicting protein interactions.

In this paper, we propose a computational method for predicting PPIs from amino acid sequences that combines the substitution matrix representation and the weighted sparse representation based classifier. More specifically, we first adopt the substitution matrix representation based on BLOSUM62 to represent proteins as SMR matrices. Secondly, we utilize the discrete cosine transform to extract a 400-dimensional vector from each protein SMR matrix. As a result, each protein pair is represented by an 800-dimensional feature vector. Finally, WSRC is employed as the machine learning classifier to perform the classification. The proposed method was evaluated on three different PPI datasets: *Yeast*, *Human*, and *H. pylori*. To further evaluate the performance of the proposed method, we compare it with the state-of-the-art support vector machine classifier. Extensive cross-species experiments were also performed on five independent PPI datasets. In these experiments, we used experimentally identified interactions in one organism to predict the interactions in five other organisms, assuming that homologous proteins preserve their ability to interact. The experimental results show that the proposed method performs well in distinguishing interacting and noninteracting protein pairs. The results demonstrate that the proposed approach outperforms the previous methods on several PPI datasets and can be a useful supplementary tool to traditional experimental methods.

#### 2. Materials and Methodology

##### 2.1. Golden Standard Datasets

We verify the proposed method on a high-confidence *Saccharomyces cerevisiae* PPI dataset gathered from the publicly available Database of Interacting Proteins (DIP). Protein pairs that contain a protein with fewer than 50 residues are removed because such proteins might just be fragments. Protein pairs with too much sequence identity are generally considered to be homologous; thus the pairs which have ≥40% sequence identity are also deleted in order to eliminate the bias toward these homologous sequence pairs. After this filtering, the remaining 5594 protein pairs constitute the positive dataset. To construct the negative dataset, we selected 5594 additional protein pairs whose proteins have different subcellular localizations. Consequently, the whole dataset is made up of 11188 protein pairs, of which half are positive samples and half are negative samples.

In order to demonstrate the generality of the proposed method, we also verify our approach on two other PPI datasets. We collected the first dataset from the *Human* Protein Reference Database (HPRD). We removed those protein pairs which have ≥25% sequence identity. To build the golden standard positive dataset, we used the remaining 3899 protein-protein pairs of experimentally verified PPIs from 2502 different human proteins. For the golden standard negative dataset, we followed the previous work [24], assuming that proteins in different subcellular compartments do not interact with each other, and obtained 4262 protein pairs from 661 different *human* proteins. As a result, the *Human* dataset consists of 8161 protein pairs. The second PPI dataset consists of 2916 *Helicobacter pylori* protein pairs (1458 interacting pairs and 1458 noninteracting pairs) as described by Martin et al.

##### 2.2. Substitution Matrix Representation

Substitution matrix representation (SMR) is a variant of the representation method proposed in [36]. In this matrix representation for proteins, an $L \times 20$ matrix is generated to represent a given $L$-length protein sequence based on a substitution matrix. In our work, we applied the BLOSUM62 matrix, a popular substitution matrix used for sequence alignment of proteins, to this transformation. SMR can be depicted as follows:

$$\mathrm{SMR}(i, j) = B(p_i, j), \quad i = 1, \ldots, L, \; j = 1, \ldots, 20, \tag{1}$$

where $B$ denotes BLOSUM62, a $20 \times 20$ substitution matrix in which $B(a, b)$ represents the probability rate of amino acid $a$ mutating to amino acid $b$ in the evolution process, and $P = (p_1, p_2, \ldots, p_L)$ is the given protein sequence constructed by $L$ amino acids.
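The construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that each residue simply contributes its row of the substitution matrix, and it uses a four-residue excerpt of BLOSUM62 (A, R, N, D) so that the example stays short. The names `BLOSUM62_SUBSET` and `smr` are hypothetical.

```python
# Excerpt of BLOSUM62 restricted to the residues A, R, N, D (columns in that order).
# Assumption: a real pipeline would use the full 20 x 20 matrix instead.
BLOSUM62_SUBSET = {
    "A": [4, -1, -2, -2],
    "R": [-1, 5, 0, -2],
    "N": [-2, 0, 6, 1],
    "D": [-2, -2, 1, 6],
}


def smr(sequence, matrix=BLOSUM62_SUBSET):
    """Return the SMR matrix: row i is the substitution-matrix row of residue i."""
    return [list(matrix[residue]) for residue in sequence]


# A length-5 toy sequence yields a 5 x 4 matrix (5 x 20 with the full matrix).
example = smr("ARNDA")
```

With the full matrix, a protein of length $L$ would map to an $L \times 20$ array, so proteins of different lengths still produce matrices of different heights; the fixed-length vector only appears after the transform described in the next subsection.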

##### 2.3. Discrete Cosine Transform

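Discrete cosine transform (DCT), first proposed by Ahmed et al. [37], is a popular linear separable transformation in lossy signal and image compression owing to its powerful energy compaction property. The DCT converts an input signal into elementary frequency components. Small high-frequency components can then be discarded, which yields high compaction efficiency. The discrete cosine transform can be defined as follows:

$$\mathrm{DCT}(i, j) = a_i b_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n) \cos\frac{\pi (2m + 1) i}{2M} \cos\frac{\pi (2n + 1) j}{2N}, \quad 0 \le i \le M - 1, \; 0 \le j \le N - 1, \tag{2}$$

$$a_i = \begin{cases} \sqrt{1/M}, & i = 0, \\ \sqrt{2/M}, & 1 \le i \le M - 1, \end{cases} \qquad b_j = \begin{cases} \sqrt{1/N}, & j = 0, \\ \sqrt{2/N}, & 1 \le j \le N - 1, \end{cases} \tag{3}$$

where $x \in \mathbb{R}^{M \times N}$ is the input signal matrix and here denotes the SMR matrix, with $M$ equal to the sequence length and $N = 20$. In our work, the final DCT feature descriptor which represents a protein sequence is obtained by choosing the first 400 coefficients.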
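As a rough illustration of this step, the sketch below computes an orthonormal two-dimensional DCT-II directly from its definition and then keeps a low-frequency block. The text does not say how the first 400 coefficients are ordered; taking the top-left 20 × 20 block in row-major order is an assumption made here, and the helper names `dct2` and `low_frequency_block` are hypothetical. The naive quadruple loop is only meant for small inputs.

```python
import math


def dct2(x):
    """Naive orthonormal 2-D DCT-II of a matrix given as a list of rows."""
    M, N = len(x), len(x[0])

    def scale(k, size):
        return math.sqrt(1.0 / size) if k == 0 else math.sqrt(2.0 / size)

    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            total = 0.0
            for m in range(M):
                cos_m = math.cos(math.pi * (2 * m + 1) * i / (2 * M))
                for n in range(N):
                    cos_n = math.cos(math.pi * (2 * n + 1) * j / (2 * N))
                    total += x[m][n] * cos_m * cos_n
            out[i][j] = scale(i, M) * scale(j, N) * total
    return out


def low_frequency_block(coeffs, rows, cols):
    """Flatten the top-left rows x cols block (assumed to be the 'first' coefficients)."""
    return [coeffs[i][j] for i in range(rows) for j in range(cols)]


# Toy 6 x 4 input; with a real SMR matrix one would keep rows=20, cols=20.
toy = [[float((r * 3 + c) % 5) for c in range(4)] for r in range(6)]
feature = low_frequency_block(dct2(toy), rows=2, cols=2)
```

Under this assumption, concatenating the 400-dimensional vectors of the two proteins in a pair would give the 800-dimensional pair descriptor mentioned in the Introduction.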

##### 2.4. Weighted Sparse Representation Based Classification (WSRC)

With the advancement in mathematical studies on linear representation methods (LRBM) and compressed sensing (CS) theory, sparse representation has recently earned considerable attention in the fields of signal processing, computer vision, and pattern recognition. The sparse representation based classification (SRC) [38, 39] assumes that it is sufficient to represent a given test sample by using samples from the same subject. Based on this viewpoint, SRC computes a sparse representation through a specific optimization strategy, aiming to build a linear combination of the training set that represents the given test sample. Using this sparse representation, the reconstruction residual of each class is calculated, and the test sample is finally assigned to the class which has the minimum reconstruction residual.

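Given a training sample matrix $X \in \mathbb{R}^{d \times n}$ which is made up of $n$ $d$-dimensional training samples, assume that there are sufficient training samples belonging to the $k$th class. Samples of the $k$th class make up a submatrix $X_k = [x_{k,1}, \ldots, x_{k,n_k}]$, where $x_{k,i}$ denotes the $i$th sample of the $k$th class and $n_k$ is the number of samples belonging to the $k$th class. Thus, the sample matrix can be rewritten as $X = [X_1, \ldots, X_K]$, where $K$ is the number of classes. Given a test sample $y \in \mathbb{R}^{d}$, SRC represents it as a linear combination of the training samples of the $k$th class:

$$y = \alpha_{k,1} x_{k,1} + \alpha_{k,2} x_{k,2} + \cdots + \alpha_{k,n_k} x_{k,n_k}, \tag{4}$$

which can be further written in terms of the whole training set as follows:

$$y = X \alpha_0, \tag{5}$$

where $\alpha_0 = [0, \ldots, 0, \alpha_{k,1}, \alpha_{k,2}, \ldots, \alpha_{k,n_k}, 0, \ldots, 0]^{T}$. Because the nonzero entries in $\alpha_0$ are only associated with the $k$th class, $\alpha_0$ becomes sparse when the number of classes is large. The key of the SRC algorithm is searching for the vector $\alpha$ which not only satisfies (5) but also minimizes its own $\ell_0$-norm:

$$\hat{\alpha}_0 = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{subject to } y = X\alpha. \tag{6}$$

However, problem (6) is NP-hard and difficult to solve precisely. According to the theory of compressive sensing, when $\alpha$ is sparse enough, it is feasible to solve the related convex $\ell_1$-minimization problem instead of solving the $\ell_0$-minimization problem directly:

$$\hat{\alpha}_1 = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to } y = X\alpha. \tag{7}$$

To deal with occlusion, (7) needs to be extended to the stable $\ell_1$-minimization problem:

$$\hat{\alpha}_1 = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to } \|y - X\alpha\|_2 \le \varepsilon, \tag{8}$$

where $\varepsilon > 0$ denotes the tolerance of reconstruction error. Given the solution $\hat{\alpha}_1$ from (8), the SRC algorithm assigns the label of test sample $y$ to class $c$ based on the following rule:

$$\min_{c} r_c(y) = \|y - X\hat{\alpha}_1^{c}\|_2, \quad c = 1, \ldots, K, \tag{9}$$

where $\hat{\alpha}_1^{c}$ keeps the entries of $\hat{\alpha}_1$ associated with class $c$ and sets all other entries to zero.

In short, traditional SRC represents a test sample as a sparse combination of training samples and assigns it to the class which minimizes the residual between itself and $X\hat{\alpha}_1^{c}$.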

Nearest Neighbor (NN) is another distinct classifier, which classifies the test sample by using only its nearest neighbor in the training data. It utilizes the locality structure of data but easily suffers from noise. Locality measures the similarity between the query and training samples and is a key issue in the fields of clustering, dimension reduction, density estimation, anomaly detection, and image classification. Studies [15, 40, 41] show that, in some cases, locality is more essential than sparsity. Although SRC uses the linearity structure of data and overcomes the drawback of NN, the original sparse coding does not guarantee locality, which can cause instability. For this reason, it is reasonable to integrate the locality structure of data into sparse representation. Lu et al. [42] have recently proposed a variant of the traditional sparse representation based classifier called the weighted sparse representation based classifier (WSRC). This variant possesses the advantages of both the traditional sparse representation based classifier and the Nearest Neighbor classifier. Appropriate kernel methods map samples into a high-dimensional feature space and usually lead to better classification performance. For this reason, WSRC first utilizes a distance based on the Gaussian kernel to measure the similarity between samples. The Gaussian-based distance can be described as follows:

$$d_G(s_1, s_2) = e^{-\|s_1 - s_2\|^2 / 2\sigma^2}, \tag{10}$$

where $s_1, s_2 \in \mathbb{R}^{d}$ denote two samples and $\sigma$ is the Gaussian kernel width. These Gaussian distance values are then used as the weights of each sample in the training set and adjust the training sample matrix $X$ into a new matrix $X'$ [43, 44]. In this way, the weighted sparse representation based classifier is capable of retaining the locality structure of data. WSRC then solves the following problem:

$$\hat{\alpha}_1 = \arg\min_{\alpha} \|W\alpha\|_1 \quad \text{subject to } y = X\alpha, \tag{11}$$

$$\operatorname{diag}(W) = [d_G(y, x_{1,1}), \ldots, d_G(y, x_{K,n_K})]^{T},$$

where $W$ is a block-diagonal matrix of locality adaptors and $n_k$ is the number of training samples in class $k$.
To deal with occlusion, WSRC finally solves the following stable $\ell_1$-minimization problem:

$$\hat{\alpha}_1 = \arg\min_{\alpha} \|W\alpha\|_1 \quad \text{subject to } \|y - X\alpha\|_2 \le \varepsilon, \tag{12}$$

where $\varepsilon > 0$ is the tolerance value.

Given all these, the WSRC algorithm can be summarized as follows.

*Algorithm 1.* Weighted Sparse Representation Based Classifier (WSRC)

1. Input: training samples matrix $X \in \mathbb{R}^{d \times n}$ and any test sample $y \in \mathbb{R}^{d}$.
2. Normalize the columns of $X$ to have unit $\ell_2$-norm.
3. Calculate the Gaussian distances between $y$ and each sample in $X$ and make up matrix $W$. Use matrix $W$ to adjust the training samples matrix $X$ to $X'$.
4. Solve the stable $\ell_1$-minimization problem defined in (12).
5. Compute the residual of each class: $r_c(y) = \|y - X\hat{\alpha}_1^{c}\|_2$ ($c = 1, 2, \ldots, K$).
6. Output: the prediction label of $y$ as $\operatorname{identity}(y) = \arg\min_{c} r_c(y)$.
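The steps of Algorithm 1 can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the constrained problem in step 4 is replaced by its penalized (Lagrangian) form, $\tfrac{1}{2}\|y - X\alpha\|_2^2 + \lambda \sum_j w_j |\alpha_j|$, and solved by plain coordinate descent; the test sample is also normalized; and the values of `sigma`, `lam`, and `sweeps` are arbitrary illustration choices. The Gaussian values are used as weights exactly as the text states. The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit l2-norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def gaussian_distance(s1, s2, sigma):
    """Gaussian-kernel value exp(-||s1 - s2||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(s1, s2))
    return math.exp(-sq / (2.0 * sigma ** 2))


def wsrc_predict(train, labels, y, sigma=1.0, lam=0.01, sweeps=200):
    """Return (predicted label, per-class residuals) for test sample y."""
    cols = [normalize(x) for x in train]                 # step 2
    y = normalize(y)                                     # assumption: y is normalized too
    w = [gaussian_distance(y, c, sigma) for c in cols]   # step 3: diagonal of W
    alpha = [0.0] * len(cols)
    resid = list(y)                                      # y - X alpha, with alpha = 0
    for _ in range(sweeps):                              # step 4 (penalized form)
        for j, c in enumerate(cols):
            # Correlation of column j with the residual that excludes its own term.
            rho = sum(ci * ri for ci, ri in zip(c, resid)) + alpha[j]
            new = math.copysign(max(abs(rho) - lam * w[j], 0.0), rho)
            delta = new - alpha[j]
            if delta:
                resid = [ri - delta * ci for ri, ci in zip(resid, c)]
                alpha[j] = new
    scores = {}
    for cls in set(labels):                              # step 5: class-wise residuals
        recon = [0.0] * len(y)
        for a, c, lab in zip(alpha, cols, labels):
            if lab == cls:
                recon = [r + a * ci for r, ci in zip(recon, c)]
        scores[cls] = math.sqrt(sum((yi - ri) ** 2 for yi, ri in zip(y, recon)))
    return min(scores, key=scores.get), scores           # step 6


# Toy data: two samples per class in three dimensions (labels 1 and 0).
train = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
labels = [1, 1, 0, 0]
label, scores = wsrc_predict(train, labels, [1.0, 0.05, 0.0])
```

The soft-thresholding update is valid here because every column has unit norm after step 2. A dedicated convex solver for (12) would be the more faithful choice when the tolerance $\varepsilon$ must be enforced exactly.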

#### 3. Results and Discussion

##### 3.1. Evaluation Measures

To evaluate the performance of the proposed method, we calculated the following criteria: the overall prediction accuracy (Accu.), sensitivity (Sens.), precision (Prec.), and Matthews’s correlation coefficient (MCC). They are defined as follows:

$$\begin{aligned} \text{Accu.} &= \frac{TP + TN}{TP + FP + TN + FN}, \\ \text{Sens.} &= \frac{TP}{TP + FN}, \\ \text{Prec.} &= \frac{TP}{TP + FP}, \\ \text{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}, \end{aligned}$$

where true positive (TP) denotes the number of true interacting pairs which are predicted correctly; false negative (FN) is the number of true interacting pairs incorrectly predicted to be noninteracting pairs; false positive (FP) is the number of true noninteracting pairs falsely predicted to be PPIs; and true negative (TN) is the number of true noninteracting pairs predicted correctly. Furthermore, receiver operating characteristic (ROC) curves are computed to evaluate the performance of the proposed method. To summarize each ROC curve numerically, the area under the ROC curve (AUC) is computed.
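The four criteria follow directly from the confusion-matrix counts, as the short sketch below shows. The function name `evaluate` and the counts in the example are hypothetical and are not results from this study; returning 0.0 when a denominator vanishes is a convention chosen here.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, precision, and MCC from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumption: undefined ratios map to 0.0

    mcc_den = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accu": ratio(tp + tn, tp + fp + tn + fn),
        "sens": ratio(tp, tp + fn),
        "prec": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for illustration only.
metrics = evaluate(tp=90, fp=10, tn=85, fn=15)
```

For these made-up counts the accuracy is 175/200 = 0.875 and the precision is 90/100 = 0.9.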

##### 3.2. Assessment of Prediction Ability

For the sake of fairness, the corresponding parameters of the weighted sparse representation based classifier, the Gaussian kernel width $\sigma$ and the tolerance $\varepsilon$, were set to the same values for the three datasets: *Yeast*, *Human*, and *H. pylori*. In addition, 5-fold cross-validation was employed in our experiments in order to avoid overfitting of the prediction model and to test the stability of the proposed model [45]. Specifically, the whole dataset was divided into five parts, of which four were used for training and one was used for testing. In this way, we obtained 5 models from the original dataset, and each of them was evaluated separately. The prediction results of the SRC prediction models with the substitution matrix representation based description of protein sequences on the three datasets are shown in Tables 1–3.
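The 5-fold protocol can be sketched as an index split. The shuffling, the fixed seed, and the interleaved assignment of samples to folds are assumptions made for this sketch, since the text only states that the data were divided into five parts; the name `five_fold_indices` is hypothetical.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # assumption: random partition
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# 11188 is the size of the Yeast dataset described in Section 2.1.
splits = list(five_fold_indices(11188))
```

Each of the five splits would train one model on four parts and test it on the remaining part, and the reported figures would then be averaged over the five test parts.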