BioMed Research International

Volume 2016 (2016), Article ID 4783801, 9 pages

http://dx.doi.org/10.1155/2016/4783801

## Using the Relevance Vector Machine Model Combined with Local Phase Quantization to Predict Protein-Protein Interactions from Protein Sequences

^{1}School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 21116, China

^{2}Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China

Received 5 March 2016; Accepted 12 April 2016

Academic Editor: Xun Lan

Copyright © 2016 Ji-Yong An et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We propose a novel computational method, RVM-LPQ, that combines the Relevance Vector Machine (RVM) model and Local Phase Quantization (LPQ) to predict protein-protein interactions (PPIs) from protein sequences. The main improvements come from representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise using Principal Component Analysis (PCA), and using an RVM-based classifier. We perform 5-fold cross-validation experiments on the *Yeast* and *Human* datasets and achieve accuracies of 92.65% and 97.62%, respectively, which are higher than those reported in previous work. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the *Yeast* dataset. The experimental results show that the RVM-LPQ method outperforms the SVM-based method. These results suggest that the proposed method is efficient and simple and that it can serve as an automatic decision support tool for future proteomics research.

#### 1. Introduction

Proteins are crucial molecules that participate in many cellular functions in an organism. Typically, proteins do not perform their roles individually, so the detection of protein-protein interactions (PPIs) has become increasingly important. Knowledge of PPIs can provide insight into the molecular mechanisms of biological processes and lead to a better understanding of practical medical applications. In recent years, various high-throughput technologies, such as yeast two-hybrid screening methods [1, 2], immunoprecipitation [3], and protein chips [4], have been developed to detect interactions between proteins. A large quantity of PPI data for different organisms has been generated, and many databases, such as MINT [5], BIND [6], and DIP [7], have been built to store protein interaction data. However, these experimental methods are time-intensive and costly. In addition, they suffer from high rates of false positives and false negatives. For these reasons, it is difficult to identify unknown PPIs using biological experimental methods alone.

As a result, a number of computational methods have been proposed to infer PPIs from different sources of information, including phylogenetic profiles, tertiary structures, protein domains, and secondary structures [8–16]. However, these approaches cannot be employed when prior knowledge about a protein of interest is not available. With the rapid growth of protein sequence data, sequence-based methods are becoming the most widely used tools for predicting PPIs, and a number of them have been developed. For example, Bock and Gough [17] used a support vector machine (SVM) combined with several structural and physicochemical descriptors to predict PPIs. Shen et al. [18] developed a conjoint triad method to infer human PPIs. Martin et al. [19] used a descriptor called the signature product of subsequences and an expansion of the signature descriptor based on the available chemical information to predict PPIs. Guo et al. [20] used the SVM model combined with an autocorrelation descriptor to predict *Yeast* PPIs. Nanni and Lumini [21] proposed a method based on an ensemble of K-local hyperplane distances to infer PPIs. Several other methods based on protein amino acid sequences have been proposed in previous work [22, 23]. Nevertheless, there is still room to improve the accuracy and efficiency of the existing methods.

In this paper, we propose a novel computational method that predicts PPIs using only protein sequence data. The main improvements come from representing protein sequences using the Local Phase Quantization (LPQ) feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise by using Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. More specifically, we first represent each protein as a PSSM. Then, an LPQ descriptor is employed to capture useful information from each protein PSSM and generate a 256-dimensional feature vector. Next, PCA is used to reduce the dimensionality of the LPQ vector and the influence of noise. Finally, the RVM model is employed as the classifier. The proposed method was evaluated on two PPI datasets: *Yeast* and *Human*. The experimental results are superior to those of the SVM and other previous methods, which indicates that the proposed method performs well in predicting PPIs.

#### 2. Materials and Methodology

##### 2.1. Dataset

To verify the proposed method, two publicly available datasets are used in our study. The datasets, *Yeast* and *Human*, were obtained from the publicly available Database of Interacting Proteins (DIP) [24]. From the *Yeast* dataset, we selected 5594 positive protein pairs to build the positive dataset and 5594 negative protein pairs to build the negative dataset. Similarly, from the *Human* dataset, we selected 3899 positive protein pairs to build the positive dataset and 4262 negative protein pairs to build the negative dataset. Consequently, the *Yeast* dataset contains 11188 protein pairs and the *Human* dataset contains 8161 protein pairs.

##### 2.2. Position Specific Scoring Matrix

A Position Specific Scoring Matrix (PSSM) is an $L \times 20$ matrix for a given protein, where $L$ is the length of the protein sequence and 20 represents the 20 amino acids [28–33]. A score $P_{ij}$ is allocated for the $j$th amino acid in the $i$th position of the given protein sequence in the PSSM. The score is expressed as $P_{ij} = \sum_{k=1}^{20} p(i,k) \times q(j,k)$, where $p(i,k)$ is the ratio of the frequency of the $k$th amino acid appearing at position $i$ of the probe to the total number of probes and $q(j,k)$ is the value of Dayhoff’s mutation matrix [34] between the $j$th and $k$th amino acids [35–37]. As a result, a high score represents a largely conserved position and a low score represents a weakly conserved position [38–40].

PSSMs have been used to predict protein folding patterns, protein quaternary structural attributes, and disulfide connectivity [41, 42]. Here, we use PSSMs to predict PPIs. We used Position Specific Iterated BLAST (PSI-BLAST) [43] to create a PSSM for each protein sequence. The $E$-value parameter was set to 0.001, and three iterations were used to obtain broadly and highly homologous sequences. Each resulting PSSM is a matrix with 20 columns, composed of $L \times 20$ elements, where $L$ is the total number of residues in the protein. The rows of the matrix represent the protein residues, and the columns represent the 20 amino acids.
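The score definition can be illustrated with a small numerical sketch. It assumes the score takes the form $P_{ij} = \sum_{k} p(i,k)\, q(j,k)$, with `p` an $L \times 20$ matrix of observed amino acid frequencies and `q` a $20 \times 20$ substitution matrix. The function name `pssm_scores` and the random placeholder values are hypothetical; they stand in for real alignment frequencies and for Dayhoff's matrix, and this is not the PSI-BLAST computation itself.

```python
import numpy as np

def pssm_scores(p, q):
    """Return P with P[i, j] = sum_k p[i, k] * q[j, k]."""
    p = np.asarray(p, dtype=float)   # L x 20 position-wise frequencies
    q = np.asarray(q, dtype=float)   # 20 x 20 substitution values
    return p @ q.T

rng = np.random.default_rng(0)
L = 5                                # toy sequence length (assumption)
p = rng.random((L, 20))
p /= p.sum(axis=1, keepdims=True)    # each row is a frequency distribution
q = rng.normal(size=(20, 20))
q = (q + q.T) / 2                    # placeholder symmetric matrix, not Dayhoff's
P = pssm_scores(p, q)
print(P.shape)
```

Each row of `P` corresponds to one residue position and each column to one of the 20 amino acids, matching the layout described above.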

##### 2.3. Local Phase Quantization

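Local Phase Quantization (LPQ) has been described in detail in the literature [44]. The LPQ method is based on the blur invariance property of the Fourier phase spectrum [45–47]. It is an operator designed to handle spatial blur in the textural features of images. The spatially invariant blurring of an original image $f(\mathbf{x})$ that appears in an observed image $g(\mathbf{x})$ can be expressed as a convolution, given by

$$g(\mathbf{x}) = (f * h)(\mathbf{x}), \tag{1}$$

where $h(\mathbf{x})$ is the point spread function of the blur, $*$ represents two-dimensional convolution, and $\mathbf{x}$ is a vector of coordinates $[x, y]^{T}$. In the Fourier domain, this amounts to

$$G(\mathbf{u}) = F(\mathbf{u}) \cdot H(\mathbf{u}), \tag{2}$$

where $G(\mathbf{u})$, $F(\mathbf{u})$, and $H(\mathbf{u})$ are the discrete Fourier transforms (DFT) of the blurred image $g(\mathbf{x})$, the original image $f(\mathbf{x})$, and $h(\mathbf{x})$, respectively, and $\mathbf{u}$ is a vector of coordinates $[u, v]^{T}$. According to the properties of the Fourier transform, the phase relations can be expressed as

$$\angle G(\mathbf{u}) = \angle F(\mathbf{u}) + \angle H(\mathbf{u}). \tag{3}$$

When the point spread function is centrally symmetric, meaning $h(\mathbf{x}) = h(-\mathbf{x})$, the Fourier transform of $h(\mathbf{x})$ is always real-valued. As a result, its phase can be expressed as a two-valued function, given by

$$\angle H(\mathbf{u}) = \begin{cases} 0, & H(\mathbf{u}) \ge 0, \\ \pi, & H(\mathbf{u}) < 0. \end{cases} \tag{4}$$

This means that

$$\angle G(\mathbf{u}) = \angle F(\mathbf{u}) \quad \text{for all } H(\mathbf{u}) \ge 0. \tag{5}$$

The shape of the point spread function is similar to the Gaussian or sinc function. This ensures that $H(\mathbf{u}) \ge 0$ and $\angle G(\mathbf{u}) = \angle F(\mathbf{u})$ at low frequencies, which means that the phase at those frequencies is invariant to blur. The local phase information can be extracted using the two-dimensional DFT in LPQ. In other words, a short-term Fourier transform (STFT) computed over a rectangular $M \times M$ neighborhood $N_{\mathbf{x}}$ at each pixel position $\mathbf{x}$ of an image $f(\mathbf{x})$ is represented by

$$F(\mathbf{u}, \mathbf{x}) = \sum_{\mathbf{y} \in N_{\mathbf{x}}} f(\mathbf{x} - \mathbf{y})\, e^{-j 2 \pi \mathbf{u}^{T} \mathbf{y}} = \mathbf{w}_{\mathbf{u}}^{T} \mathbf{f}_{\mathbf{x}}, \tag{6}$$

where $\mathbf{w}_{\mathbf{u}}$ is the basis vector of the two-dimensional DFT at frequency $\mathbf{u}$ and $\mathbf{f}_{\mathbf{x}}$ is another vector containing all $M^{2}$ image samples from $N_{\mathbf{x}}$. Using LPQ, the Fourier coefficients of four frequencies are calculated: $\mathbf{u}_{1} = [a, 0]^{T}$, $\mathbf{u}_{2} = [0, a]^{T}$, $\mathbf{u}_{3} = [a, a]^{T}$, and $\mathbf{u}_{4} = [a, -a]^{T}$, where $a$ is a small enough number to satisfy $H(\mathbf{u}_{i}) \ge 0$. As a result, each pixel point can be expressed as a vector, given by

$$\mathbf{F}_{\mathbf{x}} = \left[ \operatorname{Re}\{F(\mathbf{u}_{1}, \mathbf{x}), F(\mathbf{u}_{2}, \mathbf{x}), F(\mathbf{u}_{3}, \mathbf{x}), F(\mathbf{u}_{4}, \mathbf{x})\},\; \operatorname{Im}\{F(\mathbf{u}_{1}, \mathbf{x}), F(\mathbf{u}_{2}, \mathbf{x}), F(\mathbf{u}_{3}, \mathbf{x}), F(\mathbf{u}_{4}, \mathbf{x})\} \right]^{T}. \tag{7}$$

Then, using a simple scalar quantizer, the resulting vectors are quantized, given by

$$q_{j} = \begin{cases} 1, & f_{j} \ge 0, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$

where $f_{j}$ is the $j$th component of $\mathbf{F}_{\mathbf{x}}$. After quantization, $\mathbf{F}_{\mathbf{x}}$ becomes an eight-bit binary vector, and each component $q_{j}$ is assigned a weight of $2^{j-1}$. As a result, the quantized coefficients are represented as integer values between 0 and 255 using binary coding:

$$b = \sum_{j=1}^{8} q_{j}\, 2^{j-1}. \tag{9}$$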

Finally, a histogram of these integer values from all image positions is composed and used as a 256-dimensional feature vector in classification. In this paper, the PSSM of each protein from the *Yeast* and *Human* datasets was converted to a 256-dimensional feature vector using this LPQ method.
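The LPQ steps described in this section (local Fourier coefficients at four low frequencies, sign quantization of their real and imaginary parts, and a 256-bin histogram of the resulting 8-bit codes) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. The function name `lpq_histogram`, the $3 \times 3$ window, and the choice $a = 1/3$ are assumptions, since the window size is not stated here. The first coordinate is taken to be the row (residue) index, and the decorrelation step used in some LPQ variants is omitted.

```python
import numpy as np

def lpq_histogram(img, win=3):
    """Return the 256-bin histogram of LPQ codes for a 2-D array."""
    img = np.asarray(img, dtype=float)
    r = win // 2
    a = 1.0 / win                              # assumed lowest non-zero frequency
    offsets = np.arange(-r, r + 1)
    # The four frequency points [a, 0], [0, a], [a, a], [a, -a]
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]
    out_r, out_c = img.shape[0] - 2 * r, img.shape[1] - 2 * r
    if out_r <= 0 or out_c <= 0:
        raise ValueError("input is smaller than the analysis window")

    coeffs = []
    for u, v in freqs:
        acc = np.zeros((out_r, out_c), dtype=complex)
        for dy in offsets:
            for dx in offsets:
                # f(x - y) for the neighbourhood offset y = (dy, dx)
                patch = img[r - dy:r - dy + out_r, r - dx:r - dx + out_c]
                acc += patch * np.exp(-2j * np.pi * (u * dy + v * dx))
        coeffs.append(acc)

    # Eight components per pixel: four real parts, then four imaginary parts
    parts = [c.real for c in coeffs] + [c.imag for c in coeffs]
    codes = np.zeros((out_r, out_c), dtype=np.int64)
    for j, part in enumerate(parts):
        codes += (part >= 0).astype(np.int64) << j   # bit j carries weight 2**j
    return np.bincount(codes.ravel(), minlength=256)

rng = np.random.default_rng(0)
toy_pssm = rng.integers(-10, 11, size=(30, 20))     # stand-in for a real PSSM
hist = lpq_histogram(toy_pssm)
print(hist.shape, hist.sum())
```

Only positions where the full window fits are coded, so a $30 \times 20$ input yields $28 \times 18$ codes. The histogram length is fixed at 256 regardless of the protein length, which is what allows proteins of different lengths to share one feature space.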

##### 2.4. Principal Component Analysis

Principal Component Analysis (PCA) is widely used to process data and reduce the dimensions of datasets. In this way, high-dimensional information can be projected to a low-dimensional subspace, while retaining the main information. The basic principle of PCA is as follows.

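A multivariate dataset can be expressed as the following matrix $X$:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}, \tag{10}$$

where $n$ is the number of variables and $m$ is the number of samplings of each variable. PCA is closely related to the singular value decomposition (SVD) of $X$, which (for $m \ge n$) is given by

$$X = \sigma_{1} u_{1} v_{1}^{T} + \sigma_{2} u_{2} v_{2}^{T} + \cdots + \sigma_{n} u_{n} v_{n}^{T}, \tag{11}$$

where $u_{i}$ is an eigenvector of $X X^{T}$, $v_{i}$ is an eigenvector of $X^{T} X$, and $\sigma_{i}$ is the corresponding singular value. If there are $r$ linear relationships between the variables, then $r$ singular values are zero. Any row $x_{i}^{T}$ of $X$ can be expressed in terms of the eigenvectors $v_{1}, v_{2}, \ldots, v_{n}$:

$$x_{i}^{T} = (x_{i}^{T} v_{1})\, v_{1}^{T} + (x_{i}^{T} v_{2})\, v_{2}^{T} + \cdots + (x_{i}^{T} v_{n})\, v_{n}^{T} = t_{i1} v_{1}^{T} + t_{i2} v_{2}^{T} + \cdots + t_{in} v_{n}^{T}, \tag{12}$$

where $t_{ij} = x_{i}^{T} v_{j}$ is the projection of $x_{i}$ on $v_{j}$, the eigenvector $v_{j}$ is the load vector, and $t_{ij}$ is the score.

When there is a certain degree of linear correlation between the variables of $X$, the projections onto the last few load vectors are small and result mainly from measurement noise. As a result, the principal decomposition of $X$ is represented by

$$X = t_{1} v_{1}^{T} + t_{2} v_{2}^{T} + \cdots + t_{k} v_{k}^{T} + E, \tag{13}$$

where $t_{j} = X v_{j}$ is the $j$th score vector, $k < n$ is the number of retained components, and $E$ is the error matrix, which can be ignored without an obvious loss of useful information. In this paper, to reduce the influence of noise and improve the prediction accuracy, we use PCA to reduce the dimensionality of the *Yeast* dataset from 256 to 180 and the dimensionality of the *Human* dataset from 256 to 172.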
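The projection onto the leading load vectors can be sketched with the SVD in NumPy. This is a minimal sketch under stated assumptions: the function name `pca_reduce` is hypothetical, the columns are mean-centered before the decomposition (a common convention that the text does not state), and random data stands in for the 256-dimensional LPQ features.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the first k load vectors."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centering is an assumption
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loads = Vt[:k].T                                # columns are load vectors
    scores = Xc @ loads                             # one score row per sample
    return scores, loads, mean, s

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))                     # placeholder feature matrix
T, V, mu, s = pca_reduce(X, 180)                    # 256 -> 180, as for Yeast
print(T.shape)
```

New samples would be reduced with the same `mu` and `V`, that is, `(x_new - mu) @ V`, so that training and test features share one subspace.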

##### 2.5. Relevance Vector Machine

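The characteristics of the Relevance Vector Machine have been described in detail in the literature [48]. For binary classification problems, assume that the training sample set is $\{x_{n}, t_{n}\}_{n=1}^{N}$, where $x_{n} \in \mathbb{R}^{d}$ is a training sample, $t_{n} \in \{0, 1\}$ represents the training sample label, $t_{*}$ represents the testing sample label, and $t_{n} = y(x_{n}; \mathbf{w}) + \varepsilon_{n}$, where $y(x; \mathbf{w}) = \sum_{j=1}^{N} w_{j} K(x, x_{j}) + w_{0}$ is the classification prediction model and $\varepsilon_{n}$ is additive noise with a mean of zero and a variance of $\sigma^{2}$, that is, $\varepsilon_{n} \sim N(0, \sigma^{2})$. Assuming that the training samples are independent and identically distributed, the observation vector $\mathbf{t}$ obeys the following distribution [49–51]:

$$p(\mathbf{t} \mid \mathbf{w}, \sigma^{2}) = (2 \pi \sigma^{2})^{-N/2} \exp\left( -\frac{\lVert \mathbf{t} - \Phi \mathbf{w} \rVert^{2}}{2 \sigma^{2}} \right), \tag{14}$$

where $\Phi$ is defined as follows:

$$\Phi = \begin{bmatrix} 1 & K(x_{1}, x_{1}) & \cdots & K(x_{1}, x_{N}) \\ \vdots & \vdots & & \vdots \\ 1 & K(x_{N}, x_{1}) & \cdots & K(x_{N}, x_{N}) \end{bmatrix}. \tag{15}$$

The RVM uses the sample labels $\mathbf{t}$ to predict the testing sample label $t_{*}$, given by

$$p(t_{*} \mid \mathbf{t}) = \int p(t_{*} \mid \mathbf{w}, \sigma^{2})\, p(\mathbf{w}, \sigma^{2} \mid \mathbf{t})\, d\mathbf{w}\, d\sigma^{2}. \tag{16}$$

To make most components of the weight vector $\mathbf{w}$ zero and to reduce the computational work of the kernel function, the weight vector is subjected to additional conditions. Assume that each $w_{j}$ obeys a Gaussian distribution with a mean of zero and a variance of $\alpha_{j}^{-1}$, so that $w_{j} \sim N(0, \alpha_{j}^{-1})$ and $p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{j=0}^{N} N(w_{j} \mid 0, \alpha_{j}^{-1})$, where $\boldsymbol{\alpha}$ is the hyperparameter vector of the prior distribution of the weight vector $\mathbf{w}$. Hence,

$$p(t_{*} \mid \mathbf{t}) = \int p(t_{*} \mid \mathbf{w}, \boldsymbol{\alpha}, \sigma^{2})\, p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^{2} \mid \mathbf{t})\, d\mathbf{w}\, d\boldsymbol{\alpha}\, d\sigma^{2}. \tag{17}$$

Because $p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^{2} \mid \mathbf{t})$ cannot be obtained in closed form, it must be resolved using Bayes' formula, given by

$$p(\mathbf{w}, \boldsymbol{\alpha}, \sigma^{2} \mid \mathbf{t}) = p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^{2})\, p(\boldsymbol{\alpha}, \sigma^{2} \mid \mathbf{t}), \qquad p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha}, \sigma^{2}) = \frac{p(\mathbf{t} \mid \mathbf{w}, \sigma^{2})\, p(\mathbf{w} \mid \boldsymbol{\alpha})}{p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^{2})}. \tag{18}$$

The integral of the product of $p(\mathbf{t} \mid \mathbf{w}, \sigma^{2})$ and $p(\mathbf{w} \mid \boldsymbol{\alpha})$ is given by

$$p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^{2}) = \int p(\mathbf{t} \mid \mathbf{w}, \sigma^{2})\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w} = (2 \pi)^{-N/2} \lvert \Omega \rvert^{-1/2} \exp\left( -\tfrac{1}{2} \mathbf{t}^{T} \Omega^{-1} \mathbf{t} \right), \tag{19}$$

where $\Omega = \sigma^{2} I + \Phi A^{-1} \Phi^{T}$ and $A = \operatorname{diag}(\alpha_{0}, \alpha_{1}, \ldots, \alpha_{N})$. Because $p(\boldsymbol{\alpha}, \sigma^{2} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^{2})\, p(\boldsymbol{\alpha})\, p(\sigma^{2})$ cannot be solved analytically, the solution is approximated using the maximum likelihood method, represented by

$$(\boldsymbol{\alpha}_{MP}, \sigma^{2}_{MP}) = \arg\max_{\boldsymbol{\alpha}, \sigma^{2}} p(\mathbf{t} \mid \boldsymbol{\alpha}, \sigma^{2}). \tag{20}$$

The iterative process for $\boldsymbol{\alpha}_{MP}$ and $\sigma^{2}_{MP}$ is as follows:

$$\alpha_{j}^{\text{new}} = \frac{\gamma_{j}}{\mu_{j}^{2}}, \qquad (\sigma^{2})^{\text{new}} = \frac{\lVert \mathbf{t} - \Phi \boldsymbol{\mu} \rVert^{2}}{N - \sum_{j} \gamma_{j}}, \qquad \gamma_{j} = 1 - \alpha_{j} \Sigma_{jj}, \tag{21}$$

where $\Sigma_{jj}$ is the $j$th element on the diagonal of the posterior covariance $\Sigma = (\sigma^{-2} \Phi^{T} \Phi + A)^{-1}$ and $\mu_{j}$ is the $j$th component of the posterior mean $\boldsymbol{\mu} = \sigma^{-2} \Sigma \Phi^{T} \mathbf{t}$. Starting from initial values of $\boldsymbol{\alpha}$ and $\sigma^{2}$, the values of $\boldsymbol{\alpha}_{MP}$ and $\sigma^{2}_{MP}$ are approximated by repeatedly applying formula (21). After enough iterations, most of the $\alpha_{j}$ will approach infinity, the corresponding weights $w_{j}$ will be zero, and the remaining $\alpha_{j}$ will stay finite. The training samples $x_{j}$ that correspond to these finite $\alpha_{j}$ are referred to as the relevance vectors.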
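The iterative re-estimation can be sketched in NumPy. The sketch assumes the standard sparse Bayesian learning updates $\alpha_j \leftarrow \gamma_j / \mu_j^2$ and $\sigma^2 \leftarrow \lVert \mathbf{t} - \Phi\boldsymbol{\mu} \rVert^2 / (N - \sum_j \gamma_j)$ with $\gamma_j = 1 - \alpha_j \Sigma_{jj}$, and it treats the 0/1 labels as regression targets, as in the formulation above. The names `kernel_design` and `rvm_fit`, the Gaussian kernel and its width, the pruning threshold `alpha_cap`, and the numerical floors are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def kernel_design(X, centers, width=1.0):
    """Design matrix: a column of ones followed by Gaussian kernel values."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-d2 / (2.0 * width**2))])

def rvm_fit(Phi, t, n_iter=100, alpha_cap=1e6):
    """Iteratively re-estimate alpha and sigma^2, pruning large-alpha columns."""
    n_samples, n_basis = Phi.shape
    alpha = np.ones(n_basis)
    sigma2 = 0.1 * np.var(t) + 1e-6          # heuristic starting value
    active = np.arange(n_basis)

    def posterior(cols, s2):
        P = Phi[:, cols]
        Sigma = np.linalg.inv(P.T @ P / s2 + np.diag(alpha[cols]))
        mu = Sigma @ P.T @ t / s2
        return P, Sigma, mu

    for _ in range(n_iter):
        P, Sigma, mu = posterior(active, sigma2)
        gamma = np.clip(1.0 - alpha[active] * np.diag(Sigma), 1e-12, 1.0)
        alpha[active] = gamma / (mu**2 + 1e-12)
        resid = t - P @ mu
        sigma2 = max(float(resid @ resid) / max(n_samples - gamma.sum(), 1e-6), 1e-6)
        keep = alpha[active] < alpha_cap     # very large alpha: weight is ~0
        if not keep.any():
            break
        active = active[keep]

    _, _, mu = posterior(active, sigma2)     # posterior mean of surviving weights
    w = np.zeros(n_basis)
    w[active] = mu
    return w, active, sigma2

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(2.0, 0.5, size=(20, 2))])   # two toy classes
t = np.r_[np.zeros(20), np.ones(20)]
Phi = kernel_design(X, X)
w, active, sigma2 = rvm_fit(Phi, t)
pred = (Phi @ w >= 0.5).astype(int)          # threshold the regression output
print(len(active), "of", Phi.shape[1], "basis functions kept")
```

Columns whose $\alpha_j$ exceeds `alpha_cap` are dropped, which mirrors the statement that weights with very large $\alpha_j$ become zero. The training samples behind the surviving kernel columns play the role of relevance vectors.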

##### 2.6. Procedure of the Proposed Method

The proposed method consists of three steps: feature extraction, dimensionality reduction using PCA, and sample classification. Feature extraction has two substeps: (1) each protein from the datasets is represented as a PSSM and (2) the PSSM of each protein is expressed as a 256-dimensional vector using the LPQ method. Dimensionality reduction of the original feature vector is then achieved using PCA. Finally, sample classification is carried out in two ways: (1) the RVM model is used to classify the *Yeast* and *Human* datasets whose features have been extracted and (2) the SVM model is used to classify the *Yeast* dataset for comparison. The flow chart of the proposed method is displayed in Figure 1.