#### Abstract

Our goal in the present study is to introduce new wavelet-based methods for differentiating and classifying Class I and Class II PDZ domains and to compare the resulting signals. PDZ domains represent one of the most common protein homology regions, playing key roles in several diseases. To perform the classification, we developed two methods. The first of our methods was comparable to standard wavelet approaches, while the second surpasses them in recognition accuracy. Our models exhibited interesting results, and we anticipate that they can be used as a computational technique to screen out misfit candidates and to reduce the search space, while achieving high classification accuracy.

#### 1. Introduction

PDZs are structural domains contained in many proteins. They have been shown to act as key players in numerous diseases mediated by the PDZ domain interactions they participate in [1–3]. PDZs represent one of the most common protein domains found in the human genome and are made up of approximately 90 amino acids. They were initially identified based on the homology of three different proteins from which their name is derived. One of the main functions that PDZ domains have been recognized for is their scaffolding or mediator role [4–6] in the assembly of receptors at the cellular membrane interface. These interactions are governed by PDZ domain binding of the C-termini of their ligand proteins, more specifically the last four to six amino acid residues [7–9]. Overall, there are several PDZ domain classifications; however, the two most dominant recognition motifs are S/T-X-Φ for Class I PDZ domains and Φ-X-Φ for Class II PDZ domains, where X is any amino acid and Φ is a hydrophobic residue [10]. Most PDZ domains share a similar 3D globular module [11], as shown by NMR and crystallography structures. In general, PDZs are composed of six beta strands and two alpha helices; however, predicting peptide-PDZ domain interactions can be difficult. Our goal was to generate a computational model that could aid in interaction classification. Given that PDZ domains often exhibit strong binding towards certain peptide ligands, they can be classified based on the recognition sequence of interacting peptides. However, more recent reports make it clear that PDZ domains display significant promiscuity in binding [12], which poses a challenge for their classification. By extracting the relevant sequence information and using several wavelet transformations, we intended to develop a method that would save time and offer an alternative to expensive experiments. Work on this topic by Kalyoncu et al. [13] was based on a statistical learning model for automated prediction of PDZ domains. Chen et al. [14] used a statistical multidomain selectivity model to predict PDZ domain-peptide interactions across the mouse proteome. Another interesting classification technique, based on two critical positions of PDZ domains, was introduced by Bezprozvanny and Maximov [15]. In this study, however, we have converted the biological prediction problem into an engineering one by mapping protein domain sequences into signals based on 7 key physiochemical properties of amino acids (hydrophobicity [16], electronic [17], isoelectric point [18], polarity [19], volume [19], composition [20], and molecular weight [20]).

In this paper, the problem statement is as follows: given any PDZ domain sequence, for example, $S$ = GTRITLEEITLERA, predict to which class of PDZ (I or II) it belongs.

The core of our study is the feature extraction from the amino acid sequences. In order to extract features from our dataset, we have used wavelet-based signal processing methodologies, because they retain more abundant sequence-order information in both the frequency and time domains [21], which is key to our method. In this study, we developed two novel methods for feature extraction and classification. Feature extraction in the first method, named WAD-1, is achieved by first smoothing the signal using empirical mode decomposition (EMD) [1] and then treating the normalized signals with the maximum overlap discrete wavelet transform (MODWT) [22].

Our second method named WAD-2 uses trigram frequencies of amino acids followed by MODWT for feature extraction. The idea of amino acid frequency has been used popularly in many past [23, 24] and recent [13, 25, 26] proteomics analysis with various mathematical and statistical models.

Finally, we used different classifiers, namely, a parametric classifier [27] and K-nearest neighbors [27], to classify these features. The performance of the current methods shows improvement over the wavelets-only method, as reported in Table 1.

#### 2. Methods

##### 2.1. Empirical Mode Decomposition (EMD)

According to Huang et al. [28], empirical mode decomposition (EMD) [28] is a technique used to decompose a given signal into a set of elementary signals called "intrinsic mode functions" (IMFs). The algorithm operates by extracting one IMF in every iteration. Rato et al. [29] have proposed an improved EMD algorithm, which is summarized below from their paper.

Given any signal $s(n)$, the IMFs are generated by an iterative procedure titled sifting, which consists of the following steps.

(i) Locate every local maximum $M_1, M_2, \ldots$ and minimum $m_1, m_2, \ldots$ in $s(n)$.

(ii) Generate the interpolating signal for the maxima as the upper envelope $u(n)$ and for the minima as the lower envelope $l(n)$.

(iii) Compute the mean envelope $e(n) = [u(n) + l(n)]/2$.

(iv) Remove $e(n)$ from the original signal, and update it as $s(n) \leftarrow s(n) - e(n)$.

(v) Go back to step (i) and continue until the signal produced in step (iv) remains almost unvaried.

(vi) After obtaining an IMF $c_k(n)$, subtract it from the updated signal as $r(n) = s(n) - c_k(n)$, and jump to step (i) if multiple extrema (maxima or minima) are still present in $r(n)$.
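The sifting steps above can be sketched in a few lines of code. This is a minimal illustration only: it uses linear interpolation (`np.interp`) for the envelopes, whereas EMD implementations normally use cubic splines, and the stopping rule (`max_iter`, `tol`) is a simplified stand-in for the energy-ratio criterion described below.

```python
import numpy as np

def sift_one_imf(x, max_iter=50, tol=1e-3):
    """Extract one IMF from signal x by repeated sifting (sketch)."""
    h = np.asarray(x, dtype=float).copy()
    n = np.arange(len(h))
    for _ in range(max_iter):
        # step (i): locate interior local maxima and minima
        maxima = [i for i in range(1, len(h) - 1) if h[i - 1] < h[i] >= h[i + 1]]
        minima = [i for i in range(1, len(h) - 1) if h[i - 1] > h[i] <= h[i + 1]]
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema to build envelopes
        # step (ii): interpolate upper and lower envelopes
        upper = np.interp(n, maxima, h[maxima])
        lower = np.interp(n, minima, h[minima])
        # step (iii): mean envelope
        mean_env = (upper + lower) / 2.0
        # step (iv): subtract the mean envelope
        h = h - mean_env
        # step (v): stop when the mean envelope is negligible
        if np.mean(mean_env ** 2) < tol * np.mean(h ** 2):
            break
    return h
```

The residue $r(n) = s(n) - c_k(n)$ would then be fed back into the same routine to extract the next IMF.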

This IMF decomposition can be mathematically defined as [29]

$$s(n) = \sum_{k=1}^{K} c_k(n) + r(n), \tag{1}$$

where $c_k(n)$ is the $k$th IMF and $r(n)$ is the final residue.

We also compute a resolution ratio,

$$\rho = 10 \log_{10} \left( \frac{E_s}{E_e} \right). \tag{2}$$

Here, $E_s$ and $E_e$, respectively, indicate the energy of the original signal before sifting and the average energy of the upper and lower envelopes. If $\rho$ crosses over an allowed threshold of resolution, then the IMF computation is stopped.

##### 2.2. Maximum Overlap Discrete Wavelet Transform (MODWT)

MODWT was described by Percival and Walden [22]; it is a special case among the currently available wavelet-based techniques for the analysis of arbitrary-length discrete time series. MODWT differs from the DWT in the sense that it is a circular shift-invariant transform [6]; the idea here is that, if a circular shift operation is applied to the real time series data, then it generates identically shifted MODWT coefficients. The MODWT is well suited for any sequence length $N$, whereas for a complete decomposition of $J$ levels the DWT requires $N$ to be a multiple of $2^J$.

Moreover, we can interpret the MODWT as a cyclic version of the DWT, achieved by averaging over all nonredundant DWTs of shifted versions of the original series. However, while this operation smoothens the original DWT signals, it also increases the running time of the computer program. If we apply the DWT to an $N$-point time series, it requires $O(N)$ multiplications, whereas the MODWT for the same series requires $O(N \log_2 N)$ multiplications [30].

It is well suited for our study because, once the protein sequence is translated to a numerical sequence, it becomes a time series $X_0, X_1, \ldots, X_{N-1}$. In order to filter the signal at each level $j$, the MODWT treats the time series as periodic. The MODWT coefficients are given by [31]

$$\widetilde{W}_{j,t} = \sum_{l=0}^{L_j - 1} \widetilde{h}_{j,l} X_{(t-l) \bmod N}, \tag{3a}$$

$$\widetilde{V}_{j,t} = \sum_{l=0}^{L_j - 1} \widetilde{g}_{j,l} X_{(t-l) \bmod N}. \tag{3b}$$

Here, $\widetilde{h}_{j,l}$ and $\widetilde{g}_{j,l}$ are the high- and low-pass filters, respectively, of level $j$ and width $L_j$. It is also evident from the above equations that the unseen values $X_{-1}, X_{-2}, \ldots$ are taken to be the same as the observed values $X_{N-1}, X_{N-2}, \ldots$, which indicates that the MODWT induces "cyclic boundary conditions" during the wavelet transformation.

The cyclic boundary condition can be difficult to justify in the case of nonperiodic signals that exhibit discontinuities between start and end times [31]; therefore, a common alternative [31] is "reflection boundary conditions," in which the time series is extended to length $2N$ instead of $N$. Due to reflection symmetry, the unseen values are assigned the seen values $X_N = X_{N-1}$, $X_{N+1} = X_{N-2}$, $\ldots$, $X_{2N-1} = X_0$. Therefore, (3a) and (3b) can be rewritten as in [31]:

$$\widetilde{W}_{j,t} = \sum_{l=0}^{L_j - 1} \widetilde{h}_{j,l} \widetilde{X}_{(t-l) \bmod 2N}, \tag{4a}$$

$$\widetilde{V}_{j,t} = \sum_{l=0}^{L_j - 1} \widetilde{g}_{j,l} \widetilde{X}_{(t-l) \bmod 2N}. \tag{4b}$$

Here, $\widetilde{X}_t$ is the reflected extension of $X_t$.
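The circular filtering and the reflection extension can be illustrated with a minimal level-1 MODWT sketch. Note the assumptions: Haar filters are used here purely for brevity (the study itself uses the LA8 filter), and the MODWT filters are the usual DWT filters divided by $\sqrt{2}$.

```python
import numpy as np

# Level-1 MODWT Haar filters (DWT Haar filters divided by sqrt(2)).
g = np.array([0.5, 0.5])    # low-pass (scaling) filter
h = np.array([0.5, -0.5])   # high-pass (wavelet) filter

def modwt_circular(x, f):
    """W_t = sum_l f_l * x[(t - l) mod N] -- cyclic boundary condition."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([sum(f[l] * x[(t - l) % N] for l in range(len(f)))
                     for t in range(N)])

def modwt_reflection(x, f):
    """Same filtering on the series extended to length 2N by reflection."""
    x = np.asarray(x, dtype=float)
    xr = np.concatenate([x, x[::-1]])  # X_N = X_{N-1}, X_{N+1} = X_{N-2}, ...
    return modwt_circular(xr, f)
```

For the circular Haar case the energy-preservation property holds exactly: the squared wavelet and scaling coefficients sum to the energy of the input series.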

#### 3. Prediction with WAD-1

The improved EMD normalizes the signal to unit power [29], which was not the case with the original EMD [28]. Hence, the improved EMD is more suitable for low-amplitude biomedical signals [29], since it keeps all the components of a signal and does not decompose it into different levels of resolution. The improved EMD algorithm, however, does not reduce the feature space. Feature space reduction is an important part of our approach because a high-dimensional feature space increases the running time of prediction algorithms.

Therefore, we used a combination of MODWT and improved EMD for our feature extraction: improved EMD preprocesses a signal by interpolating all its maxima and minima, which in turn enhances the signal fit. The signal thus generated is fed to MODWT for feature extraction, because MODWT preserves more abundant sequence-order information in both the frequency and time domains while reducing the feature space, which in turn improves the run time of the classifiers. In essence, with this combination, we reduced the feature space without losing sequence-order information.

Our dataset is primarily based on the dataset of Tonikian et al. [32]. We have extracted only human PDZ domains for our work from their dataset.

Our first algorithm, WAD-1, operates in four phases, as shown in Figure 1. It can be seen from Figure 1 that, in the first phase, mapping from the symbolic domain to a numeric equivalent is performed. For the WAD-1 method, this mapping is based on the 7 physiochemical properties of amino acids; that is, every amino acid in each sequence is converted to a real number. With this operation, we obtain time-series data. We smoothen this signal by applying EMD. Let $\tilde{s} = (\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_N)$ indicate the smoothened signal, where $\tilde{s}_i$ indicates the smoothened signal value corresponding to the $i$th position in the sequence.
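The phase-1 symbolic-to-numeric mapping can be sketched as follows. The paper maps each residue via 7 physiochemical property tables [16–20]; as those exact tables are not reproduced here, this sketch uses the Kyte-Doolittle hydropathy scale as a stand-in hydrophobicity table, so the specific values are illustrative only.

```python
# Kyte-Doolittle hydropathy values, used as an illustrative property table.
KD_HYDROPATHY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def sequence_to_signal(seq, table=KD_HYDROPATHY):
    """Map an amino acid sequence to a time-series signal (phase 1)."""
    return [table[aa] for aa in seq]
```

For example, `sequence_to_signal("GTRITLEEITLERA")` converts the example PDZ sequence from Section 1 into a 14-point numeric signal, ready for EMD smoothing.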

We then applied a MODWT of level $J$ on $\tilde{s}$ as follows from [33]:

$$\widetilde{\mathbf{W}} = \widetilde{\mathcal{W}} \tilde{s}. \tag{5}$$

Here, $\widetilde{\mathcal{W}}$ is a nonorthogonal real matrix of dimension $(J+1)N \times N$. The MODWT coefficient vector $\widetilde{\mathbf{W}}$ from (5) can be resolved into subvectors as in [33]:

$$\widetilde{\mathbf{W}} = \left[ \widetilde{\mathbf{W}}_1, \widetilde{\mathbf{W}}_2, \ldots, \widetilde{\mathbf{W}}_J, \widetilde{\mathbf{V}}_J \right]^T. \tag{6}$$

Here, $\widetilde{\mathbf{W}}_j$ ($1 \le j \le J$) is the length-$N$ vector of wavelet coefficients associated with changes on a scale of length $\tau_j = 2^{j-1}$, and $\widetilde{\mathbf{V}}_J$ is the length-$N$ vector of scaling coefficients associated with averages on a scale of length $\lambda_J = 2^J$.

From [22, 34], it is evident that the MODWT transformation preserves the energy of the signal. The following equation from Gupta et al. [33] is shown here:

$$\|\tilde{s}\|^2 = \sum_{j=1}^{J} \|\widetilde{\mathbf{W}}_j\|^2 + \|\widetilde{\mathbf{V}}_J\|^2. \tag{7}$$

The variance of $\tilde{s}$ can be written similarly, as in Gupta et al. [33]:

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{j=1}^{J} \|\widetilde{\mathbf{W}}_j\|^2 + \frac{1}{N} \|\widetilde{\mathbf{V}}_J\|^2 - \bar{\tilde{s}}^2. \tag{8}$$

The feature vector generated through WAD-1 is influenced by the work of Gupta et al. on G protein coupled receptors (GPCRs) [33], and it is made up of the wavelet variances of several physiochemical properties of the amino acids in the PDZ sequence. For example, using (8), WAD-1 computes the part of the feature vector for a PDZ sequence based on EIIP values as

$$FV_{\text{EIIP}} = \left[ \hat{\nu}_1^2, \hat{\nu}_2^2, \ldots, \hat{\nu}_{J_0}^2 \right], \qquad \hat{\nu}_j^2 = \frac{1}{N} \|\widetilde{\mathbf{W}}_j\|^2.$$

Here, $J_0$ indicates the upper limit of the level of decomposition. The complete feature vector is obtained by concatenating 7 such feature vectors computed for the 7 physiochemical properties of amino acids.
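The per-level wavelet variance computation can be sketched with the MODWT pyramid algorithm. Assumptions: Haar filters replace the LA8 filter used in the study, circular boundary conditions are applied via `np.roll`, and the default `J0` here is an arbitrary illustrative choice.

```python
import numpy as np

def modwt_variances(x, J0=3):
    """Per-level MODWT wavelet variances nu_j^2 = ||W_j||^2 / N (sketch).

    Haar MODWT pyramid: at level j the scaling coefficients from level
    j-1 are combined with a circularly shifted copy (shift 2^(j-1)).
    """
    v = np.asarray(x, dtype=float)
    N = len(v)
    variances = []
    for j in range(1, J0 + 1):
        shift = 2 ** (j - 1)
        v_shift = np.roll(v, shift)      # circular boundary condition
        w = (v - v_shift) / 2.0          # wavelet coefficients W_j
        v = (v + v_shift) / 2.0          # scaling coefficients V_j
        variances.append(np.sum(w ** 2) / N)
    return variances
```

Concatenating such variance vectors over the 7 property signals yields the WAD-1 feature vector described above.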

In this study, we chose the least asymmetric filter (LA8) for the wavelet transformation phase because its filter width is short enough that any impact caused by the boundary conditions stays within the tolerance limit.

We have used ANOVA in the feature extraction phase as this analysis is useful in comparing multiple variables for statistical significance. Gupta et al. [33] have used this method to extract features from amino acid sequences in the classification study of GPCRs.

This feature extraction is the most important phase in the overall classification and analysis since it yields the feature vector (FV), whose dimension is a function of $J_0$, the number of levels chosen for decomposing the signal. For $J_0 = 5$ (as in our case), the dimension of the feature vector is only $7 \times 5 = 35$. This is a great reduction that improves the run time of the classification program. In the final phase, these FVs are fed to the classifier for prediction of the PDZ domain class.

For MODWT calculations in the 3rd phase, we have used the WMTSA wavelet toolkit [31] from MATLAB. With WAD-1 we are able to reduce the dimensionality of the feature space and at the same time retain the positional information of the amino acids in the original sequence. Table 1 clearly demonstrates the consistency of the new WAD-1 when compared with the popular MODWT.

#### 4. Improved Prediction with WAD-2

In this method, we extract the trigram frequencies from the amino acids as previously reported [13]. For an amino acid sequence such as EITLERG, it has the following trigrams: EIT, ITL, TLE, LER, and ERG.

We calculate the trigram frequencies of amino acids for every PDZ sequence in the following way.

(i) First, count the number of times a trigram appears in the sequence.

(ii) Then, divide this number by the total number of trigrams in the sequence, which is $(L - 2)$, where $L$ denotes the length of the PDZ sequence.

(iii) Repeat steps (i) and (ii) for every possible trigram in the PDZ sequence.
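The steps above can be sketched directly:

```python
from collections import Counter

def trigram_frequencies(seq):
    """Steps (i)-(iii): trigram counts divided by the (L - 2) total trigrams."""
    trigrams = [seq[i:i + 3] for i in range(len(seq) - 2)]
    total = len(trigrams)            # equals L - 2
    counts = Counter(trigrams)
    return {t: c / total for t, c in counts.items()}
```

For the example sequence EITLERG, this yields the five trigrams EIT, ITL, TLE, LER, and ERG, each with frequency 1/5.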

We reduce the dimensions of the features by applying the idea introduced by Kalyoncu et al. [13], where 20 amino acids are grouped into 7 distinct classes based on dipoles and volumes.
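A reduced-alphabet mapping of this kind can be sketched as follows. Note the assumption: the 7-class table below is the widely used conjoint-triad grouping by dipole and side-chain volume, used here for illustration; the exact grouping adopted in [13] may differ.

```python
# Illustrative 7-class grouping of the 20 amino acids by dipole and
# side-chain volume (conjoint-triad classes); treat as an assumption.
GROUPS = {
    1: "AGV", 2: "ILFP", 3: "YMTS", 4: "HNQW", 5: "RK", 6: "DE", 7: "C",
}
AA_TO_GROUP = {aa: g for g, aas in GROUPS.items() for aa in aas}

def reduce_alphabet(seq):
    """Rewrite a sequence over the 7-letter reduced alphabet,
    shrinking the trigram feature space from 20^3 to 7^3 = 343."""
    return "".join(str(AA_TO_GROUP[aa]) for aa in seq)
```

Trigram frequencies are then computed over the reduced sequence rather than the raw 20-letter sequence.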

The block diagram of this method is shown in Figure 2, and we call it WAD-2. It operates in 3 phases.

Once the symbolic to numeric mapping is done, we applied MODWT as in WAD-1 for feature extraction.

We have used “PCP: a program for supervised classification of gene expression profiles” [27] in our classification phases for both WAD-1 and WAD-2.

#### 5. Results and Discussions

In this paper, we show a proof-of-principle application of our algorithms. We evaluated our approaches with 2 different classifiers and also compared them to a wavelet-only method, MODWT. We observed that our results are comparable with the wavelet method; while our second method, WAD-2, has shown promising results, the first method, WAD-1, produces consistent results (see Table 1).

Figure 4 shows the signal based on EIIP values. We show the resulting signals decomposed up to 5 levels by our first method in Figure 5(a). The coefficients at higher levels capture a cumulative abstraction of the variations in the data over progressively wider regions of the wavelet scales; therefore, they depict features at broader time intervals as the level $j$ increases.

We show the feature vector obtained for a PDZ sequence by our first method based on EIIP, isoelectric, composition, polarity, volume, and molecular weight values in Figures 5(b)–5(g), respectively. It is evident from these figures that the wavelet variance diminishes to almost zero at higher levels (after the 3rd level) for almost all the physiochemical values. This clearly indicates that the wavelet variance vector has 3 principal components. The comparison between the two feature extractions is depicted in Figure 3, from which it is clear that our signals are smoother, with sharper peaks, when compared to the MODWT-only signals.

**(a) Feature vectors of both classes of PDZ computed using WAD-1 (EMD smoothening followed by MODWT)**

**(b) Feature vectors of both classes of PDZ computed using MODWT alone**

**(a) Wavelet coefficients of the mapped protein sequence up to level 5 for EIIP value**

**(b) Wavelet variance vector, for EIIP value, obtained for the input protein sequence**

**(c) Wavelet variance vector, for isoelectric value, obtained for the input protein sequence**

**(d) Wavelet variance vector, for composition value, obtained for the input protein sequence**

**(e) Wavelet variance vector, for polarity value, obtained for the input protein sequence**

**(f) Wavelet variance vector, for volume value, obtained for the input protein sequence**

**(g) Wavelet variance vector, for molecular weight value, obtained for the input protein sequence**

Table 1 shows the comparative predictive performance of WAD-1, WAD-2, and MODWT in terms of four statistical measures: sensitivity, specificity, negative predictive value, and positive predictive value.

Sensitivity and specificity [35] are statistical measures that help to evaluate the performance of a binary classification method. In order to understand their significance in our work, we need to comprehend a few more terms, namely true positive, false positive, true negative, and false negative, which are defined as follows.

(a) True positive: the sample belongs to Class 1, and the algorithm also recognizes it as Class 1.

(b) False positive: the sample belongs to Class 2, but the algorithm recognizes it as Class 1.

(c) True negative: the sample belongs to Class 2, and the algorithm also recognizes it as Class 2.

(d) False negative: the sample belongs to Class 1, but the algorithm recognizes it as Class 2.

Therefore, mathematically, sensitivity and specificity are defined as

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}.$$

According to Akobeng [36], sensitivity and specificity are important measures of the intrinsic validity of an assay but cannot be used to estimate the probability of the condition in an individual. Other measures, namely the positive and negative predictive values, reveal the probability of a prediction being correct.

The positive predictive value [37] can be described mathematically as

$$PPV = \frac{TP}{TP + FP}.$$

Here, $TP$ = number of predicted Class 1 samples that actually belong to Class 1, and $FP$ = number of predicted Class 1 samples that do not actually belong to Class 1.

Similarly, the negative predictive value [37] is defined mathematically as

$$NPV = \frac{TN}{TN + FN}.$$

Here, $TN$ = number of predicted Class 2 samples that actually belong to Class 2, and $FN$ = number of predicted Class 2 samples that do not actually belong to Class 2.

Both NPV and PPV are crucial estimates of the performance of a prediction algorithm, because PPV estimates the probability that a positive prediction is actually correct, and NPV estimates the probability that a negative prediction is actually correct.
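All four measures follow directly from the confusion-matrix counts and can be computed together:

```python
def prediction_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```

For example, with 9 true positives, 1 false positive, 8 true negatives, and 2 false negatives (hypothetical counts), `prediction_metrics(9, 1, 8, 2)` gives a PPV of 0.9 and an NPV of 0.8.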

In the context of medical diagnostics, if we consider Class 1 samples as positives and Class 2 samples as negatives, then a high negative predictive value means that a sample the method labels as negative (Class 2) very rarely turns out to be positive (Class 1). In this regard, our second method performs very well (see Table 1).

However, NPV does not consider cases where a positive sample is identified as a negative one; in our case, it does not give any idea of the chances of misclassifying a sample from Class 1 as Class 2.

Therefore, we calculated PPV which complements this drawback of NPV. Similar arguments also apply in case of predictive drawbacks of PPV.

From Table 1, we observe that WAD-1 performed considerably well and WAD-2 exceeded our expectations in recognition accuracy. Kalyoncu et al. [13] have done a sophisticated computational study on interaction prediction and classification of PDZ domains, where they achieved an excellent maximum sensitivity of 90.7%. With our WAD-2 method, we are able to achieve similar results, with a highest sensitivity of 91.67%.

#### 6. Conclusions

In this work, we have successfully classified PDZ domains of Classes I and II. We introduced a new method for feature extraction: EMD smoothing of signals followed by a MODWT transform. We further introduced the WAD-2 method based on the trigram frequencies of amino acids and found that the results improved in terms of sensitivity and accuracy. We note that our second method, WAD-2, performed better in recognition accuracy than the WAD-1 method. As mentioned earlier, our work in this paper is meant as a proof-of-principle application of our algorithms. We are enthusiastic that further improvement of our algorithms can lead to even better accuracy and prediction of the promiscuity of PDZ domains.

#### Abbreviations

EMD: Empirical mode decomposition

MODWT: Maximum overlap discrete wavelet transform

IMFs: Intrinsic mode functions

GPCRs: G protein coupled receptors

FV: Feature vector

ANOVA: Analysis of variance

PPV: Positive predictive value

NPV: Negative predictive value.

#### Acknowledgments

This study was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, Saudi Arabia, under Grant no. 903-004-D1434. The authors thank the DSR for the technical and financial support.