Advances in Bioinformatics

Volume 2015, Article ID 909765, 10 pages

http://dx.doi.org/10.1155/2015/909765

## CISAPS: Complex Informational Spectrum for the Analysis of Protein Sequences

^{1}Department of Genetics, University of Leicester, University Road, Leicester LE1 7RH, UK

^{2}Department of Computer Science and Digital Technologies, Faculty of Engineering and Environment, The University of Northumbria at Newcastle, Newcastle-upon-Tyne NE1 8ST, UK

^{3}Department of Computer Engineering, Yildiz Technical University, 34220 Istanbul, Turkey

Received 28 July 2014; Revised 27 November 2014; Accepted 4 December 2014

Academic Editor: Tatsuya Akutsu

Copyright © 2015 Charalambos Chrysostomou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Complex informational spectrum analysis for protein sequences (CISAPS) and its web-based server are developed and presented. Recent studies show that using only the absolute spectrum in informational spectrum analysis of protein sequences is insufficient. CISAPS therefore provides results in three forms: the absolute, real, and imaginary spectra. In the case study presented here, the analysis of influenza A subtypes, biologically related features can also appear individually in either the real or the imaginary spectrum. The results show that protein classes can exhibit similarities or differences according to the features extracted by the CISAPS web server. These associations are likely to be related to the protein property that the specific amino acid index represents. In addition, various technical issues that may affect the analysis, such as zero-padding and windowing, are addressed. To perform the analysis, CISAPS uses an expanded list of 611 unique amino acid indices, each representing a different property. This web-based server enables researchers with little knowledge of signal processing methods to apply complex informational spectrum analysis in their work.

#### 1. Introduction

If a protein’s biological function is considered to be controlled by its selective ability to interact with particular elements in its environment, the following question arises: how is this selective ability achieved? Several attempts have been made to decode, directly from the primary structure of a protein sequence, the characteristic features that help drive the biological functions of proteins. One common method of analysing protein sequences to determine biological functions is based on searching for similarities in arrangement between groups of sequences. One example is the basic local alignment search tool (BLAST) [1]. Another method of analysing macromolecule sequences is to extract structural and physicochemical features, such as amino acid composition and dipeptide composition, derived from the primary structure of a protein sequence. These features can be used for various purposes, including the prediction of protein structural classes [2, 3], functional classes [4, 5], and protein-protein interactions [6, 7].

In recent years, signal processing techniques have been used in bioinformatics to extract information that is expected to reveal a protein’s biological function [8–11]. One of the methods that use the discrete Fourier transform (DFT) is informational spectrum analysis (ISA) [12, 13]. In previous applications of ISA [12, 13], each group of proteins analysed corresponded to specific peaks in the frequency spectrum. Every biological function corresponds to one unique peak or a set of unique peaks. The importance of this general conclusion is that specific biological functions can be extracted from protein sequences using signal processing techniques, by identifying significant frequency features that are not found in unrelated sequences. However, complementary information such as the real and imaginary frequency spectra can also be derived from the DFT; such spectra have been used successfully in various areas including biomedicine [14] but had not previously been explored in the analysis of protein sequences. A new method, the complex informational spectrum [15], was therefore proposed and developed; it considers all three frequency spectra when analysing protein sequences, in order to identify new and complementary information related to the functional properties of the proteins under investigation.

In the traditional approach, due to the complex nature of proteins and their functional groups, using only the absolute spectrum in informational spectrum analysis of protein sequences is insufficient, as biologically related features can be more distinct in either the real or the imaginary spectrum. ISA and the resonant recognition model (RRM) [19] have already been applied in the literature to tasks such as the development of new drugs [16], the identification of important protein sequence domains [17], and the investigation of protein sequence interactions [18]. Complex informational spectrum analysis (CISA) [15] is also applicable to these tasks and can contribute additional information.

To be able to proceed with current signal processing techniques, a set of numerical values must be assigned to nucleotides or amino acids [20]. These values should represent natural biological characteristics of the macromolecules with which they are paired and be relevant to the biological activity of each macromolecule. These values can be any of the biochemical properties such as electron-ion interaction potential (EIIP) [21, 22], hydrophobicity [21, 23], solubility [21, 23], or molecular weight [21, 23].

In this paper we introduce the CISAPS (complex informational spectrum for the analysis of protein sequences) web server, which can be freely accessed to extract features of proteins from their amino acid sequences using CISA. This is further supported by an expanded set of amino acid indices (AAI). An application of CISA to the influenza virus is also presented as a case study in order to show the usefulness and robustness of the method developed.

#### 2. Methods and Materials

##### 2.1. Signal Processing for Protein Sequence Analysis

The goal of using digital signal processing techniques is to extract information that can be related to the biological functions of proteins. Various signal processing methods have been used in bioinformatics for analysing protein sequences in recent years; one of the most common is informational spectrum analysis (ISA) [12, 13]. To implement ISA for protein sequences, each amino acid in a sequence is first expressed as a numerical value using one of various AAI, and the discrete Fourier transform (DFT) is then applied to the resulting numerical sequence. A special case of ISA is the resonant recognition model [12, 13, 22], where the EIIP AAI [22] is used to encode alphabetical protein sequences into numerical sequences. ISA reveals that common peaks appear in the informational spectrum of related protein sequences, whereas they do not appear in functionally unrelated sequences, and this is directly related to the biological property of the AAI used. In previous studies, ISA used the DFT to extract parameters from the absolute spectrum only. However, the complex output of the DFT (the real and imaginary frequency spectra) has been shown to provide complementary information in various fields, such as Doppler ultrasound in medicine [14], polar solvation dynamics in the femtosecond evolution [24], time-domain sum-frequency generation spectroscopy using midinfrared pulse shaping [25], and the orientation and charge of water at the hydrophobic oil droplet-water interface [26].

To the best of our knowledge, the concept of complex signal processing has not been explored for the analysis of protein sequences. This paper is therefore concerned, for the first time, with the development of complex informational spectrum analysis (CISA) for groups of proteins using their sequence information. The study aims to derive the absolute, real, and imaginary spectra from the DFT for a given set of proteins. These spectra are then used to extract characteristic frequency parameters for the group of proteins under study, which can in turn be used to characterise and classify protein sequences. So that researchers can apply the method to their own sets of proteins without any knowledge of signal processing or complex signal processing concepts, a freely accessible web server (the CISAPS web server) is also developed and presented.

##### 2.2. Amino Acid Indices

Protein sequences in the literature are generally expressed using 20 alphabetical characters, each corresponding to a specific amino acid. To apply signal processing methods, protein sequences need to be encoded into numerical sequences. This can be achieved using AAI, where each of the 20 amino acids is assigned a specific numerical value. For the analysis, the CISAPS server uses 611 unique AAI, representing different biochemical properties of proteins, to encode protein sequences. A list of all the indices can be retrieved from the CISAPS web server. Of these indices, 528 were extracted from the AAindex database [20] after manually removing duplicate entries. The remaining 83 of the 611 AAI used in the CISAPS server were retrieved from the literature; details can be found in Supplement 1 in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/909765 and on the web server (http://sproteomics.com/cisaps/default/indices).

As the AAI originate from different sources in the literature, the $z$-score [27] is used to normalise each index using

$$z = \frac{x - \mu}{\sigma}, \tag{1}$$

where $x$, $\mu$, and $\sigma$ correspond to the index value, mean value, and standard deviation, respectively, for a particular index.
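The encoding and normalisation steps described above can be sketched as follows. This is a minimal illustration, not the CISAPS implementation: the Kyte-Doolittle hydropathy scale is chosen here only as an example index, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and the function names and the short test sequence are hypothetical.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values, used here only as an example index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def zscore_index(index):
    """Normalise an amino acid index to zero mean and unit standard deviation."""
    values = list(index.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return {aa: (value - mu) / sigma for aa, value in index.items()}


def encode(sequence, index):
    """Map each residue of a protein sequence to its (normalised) index value."""
    return [index[aa] for aa in sequence.upper()]


normalised = zscore_index(KYTE_DOOLITTLE)
signal = encode("MKTIIALSYI", normalised)  # arbitrary short example sequence
```

Normalising the index once, before encoding, means that every sequence encoded with it is on the same scale regardless of the units of the original index.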

##### 2.3. Preprocessing Protein Sequences

Before the complex informational spectrum analysis is applied to the numerical sequences, which have now become signals, these signals need to be preprocessed so that the signal processing methods can be applied and yield better results. Recent studies [28] have shown that zero-padding and windowing can enhance the features extracted from protein sequences. Therefore, both techniques described in this section are applied to the complete protein sequences.

The first technique is windowing, where the encoded numerical sequences are multiplied by a precalculated window to reduce spectral leakage. Windowing has been shown to reduce or even eliminate spectral leakage in various applications such as harmonic analysis [29] and phase estimation [30] where frequency analysis and DFT were used. In this case, CISAPS uses the Hamming window [31], which can be calculated using (2). The Hamming window is used as it is a widely used and accepted window function [32]:

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1, \tag{2}$$

where $N$ is the length of the window.
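As a rough illustration of the windowing step, the sketch below builds a Hamming window and multiplies an encoded sequence by it. It assumes the standard symmetric form of the window, with coefficients 0.54 and 0.46 and a denominator of one less than the window length; the function names are hypothetical and the exact variant used by CISAPS is not stated in the text.

```python
import math


def hamming(length):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1)) for n = 0..length-1."""
    if length == 1:
        return [1.0]  # a single-point window is defined as 1 to avoid dividing by zero
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(signal):
    """Multiply a numerical sequence element-wise by a Hamming window of equal length."""
    window = hamming(len(signal))
    return [x * w for x, w in zip(signal, window)]
```

The window tapers both ends of the sequence towards 0.08 of their original value while leaving the centre almost unchanged, which is what reduces leakage from the abrupt start and end of the signal.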

The second technique used is zero-padding in which a specified number of zero elements are added to the end of each sequence to increase signal length. This technique is essential for CISA as the given protein sequences may not be of the same length. In order to achieve zero-padding, CISAPS server gives two options to the user for analysing a given set of proteins. The first option is to set the resolution directly to the maximum allowed length of any given protein which is 4096 and the second is to set the DFT resolution at the greatest length of the protein sequences given by the user.
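The two zero-padding options can be sketched as below. The constant 4096 comes from the text; the function and parameter names are hypothetical, and raising an error for over-long sequences is an assumption about how such input would be handled.

```python
MAX_LENGTH = 4096  # maximum allowed protein length stated in the text


def pad_to(signal, length):
    """Append zeros to the end of a signal until it has the requested length."""
    if len(signal) > length:
        raise ValueError("signal is longer than the requested length")
    return list(signal) + [0.0] * (length - len(signal))


def pad_group(signals, use_max_resolution=False):
    """Pad every signal in a group to a common length.

    use_max_resolution=True  -> pad to MAX_LENGTH (first option in the text)
    use_max_resolution=False -> pad to the longest signal in the group (second option)
    """
    target = MAX_LENGTH if use_max_resolution else max(len(s) for s in signals)
    return [pad_to(s, target) for s in signals]
```

Padding every sequence to the same length gives every spectrum the same number of frequency points, so spectra from different proteins can be compared point by point.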

##### 2.4. Complex Informational Spectrum Analysis

The discrete Fourier transform (DFT) is defined as follows:

$$X(n) = \sum_{m=0}^{N-1} x(m)\, e^{-j 2\pi n m / N}, \quad n = 0, 1, \ldots, N - 1, \tag{3}$$

where $x(m)$ is the $m$th member of the numerical series, $N$ is the total number of points in the series, and $X(n)$ are the coefficients of the DFT. As the DFT coefficients consist of two mirror parts, only the first half of the series points will be hereafter considered. The following formula determines the maximal frequency in the spectrum:

$$F = \frac{1}{2d}, \tag{4}$$

where $F$ is the maximal frequency of all the signals (protein sequences) and $d$ is the distance between points of the sequence.

If it is assumed that all points of the sequence are equidistant with distance $d = 1$, then the maximum frequency in the spectrum can be found as $F = 1/(2 \cdot 1) = 0.5$. This shows that the frequency range does not depend on the number of points in the sequence; the number of points affects only the resolution of the spectrum. The output of the DFT is a complex sequence and can be represented as follows:

$$X(n) = R(n) + jI(n), \tag{5}$$

where $R(n)$ and $I(n)$ are the real and imaginary parts of the sequence, respectively.
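The DFT and the frequency axis discussed above can be sketched with the standard library alone. This is a direct, unoptimised evaluation of the textbook DFT sum with zero-based indexing; keeping the coefficients from index 0 up to and including half the sequence length is an assumption about what "the first half" means, and the function names are hypothetical.

```python
import cmath


def dft_half(signal):
    """Return the DFT coefficients X(n) for n = 0 .. N//2 (the non-mirrored half)."""
    N = len(signal)
    coefficients = []
    for n in range(N // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * cmath.pi * n * m / N)
            for m, x in enumerate(signal)
        )
        coefficients.append(total)
    return coefficients


def frequencies(N, d=1.0):
    """Frequency of each retained coefficient; the largest is 1/(2d) when N is even."""
    return [n / (N * d) for n in range(N // 2 + 1)]


X = dft_half([1.0, -1.0, 1.0, -1.0])  # alternating signal: all energy at frequency 0.5
freqs = frequencies(4)                # [0.0, 0.25, 0.5]
```

The frequency axis always spans 0 to 0.5 for unit spacing, whatever the sequence length; a longer (or zero-padded) sequence only places more points within that range. For real workloads a fast Fourier transform routine such as `numpy.fft.fft` would normally replace the explicit sum.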

The aim of this method is to determine a characteristic frequency peak (CFP) using the informational spectrum for each spectrum (absolute, real, and imaginary) that is expected to correlate with a biological function expressed by a group of protein sequences. To determine such a parameter, it is necessary to find common characteristics of the sequences with the same biological function. The absolute, real, and imaginary informational spectrum can be formulated as follows.

Absolute spectrum:

$$S_A(n) = X(n)X^{*}(n) = |X(n)|^{2}, \tag{6}$$

where $S_A(n)$ is the absolute spectrum for a specific protein, $X(n)$ are the DFT coefficients of the series $x(m)$, and $X^{*}(n)$ are the complex conjugate.

Real spectrum:

$$S_R(n) = R(n), \tag{7}$$

where $S_R(n)$ is the real spectrum for a specific protein and $R(n)$ are the real parts of DFT coefficients $X(n)$.

Imaginary spectrum:

$$S_I(n) = I(n), \tag{8}$$

where $S_I(n)$ is the imaginary spectrum for a specific protein and $I(n)$ are the imaginary parts of DFT coefficients $X(n)$.

Complex informational spectrum:

$$C_A(n) = \prod_{i=1}^{M} S_{A,i}(n), \quad C_R(n) = \prod_{i=1}^{M} S_{R,i}(n), \quad C_I(n) = \prod_{i=1}^{M} S_{I,i}(n), \tag{9}$$

where $C_A$, $C_R$, and $C_I$ are the absolute, real, and imaginary informational spectrum, respectively, and $M$ is the number of protein sequences used for a specific class of proteins.
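The three per-protein spectra and their combination over a group can be sketched as follows. The combination step is an assumption: it uses the pointwise product across proteins, as in the cross-spectral function commonly used with ISA and RRM, and should not be read as a confirmed description of the CISAPS server. All names are hypothetical, and the two short input signals stand in for encoded, windowed, and zero-padded sequences of equal length.

```python
import cmath
import math


def dft_half(signal):
    """DFT coefficients X(n) for n = 0 .. N//2, evaluated directly from the sum."""
    N = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * n * m / N) for m, x in enumerate(signal))
        for n in range(N // 2 + 1)
    ]


def spectra(signal):
    """Return the (absolute, real, imaginary) spectra of one encoded sequence."""
    X = dft_half(signal)
    absolute = [(c * c.conjugate()).real for c in X]  # X(n) X*(n) = |X(n)|^2
    real = [c.real for c in X]
    imaginary = [c.imag for c in X]
    return absolute, real, imaginary


def consensus(per_protein_spectra):
    """Assumption: combine one kind of spectrum across proteins by pointwise product."""
    return [math.prod(values) for values in zip(*per_protein_spectra)]


# Hypothetical group of two already-preprocessed signals of equal length.
group = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, -1.0, 0.0]]
per_protein = [spectra(s) for s in group]
C_abs = consensus([p[0] for p in per_protein])
C_real = consensus([p[1] for p in per_protein])
C_imag = consensus([p[2] for p in per_protein])
```

Under this product rule, a frequency only stands out in the group spectrum if it is prominent in every member, which matches the idea of a peak shared by functionally related sequences. Unlike the absolute spectrum, the real and imaginary spectra keep their sign, so their products can be negative.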

Equation (10) is used to scale the absolute, real, and imaginary informational spectra, taking into account the number of points in each spectrum.

The CFP obtained from the CISA can be used to characterise a group of proteins and distinguish it from other groups. However, the following conditions should be fulfilled for the CFP to be related to a biological function.

(1) Only one CFP should exist for a group of protein sequences that share the same biological function.

(2) For different biological functions the CFP is expected to be different.
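One simple way to read a CFP candidate off a spectrum is to take the frequency at which the magnitude is largest. The sketch below does only that; the selection rule is an assumption made for illustration, since the text does not specify how peaks are picked, and the function name and example values are hypothetical.

```python
def characteristic_peak(spectrum, freqs):
    """Return (frequency, value) at the largest-magnitude point of a spectrum.

    The magnitude is used so that negative peaks in the real or imaginary
    spectrum are treated in the same way as positive ones.
    """
    if len(spectrum) != len(freqs):
        raise ValueError("spectrum and freqs must have the same length")
    k = max(range(len(spectrum)), key=lambda n: abs(spectrum[n]))
    return freqs[k], spectrum[k]


# Hypothetical three-point spectrum with its frequency axis.
peak_frequency, peak_value = characteristic_peak([0.1, -3.0, 2.0], [0.0, 0.25, 0.5])
```

Comparing the peak frequencies obtained for two groups of proteins is then a direct check of the second condition above: groups with different biological functions are expected to yield different values.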

In the traditional approach, due to the complex nature of proteins and their functional groups, using only the absolute spectrum in informational spectrum analysis of protein sequences is insufficient, as biologically related features can be more distinct in either the real or the imaginary spectrum. CISA is therefore also applicable to the applications of ISA and RRM already reported in the literature and can contribute additional information to them.

#### 3. Web Server Access

The CISAPS web server is available at http://sproteomics.com/cisaps. As seen in Figure 1, the user can input the required information for the analysis using the input form.