#### Abstract

Detailed information about the relationships between structures and properties/activities of peptides as drugs and nutrients is useful in the development of drugs and functional foods containing peptides as active compounds. The bitterness of the peptides is an undesirable property which should be reduced during drug/nutrient production, and quantitative structure bitter taste relationship (QSBR) studies can help researchers to design less bitter peptides with higher target efficiency. Calculated structural parameters were used to develop three different QSBR models (i.e., multiple linear regression, support vector machine, and artificial neural network) to predict the bitterness of 229 peptides (containing 2–12 amino acids, obtained from the literature). The developed models were validated using internal and external validation methods, and the prediction errors were checked using mean percentage deviation and absolute average error values. All developed models predicted the activities successfully (with prediction errors less than experimental error values), whereas the prediction errors for nonlinear methods were less than those for linear methods. The selected structural descriptors successfully differentiated between bitter and nonbitter peptides.

#### 1. Introduction

Proteins are made from peptide fragments that are well known for their nutrient, biological, and physiological roles in the human body. Peptides modulate the health-connected physiological process of the cardiovascular, nervous, immune, and nutritional systems [1]. The investigation of properties and activities of peptides as therapeutic, bioactive agents and nutrients as well as starting points for the development of drugs and drug-related compounds is one of the most interesting and demanding fields of food and drug sciences. It allows researchers to compile data sets on their structures and properties/activities. The results of these studies are useful in the development of functional foods containing peptides as active compounds and drugs [2]. Peptide bitterness is an undesirable property that is frequently generated during the enzymatic process to produce functional, bioactive protein hydrolysates or during the aging process in fermented food products [3]. Since many toxins are bitter, most mammalians including humans are instinctively averse to bitter-tasting substances in order to avoid toxin ingestion [4]. Most therapeutic peptides cannot be administered orally because of the poor biopharmaceutical performance of high-molecular-weight peptide drugs which is due to poor oral absorption, formulation stability, and degradation in the gastrointestinal tract. Studies on the origins of formulations and alternative administrations to overcome the mentioned problems have suggested different administration methods such as parenteral, oral, transdermal, nasal, pulmonary, rectal, ocular, buccal, and sublingual drug delivery systems [5–8]. Taste plays a crucial role in buccal and sublingual administration systems. Bitter taste properties in relation with the structure of the peptides in fermented food and protein hydrolyzates have been studied. Findings have shown that hydrophobicity is correlated with bitterness, and a hydrophobic interaction is needed for the bitter receptors (T_{2}Rs) to sense bitterness, whereas the amino acid sequence has no effect on bitterness [9, 10]. Moreover, introducing amino acids into the hydrophobic chain intensifies bitterness, and blocking both C and N terminals of peptides by acetylating increases bitterness about ten times [4]. It is now generally accepted that the side-chain hydrophobicity and the number of carbon atoms of the hydrophobic side chain of the peptide’s amino acids are correlated to bitterness rather than to overall hydrophobicity [4, 9, 11, 12]. In fact, the hydrophobic group of the side chain offers a binding site for the bitter taste receptor. Another binding site is a bulky basic group, including an *α*-amino group. To provide such features, hydrophobic amino acids at the C-terminal and basic amino acids at the N-terminal are necessary. The bitter peptides are composed of less than eight amino acids, and the bitterness increases as the number of amino acids increases. Peptides composed of eight or more amino acids do not differ in bitter potency, and they form a spherical shape rather than a helix conformation [4, 9].

Beyond structure-activity studies, researchers have tried to develop quantitative models to predict the bitterness of peptides [3] and to evaluate the effect of the primary structure on the potency of therapeutic peptides [1, 13]. These studies were aimed at understanding the peptide structures responsible for different activities (taste, ACE inhibition, antithrombotic, opioid and endopeptidase inhibitory, antimicrobial activities, and immune modulator) and to accelerate the studies about food-derived functional and bioactive peptides.

Different quantitative structure bitter taste relationship (QSBR) models using various descriptors have been developed for di- and tripeptides [1, 3, 13–16]; the study of tetra- and higher peptides is limited [3]. Some methods developed to predict the bitterness of dipeptides used amino acid-based descriptors, which resulted in models with acceptable prediction capabilities.

Yin et al. [17] reported 28 developed models in the year 2010 for modeling dipeptide bitterness in comparison with their own model which used E_{1}–E_{5} amino acid variables (hydrophobicity (E_{1}), steric properties or side chain bulk/molecular size (E_{2}), preferences for amino acids to occur in *α*-helices (E_{3}), composition (E_{4}), and the net charge (E_{5})) to develop QSBR models using support vector regression (SVM). Their model [17] successfully predicted () the bitterness of 48 dipeptides ( values calculated using of appendix). Table 1 (28 models taken from a reference [17] + 2 other models) summarizes the developed models for dipeptides along with their prediction errors (root mean square errors (RMSE) which were calculated using of the appendix).

Kim and Li-Chan (2006) [3] studied the QSBR of 224 di- to tetradecapeptides (2–14 amino acids peptides) using multiple linear regression (MLR) and partial least square (PLS) regression methods. They used total hydrophobicity, molecular mass (), and residual numbers to develop their regression models, and the amino acid -scores, namely, : hydrophobicity, : bulkiness/molecular size, and : electronic property, were calculated using the method developed by Hellberg et al. [20] (individually or in combination with the mentioned parameters) to develop a PLS model (Table 1, number 30). They concluded that the developed models for whole data sets and data subsets (comprising different peptide length) were significant and improved for subsets [3]. Moreover, the combination of -scores and studied variables improved the correlation of predicted and observed values for di- and tripeptides.

QSBR model development for three or more amino acid peptides has not been studied as well as that of dipeptides. This work aims to build suitable models for predicting the bitterness of 224 peptides and 5 amino acids on the basis of their calculated structural parameters. Structural parameters (0D–3D) were calculated using Dragon [28] software. The selected parameters were used to develop linear and nonlinear models, and the prediction capability and robustness of the proposed models were studied using internal and external validation methods according to the QSAR method validations guidelines [29] and references [30–33]. The impacts of the selected parameters and structural features on bitterness were evaluated according to the parameter definitions.

#### 2. Materials and Methods

##### 2.1. Data Set

A total of 229 experimental bitterness values (224 peptides and 5 amino acids) determined by human sensory evaluations were obtained from [3]. The details of peptide sequences and the bitterness activity expressed as a ( being the bitter threshold concentration ()) are summarized in Table 2. The ranges of bitterness were 1.0–5.7, and the numbers of amino acids of the studied peptides were 2–14.

##### 2.2. Calculated Descriptors

The 2D structures of all molecules were drawn and converted to 3D structures using the HyperChem 7 software. The completed model and the molecular mechanics energy minimized molecules were used as inputs for the Dragon 5.4 software [28]. The software calculated 20 subsets of molecular descriptors, including 2D autocorrelation, 3D-MoRSE descriptors, centered fragments, Burden eigenvalues, connectivity indices, constitutional descriptors, edge adjacency indices, eigenvalue-based indices, functional group counts, geometrical descriptors, GETAWAY descriptors, information indices, molecular properties, Randic molecular profiles, RDF descriptors, topological charge indices, topological descriptors, walk and path counts, and WHIM descriptors. The structural parameters calculated after discarding the constant and near-constant values (1295 descriptors) were saved and further analyzed using the SPSS 11.5, STATISTICA 7 and MATLAB 7.8 software.

##### 2.3. Outlier Detection

In order to identify possible outliers, two different methods of principal component analysis (PCA) mapping and standard scores were used. According to the nature of the outliers, which could be related to the different mechanism of binding because of the different structural features, recording errors, or inaccurate design of samples, no single outlier detection approach could identify all kinds of outliers [34]. In this study, two approaches, one in descriptor space (PCA plot of scores) and the second in response space (standard score), were used to check the outliers before the main numerical analysis [35].

##### 2.4. Training-Test Selection

The training set plays an important role in developing the properties of the model, as the more similar the molecules for training the model are, the more accurate the expected results are. Thus, the selection of the training and test sets is one of the most important steps in model development and should be done so that each set reflects the original data set as much as possible. -means clustering is one of the most frequently used methods of data set splitting. This procedure identifies relatively homogeneous groups of molecules (each observation has the nearest distance to the mean of the cluster) based on selected properties (biologic activities and structural parameters). Using this method, we divided the data set into 10 clusters, and three subsets (i.e., training, test, and validation sets) were selected from them. The validation set was excluded from the study before the descriptor selection step, and the test set was excluded before the model development step.

##### 2.5. Descriptor Selection

In order to reduce the dimension of the variable matrix, a correlation analysis was carried out, and both the highly correlated and constant variables were excluded. A home-developed MATLAB toolbox was employed that considered the following criteria for excluding a variable: [1] the intercorrelation with other descriptors higher than 0.99; [2] the correlation with activity lower than intercorrelated descriptors; and [3] the frequency of repeated values for a descriptor (lower than 10% of cases).

After this step, a genetic algorithm-partial least square (GA-PLS) algorithm [36] was used to select the most significant variables. GA-PLS (a combination of genetic algorithm and PLS regression) was also developed and utilized as a variable selection method in QSAR and QSPR studies by Leardi [36] and applied for QSAR studies [37, 38]. The MATLAB 7.8 software was used to run the GA-PLS method developed by Leardi. The variables were divided into subgroups (containing up to 200 variables), and each subgroup with the corresponding values () was introduced to the algorithm as input. The output (containing descriptors scoring based on the cross-validated percentage of explained variance) was produced after 100 runs. As the results of the GA-PLS were a bit different for each run, the top 50% scores of each subgroup were combined with the next 100 variables, and this step was repeated up to the final step. A comparative study was done in this study to check the capability of GA-PLS for descriptor selection in comparison with common methods (i.e., PLS and stepwise). Results showed that GA-PLS was able to extract more significant descriptors. The only drawback of this method was the scoring differences between different runs, which could be solved using the mentioned procedure (i.e., selection of 50% of each run instead of limited descriptors [37]).

Twenty percent of high score variables of the final step were selected as the most significant variables and were further studied by stepwise regression and bivariate correlation analysis in which the selected variables using stepwise regression were investigated concerning their intercorrelations. The less intercorrelated variables were selected for the final model development process. The complete procedure of descriptor selection is summarized in Figure 1.

##### 2.6. Model Building

Linear and nonlinear models were developed using the selected descriptors. The details of the model development and validation process are discussed in the next sections.

##### 2.7. Linear Model Using Multiple Linear Regression (MLR)

The selected parameters were used for developing QSBR equations using the multiple linear regression correlation (MLR) method, and the goodness of fit and statistical significance of the models were evaluated using (coefficient of determination), (variance ratio), and the MPD (mean percentage deviation) values calculated using:

The relative frequency of individual percentage deviations (IPD) was studied in order to define the model prediction capacity for each data point. The IPD was calculated using

In order to compare the MPD values with the relative standard deviations (RSD) between experimental bitter values measured by different research groups (inter laboratory relative standard deviations (ILRSD)), the ILRSD values for the available data were computed using

##### 2.8. Nonlinear Models Using Support Vector Regression (SVR) and Artificial Neural Network (ANN)

In the next stage, the selected descriptors were used to derive non-linear models using SVR and ANN. To construct an ANN, the Levenberg-Marquardt algorithm [39] was used by nftool toolbox of MATLAB 7.8 software to train the network. Selected descriptors and the bitter activity of training sets were introduced as input and output values, respectively. The training set was randomly classified for training, validation, and test sets in order to avoid overfitting, and then networks were trained. SVR, another non-linear model, constructs a hyperplane in a multidimensional space that provides minimum error by employing a non-linear Kernel function. Some parameters such as capacity parameter (); are related to type of noise in the data, and is related to radial base function (RBF), which is the most common type of Kernel functions. The SVM model and optimization of parameters were done using STATISTICA 7 software.

##### 2.9. Model Validation

The developed models were evaluated using the leave-many-out (LMO) cross-validation method. The values were calculated using of the appendix.

##### 2.10. Chance Correlation

To check the possibility of a chance correlation, 10 times shuffled activities were correlated to the variables, and the produced regression coefficients were compared with the developed model regression coefficient.

##### 2.11. External Validation of Proposed Models Using the External Test Set

A set of statistical criteria [30, 31] was employed to analyze 36 data points of the test set:(1),where is the coefficient of determination between the predicted and observed values.(2) and or and ,where is the coefficient of determination obtained using predicted values relative to a regression line fit for experimental values and required to pass through the origin, and is the corresponding coefficient obtained using experimental values relative to a regression line fit for predicted values and required to pass through the origin. and are the slopes of regression lines through the origin for fits for experimental and predicted data, respectively.(3).

In addition, another criterion proposed by P. P. Roy and K. Roy [32] was considered as(4),in which indicates the good external predictability of the QSAR models.

#### 3. Results and Discussion

##### 3.1. Outlier Detection

According to standard score analysis, there is no outlier. The PCA map of scores showed that most of the data points fell into the acceptable data space, and there is no outlier in the studied data set.

##### 3.2. Training-Test Selection

The selected training, test, and validation data sets are listed in Table 2 along with the computed accuracy criteria. The activity ranges for training, test, and validation sets were 1.00–5.40, 1.16–5.70, and 1.49–5.40, respectively. The number of data points for the training, test, and validation sets was 181, 36, and 10, respectively.

##### 3.3. Descriptor Selection

A total of 1292 descriptors were calculated using Dragon 5.4 software. This number was reduced to 244 descriptors after correlation analysis was performed using a home-developed toolbox. The remaining descriptors were analyzed using the GA-PLS method, and the best scoring descriptors (descriptors included in Table 3) were selected for further analysis using stepwise regression and bivariate cross-correlation studies (the intercorrelation between the selected descriptors was less than 0.9). After excluding the nonsignificant descriptors, the remaining descriptors were checked to find the possible intercorrelation, and the final descriptors were selected using a stepwise regression method (see Table 4). The final six selected descriptors were again checked for intercorrelation. Six descriptors are suitable for the construction of a QSAR model [40].

##### 3.4. Evaluation of the Selected Descriptors

SPAN belongs to the size descriptors which evaluate the dimension of the molecule and often calculate it from the molecular geometry. This descriptor is a suitable size descriptor for macromolecules [41], and its selection for the studied data set with the developed method showed the method to be suitable for descriptor selection. The correlation study results (Figure 2) showed that bitterness increased upon the enhancement of the SPAN (Figure 2(a)). This finding is in agreement with previous findings [1–13] which suggested a significant correlation between peptide size and bitterness. The investigation of the correlation of peptide subsets with this descriptor showed less correlations in comparison with the whole dataset. Therefore, SPAN can be regarded as a general descriptor rather than a specific one for peptide subsets. Thus, SPAN can be used in the primary steps of peptide bioactive compound discovery to recognize bitter compound fractions.

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

Mean square distance (MSD) index (Balaban) [28, 41] contributes negatively to bitterness (Figure 2(b)). MSD decreases with the increasing of molecular branching in an isomeric set [41]. Furthermore, it decreases with an increase in the number of atoms. This is in agreement with the findings on the increase in bitterness following an increase in the number of amino acids in the studied peptides. The lower correlation coefficient between peptide subsets by MSD in comparison with the total dataset showed that MSD is a general bitterness indicator rather than a subset bitterness descriptor.

Bitterness increases with the enhancement of E3s (Figure 2(c)). E3s is one of the accessibility directional WHIM indices which is weighed by atomic electrotopological states [28, 41]. Indeed, this descriptor defines size, shape, polarizability, and the conformational properties of the studied molecules together. Its correlation with the bitterness is weaker than MSD and SPAN descriptors, but its removal from the model decreases the correlation coefficient significantly. It is not a strong identifier for utilization in drug development processes.

G3p, the 3rd-component symmetry directional WHIM index (weighed by polarizability) [28, 41], showed a negative correlation with bitterness (Figure 2(d)). Along with E3s, these descriptors contribute to the electrical properties of the molecule. G3p decreases with the increase in the number of amino acids. In fact, more bitter compounds possess larger G3p values. Similar to the SPAN and MSD descriptors, the correlations with subsets are less than the general dataset, and this descriptor can be regarded as a general identifier. HATS8u (Figure 2(e)) belongs to the GETAWAY descriptors. These descriptors are molecular descriptors derived from the molecular influence matrix (MIM) [12, 28]. HATS, indices known as spatial autocorrelation based descriptors, encode information nonstructural fragments [28, 41]. They are suitable for describing differences in a congeneric series of molecules. The effective position of substituent and fragments in molecular space and information about the molecular size and shape can be encoded by unweighted HATS indices. Moreover, they are independent of molecule alignment.

In summary, GETAWAY descriptors encoded local information while WHIM descriptors related to the holistic information of the molecules; thus, their joint uses are advised. HATS8u decreases with an increase in molecular size. The negative relation of this descriptor with bitterness is in agreement with the findings about the size and bitterness relation.

3D-MoRSE descriptors (3D molecule representation of structures based on electron diffraction) are derived from infrared spectra simulation using a generalized scattering function [28]. Mor11v, weighed by van der Waals volume, belongs to these descriptors which can be regarded as an indicator of size, mass, and volume of the molecules. By decreasing the size of the peptide, the Mor11v values tend toward zero (Figure 2(f)). In fact, the absolute values of Mor11v decrease by the decrease in size of the molecule. The absolute quantity of it worsens the model, and it should be used in the model in the present form including negative and positive values for different peptides. The only similar trend was observed for alogp values of the peptides. The negative Mor11v values belong to the molecules with negative alogp values (less hydrophobic peptides). 3D-MoRSE descriptors cannot encode the lipophilicity of the molecules. The overall correlation of bitterness to this descriptor is positive [28].

##### 3.5. Model Building Using Linear Model (MLR)

The selected descriptors were used to develop six MLR equations containing 1–6 descriptors. The adjusted showed that all descriptors improved the model fitting, the values accurately depicted the fitting improvement, and the best-fitted model could be rewritten as follows: where S.E.P. stands for standard error of prediction and ND shows the number of data points. The MPD for test and validation sets were 18.1 (±15.5) () and 16.9 (±12.6) (), respectively. The relative frequency analysis of the prediction errors showed that more than 50% of data can be predicted by the prediction error of less than 15%, which is acceptable for biological measurements, where the mean ILRSD for bitter activities of 19 dipeptides were measured by different research groups and were 12.1 (±10.7)%. In addition, the IPD frequency trend (Figure 3) is similar for training and test sets. The MPD for the peptides by values less than 2.0 was 24.3 (±22.2) and for the peptides with more than 2.0, it was 11.1 (±7.9). The highest IPD value (100.0%) in the training set was calculated for GR (), and in the test set, it was calculated for PGR (IPD = 65.4% and ). In other words, the developed model was not able to predict the bitterness of less bitter peptides, and the selected descriptors are specific for the evaluation of the bitter peptides’ interactions with the receptor.

**(a)**

**(b)**

**(c)**

LOO cross-validation was done using the PLS toolbox of MATLAB software. The value and RMSE were 0.76 and 0.44, respectively. The values for LMO cross-validation are reported in Table 3. The ranges of values were 0.61–0.88, which was in an acceptable range compared with values. The ranges of corresponding and adjusted for the desired subsets were 0.80–0.82 and 0.79–0.82, respectively (see Table 5). Chance correlation ( randomization) analysis was done using 10 times shuffled bitter activity, and the results ( = 0.01–0.10) rejected the possibility of fortune correlation (see Table 6).

##### 3.6. Model Building Using Nonlinear Models (ANN and SVM) and Comparison with the Linear Model (MLR)

The six selected descriptors were introduced to ANN as input values and the bitter activity as output data, and the networks were developed using the Levenberg-Marquardt algorithm [38]. The number of hidden layers was three. In addition, SVM models were developed using STATISTICA 7 software. Using the training set, three parameters of SVM were optimized by 10-fold cross-validation. The optimized values of , , and were 91, 0.07, and 0.06, respectively. The MPD values of the proposed models for training, test, and validation sets were 13.8, 16.8, and 16.9 for MLR; 13.0, 16.0, and 15.0 for SVM; and 13.1, 16.1, and 14.8 for ANN, all of which are in acceptable ranges, and there is no significant difference between these subsets. The IPD frequencies (Figure 3) revealed that both SVM and ANN methods produced more accurate results, especially for the test sets. The plots of experimental versus predicted values (Figure 4) confirmed the discussed results.

##### 3.7. External Validation of the Proposed Models

Statistical criteria of external test sets are depicted in Table 7. The results show that the proposed models using linear and nonlinear models passed the proposed statistical criteria in the literature, and they are robust and valid for external prediction.

##### 3.8. Comparison with Previous Model

The only similar study was published by Kim and Li-Chan [3]. They used the amino acid three -scores along with three parameters (total hydrophobicity, residue number, and mass value) to develop a PLS model. A comparison of the newly developed model with their model is summarized in Table 8. It should be noted that the aim of this paper was to develop a general model rather than different models for different subsets, and reported correlation coefficients belong to the general model which was computed for subsets. The comparison of the corresponding values showed that the developed model could represent the bitter activity variance of three and more peptides better than the PLS model, while for dipeptides the PLS model produced more accurate results. Developed SVM and ANN models resulted in more accurate results for the total data set compared with both the PLS and MLR models (Tables 1 and 5; Figure 4).

#### 4. Conclusion

General MLR, SVM, and ANN models were developed to predict the bitterness of 229 peptides and amino acids. The capability of the MLR model to reveal the impact of each descriptor on bitter activity was its main advantage, where more accurate predictions by SVM and ANN made them suitable models for precise predictions. Obviously, individual models (i.e., models developed for each peptide subset) produced less prediction errors, but considering the convenience of application of the general models, such models are preferred during the primary stages of peptide production and evaluation. The developed models can be used in nutraceutical and pharmaceutical industries.

#### Appendix

where denotes the number of data points and and are the predicted and observed ( is the bitter threshold concentration ()) values where is sum of squares, is mean square, . is regression, . is residual and is cross validated

#### Conflict of Interests

The authors certify that there is no financial relation between them and the mentioned commercial identities in the paper.

#### Acknowledgments

The authors thank the Biotechnology Research Center, Tabriz University of Medical Sciences, for a partial financial support under Grant no. 5/85/765.