Abstract

Synonymous codon usage bias is a widespread phenomenon in organismic taxa across the three domains of life. Although the frequency of codon usage differs across species and among genes within the genome of a single species, the phenomenon is nonrandom and tissue-specific. Several factors such as GC content, nucleotide distribution, protein hydropathy, protein secondary structure, and translational selection are reported to contribute to codon usage preference. Synonymous codon usage patterns can help reveal the expression pattern of genes as well as the evolutionary relationship between sequences. In this study, synonymous codon usage bias patterns were determined for the evolutionarily close proteins of the albumin superfamily, namely, albumin, α-fetoprotein, afamin, and vitamin D-binding protein. Our study demonstrated that the genes of the four albumin superfamily members have low GC content and high values of the effective number of codons (ENC), suggesting high expressivity of these genes and little bias in codon usage preferences. This study also provided evidence that the albumin superfamily members are not subjected to mutational pressure.

1. Introduction

Amino acids, the monomeric units of proteins, are encoded by triplets of nucleotides called codons. Most amino acids have alternative codons, which are known as synonymous codons. The frequencies with which these synonymous codons are used are unequal [1], with some codons being used more often than others. Furthermore, Plotkin et al. [2] reported that codon usage is tissue-specific. The phenomenon of codon usage bias, which can be interpreted as an outcome of either mutational bias or translational selection, is an essential feature of most genomes across all three domains of life [3]. The patterns of codon usage within mammalian genomes are markedly different from those of other taxa. In mammals, codon usage bias is found to be influenced by the variation in isochores (GC content) or by variation in the tRNA pool of the cell [4, 5]. Differences in codon usage or variation in tRNA abundance can elicit varied responses to environmental changes, in terms of regulation of the translation mechanism and cell phenotype [6]. Urrutia and Hurst [7] reported that, in humans, codon usage bias is positively related to gene expression but inversely related to the rate of synonymous substitution. Several factors contribute to synonymous codon usage bias, such as gene expression level, protein hydropathy, protein secondary structure, and translational selection [8–11]. Information on the synonymous codon usage pattern can provide significant insights pertaining to the prediction, classification, and molecular evolution of genes and to the design of highly expressed genes and cloning vectors [12]. It may also improve the understanding of host-pathogen interactions, as synonymous codon usage can reveal host-pathogen coevolution and the adaptation of pathogens to specific hosts [13].

The evolutionarily close proteins of the albumin superfamily comprise albumin (ALB), α-fetoprotein (AFP), vitamin D-binding protein (VDBP), and afamin (AFM). In humans, the genes encoding these proteins are mapped to chromosome 4. These proteins are synthesized primarily and predominantly in the liver, but the expression pattern varies temporally. One functional property common to all members of the albumin superfamily is their tendency to serve as transporters of various cellular components, metabolites, and so forth. ALB, an abundant serum protein with a MW of 66 kDa, binds and transports a variety of ligands such as steroids, fatty acids, bilirubin, lysolecithin, prostaglandins, thyroid hormones, and drugs. In addition, ALB is known to be involved in various cellular functions including scavenging of oxygen free radicals, anticoagulation, and maintenance of the physiological pH and oncotic pressure of the plasma [14]. AFP (MW 67 kDa), a serum glycoprotein which is expressed at high levels by the fetal liver and visceral yolk sac [15, 16], is critical for female fertility rather than for embryonic development [17]. VDBP or Gc globulin (MW 58 kDa) is synthesized by various tissues, namely, liver, kidneys, gonads, and fat, and also by neutrophils [18]. Apart from binding and transporting vitamin D sterols, VDBP’s physiological functions include scavenging of G-actin [19], macrophage activation [20], and enhancement of the chemotactic activity of C5a and C5a des-Arg molecules [21, 22]. AFM or α-albumin (MW 87 kDa) is synthesized by the liver and by brain capillary endothelial cells. It mediates the transport of α-tocopherol across the blood-brain barrier [23].

The members of the albumin superfamily have been found to act as markers in various disease states in humans. AFP in maternal serum is indicative of Down’s syndrome and neural tube defects in the fetus [24, 25]. AFP levels are elevated in patients at high risk for hepatocellular carcinoma. In some patients, an increase in AFP levels accompanies gastric cancer with liver metastasis, and the condition is termed α-fetoprotein-producing gastric cancer (AFPGC) [26, 27]. VDBP may serve as a biomarker for vascular injury, as predicted by proteomic identification [28]. AFM may act as a potential adjunct marker to cancer antigen 125 (CA125) for the diagnosis of ovarian cancer [29]. A vast array of research has been done on the members of the albumin superfamily; however, so far, synonymous codon usage and the factors influencing codon usage in this gene family have not been studied. In this study, we applied bioinformatics approaches to elucidate the pattern of synonymous codon usage bias and its consequences for the expression level of genes in the albumin superfamily.

2. Materials and Methods

2.1. Sequences

The mRNA reference sequences of human serum albumin (ALB), afamin (AFM), α-fetoprotein (AFP), and vitamin D-binding protein (VDBP) in FASTA format were retrieved from GenBank of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/genbank/). The Open Reading Frame (ORF) of each mRNA sequence of the human albumin superfamily was obtained by using the ExPASy Translate tool (http://web.expasy.org/translate/).
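The ORF selection step can be illustrated with a minimal Python sketch. It is not the ExPASy Translate tool: it scans only the three forward frames of the mRNA and returns the longest stretch that starts at ATG and ends at the first in-frame stop codon. The function name `longest_orf` and the toy sequence in the usage note are hypothetical and introduced here for illustration only.

```python
# Minimal sketch (assumption: forward-strand frames only, standard stop codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna: str) -> str:
    """Return the longest ATG-to-stop ORF (stop codon included), or '' if none."""
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best
```

For example, `longest_orf("CCATGGCTTGGTAAGG")` returns the 12-nucleotide ORF in the third frame; a sequence without any ATG yields an empty string.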

2.2. Hydrophobicity Analysis

Grand average of hydrophobicity score (Gravy score) was calculated to quantify the average hydrophobicity of the translated gene product of each member of the albumin superfamily. It was calculated as the arithmetic mean of the hydrophobic indices of all amino acids in the sequence:

$$\mathrm{Gravy} = \frac{1}{N}\sum_{i=1}^{N} H_i,$$

where $N$ corresponds to the number of amino acids, while $H_i$ represents the hydrophobic index of the $i$th amino acid. The Gravy score of a protein can be either negative or positive depending on the frequency of amino acids with distinct properties. A negative Gravy score implies that the protein is hydrophilic and soluble in water. In contrast, a protein with a positive Gravy score is considered hydrophobic and poorly soluble in water [30].
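The Gravy calculation described above can be sketched in a few lines of Python. The text does not name the hydrophobicity scale, so the Kyte-Doolittle hydropathy indices used below are an assumption; the function name `gravy` is likewise introduced here for illustration only.

```python
# Kyte-Doolittle hydropathy indices (assumed scale; not stated in the text).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein: str) -> float:
    """Arithmetic mean of the hydropathy indices of all standard residues."""
    # Non-standard symbols (e.g. '*' or 'X') are skipped rather than scored.
    residues = [aa for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)
```

A sequence rich in charged residues, such as `gravy("DEKR")`, gives a negative score (hydrophilic), whereas `gravy("AAAA")` gives a positive one (hydrophobic).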

2.3. Codon Usage Analysis

The nucleotide distribution for the albumin superfamily was analyzed using the ExPASy ProtParam tool (http://web.expasy.org/protparam/). The quantities of the individual nucleotides (A, T, G, and C) were determined and summed to give the AT and GC content for each member of the albumin superfamily.
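The nucleotide tally described above amounts to counting the four bases and summing them pairwise. The following Python sketch shows that arithmetic under simple assumptions (ambiguous symbols are ignored and U is treated as T); the function name `nucleotide_composition` is hypothetical and is not part of any tool named in the text.

```python
from collections import Counter


def nucleotide_composition(seq: str) -> dict:
    """Return percentages of A, T, G, C plus combined AT and GC content."""
    s = seq.upper().replace("U", "T")
    counts = Counter(base for base in s if base in "ATGC")  # skip N, gaps, etc.
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, T, G, or C")
    pct = {base: 100.0 * counts[base] / total for base in "ATGC"}
    pct["AT"] = pct["A"] + pct["T"]
    pct["GC"] = pct["G"] + pct["C"]
    return pct
```

By construction the AT and GC percentages sum to 100, so reporting either one determines the other.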

2.4. Rare Codon (RC) Analysis

A rare codon (RC) is a codon with low usage in the genome, such as a rarely used synonymous codon or a stop codon [31]. The RC analysis was performed using the GenScript web server (http://www.genscript.com/cgi-bin/tools/rare_codon_analysis/) to examine the number of highest-usage and lowest-usage codons in the human albumin superfamily.

2.5. Indices of Codon Usage Deviation

Indices of codon usage deviation were calculated using CodonW (J Peden, version 1.4.2 http://codonw.sourceforge.net/) [32] to measure the deviation between the observed and the expected codon usage. Two internal measures were applied: GC variation and third-nucleotide preference in codons [33, 34]. These were obtained by calculating the number of GC nucleotides and the number of G or C nucleotides at the third position of synonymous codons (GC3), excluding the start and termination codons. In addition, the expected effective number of codons (ENC) for each albumin superfamily protein was calculated. The expected ENC is the value that the ENC takes when codon usage is affected only by GC3, as a consequence of mutation pressure and genetic drift. The ENC was calculated according to [35]:

$$\mathrm{ENC} = 2 + s + \frac{29}{s^{2} + (1 - s)^{2}},$$

where $s$ corresponds to the GC3 value, expressed as a fraction between 0 and 1 (0 to 100%).
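As a hedged illustration of the two quantities described above, the Python sketch below computes GC3 over synonymous codons and the expected ENC from the commonly used expression ENC = 2 + s + 29/(s^2 + (1 - s)^2), with s given as a fraction. It does not reproduce CodonW: the exclusion of ATG and TGG (the single-codon amino acids) together with the stop codons is an assumption about the convention, and the function names `gc3s` and `expected_enc` are introduced here for illustration only.

```python
# Codons with no synonymous alternative, plus stop codons (assumed exclusions).
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(orf: str) -> float:
    """Fraction of synonymous codons whose third base is G or C."""
    seq = orf.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    informative = [
        c for c in codons
        if c not in NON_SYNONYMOUS and set(c) <= set("ATGC")
    ]
    if not informative:
        raise ValueError("no synonymous codons found")
    return sum(c[2] in "GC" for c in informative) / len(informative)


def expected_enc(s: float) -> float:
    """Expected ENC when codon usage depends only on GC3 (s as a fraction)."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)
```

The expected value is largest at s = 0.5 and falls as GC3 moves toward either extreme, which is why it serves as a null curve against which an observed ENC can be compared.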

2.6. Relative Synonymous Codon Usage (RSCU)

Relative synonymous codon usage (RSCU) was calculated in order to examine the frequency of each synonymous codon encoding the same amino acid without the confounding effect of amino acid composition. The index was calculated as follows [36]:

$$\mathrm{RSCU}_{ij} = \frac{x_{ij}}{\dfrac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}},$$

where $x_{ij}$ is the number of occurrences of the $j$th codon for the $i$th amino acid, which can be encoded by $n_i$ synonymous codons.
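The RSCU definition above (observed count of a codon divided by the mean count over its synonymous family) can be sketched in Python as follows. This is an illustration under stated assumptions, not the CodonW implementation: it uses the standard genetic code, skips stop codons and single-codon amino acids, and omits families that do not occur in the sequence. The function name `rscu` is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(orf: str) -> dict:
    """Return {codon: RSCU} for every codon of each synonymous family present."""
    seq = orf.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) < 2:
            continue  # stop codons, Met, and Trp carry no synonymous choice
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in family:
            # observed count divided by the family mean (total / family size)
            values[c] = counts[c] * len(family) / total
    return values
```

An RSCU of 1 means a codon is used exactly as often as expected under equal usage; values above 1 mark preferred codons and values below 1 mark avoided ones.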

3. Results and Discussion

Genomic information on the mRNA sequences of the four members of the human albumin superfamily is shown in Table 1. The mRNA sequences of the albumin superfamily were translated into protein sequences using the ExPASy Translate tool. Only the ORF with no intermediate stop codon was selected for codon usage analysis. The similarity of the nucleotide and amino acid sequences of the albumin superfamily members is summarized in Figure 1. The results showed that ALB and AFP are more closely related to each other than to AFM and VDBP. AFP and VDBP have similar gene lengths of 2032 bp and 2024 bp, respectively. ALB has the longest gene length (2264 bp), while AFM has the shortest (1997 bp). Moreover, human ALB and AFP have exactly the same ORF length (1830 bp), and the ORF of AFM (1800 bp) is similar in length to those of ALB and AFP. VDBP (1425 bp) has the shortest ORF within the albumin superfamily. The similarity in ORF length among ALB, AFM, and AFP indicated that they may carry out similar biological functions; this is of particular interest for AFM, whose function is not well-known.

The solubility of the proteins of the albumin superfamily was assessed through the Gravy score (Table 1). All the family members were found to have a negative Gravy score, suggesting that these proteins are water soluble. This is in accordance with the biological role of these proteins as serum transporters.

The nucleotide distribution of the albumin superfamily is shown in Table 2. The members of this superfamily exhibit low GC content (<44.63%). ALB and AFP show a similar nucleotide distribution pattern, implying that they share similarity in their structures and biological functions. There is a close relationship between nucleotide composition and gene function [37]. AFM has the highest AT content, whereas VDBP has the lowest AT content. Although AFM and VDBP are grouped in the same superfamily, they show differing nucleotide compositions, suggesting variation in their biological functions compared to the other members of the albumin superfamily.

Rare codon analysis was carried out using the GenScript web server as described in Materials and Methods. A graph of codon frequency distribution was plotted to identify the quantities of rare codons present in each albumin superfamily protein (Figure 2). A codon usage frequency of 100 indicates that the codon is the most highly used for a given amino acid. Conversely, a codon with a usage frequency of less than 30 is classed as a low-frequency codon, which is likely to affect expression efficiency. The percentages of low-frequency codons present in ALB, AFM, AFP, and VDBP are 4%, 3%, 4%, and 4%, respectively. This result suggested that the members of the albumin superfamily contain only a small number of rare codons that may reduce the translational efficiency of the genes.

Indices of codon usage deviation are used to determine the differences between the observed and expected codon usage. The results for the effective number of codons (ENC), GC content, and G or C nucleotides at the third position of synonymous codons are summarized in Table 1. The ENC for each member of the human albumin superfamily was calculated in order to examine the pattern of synonymous codon usage independently of gene length. The ENC value ranges from 20 to 61, where a value of 20 indicates extreme bias toward the usage of one codon per amino acid, while a value of 61 represents equal usage of the synonymous codons [35, 38]. The results of this analysis revealed that the ENC value of the albumin superfamily varies from 51.65 to 56.62; that is, the ENC value of every member is greater than 50. The high ENC values suggested that the synonymous codons of the albumin superfamily were used almost equally and hence displayed little synonymous codon usage bias.

The GC content of the albumin superfamily is given in Table 1. GC content may affect the thermostability and bendability of DNA and the ability of the DNA helix to undergo the transition from the B to the Z form. GC content can be related to the ability of the coding region to be in an open chromatin state, leading to active transcription [39]. All the genes of the albumin superfamily have low GC content, suggesting that these family members are highly expressed. Furthermore, it has been reported that highly transcribed genes may have low mutation rates because they are subjected to DNA repair [40]. However, within the albumin superfamily, VDBP has the highest GC content, suggesting that it has the lowest expressivity level.

GC content at the third position of codons (GC3) is a putative indicator of the extent of base composition bias. Table 1 shows that the albumin superfamily has low GC3 values ranging from 37.1% to 42.8%. The albumin superfamily has low GC3 values because the majority of genes in this superfamily are located in AT-rich regions. Genes in AT-rich regions of the genome tend to use A- or T-ending codons. The low usage of codons ending with G or C signifies little GC codon usage bias in the albumin superfamily. In other words, it indicates the homogeneity of the synonymous codon usage pattern in the albumin superfamily.

The synonymous codon usage bias of each albumin superfamily protein was computed and tabulated in Table 3. The most preferentially used codon for a given amino acid is highlighted in red. Asn of AFP and His, Cys, and Arg of VDBP show equal usage of the synonymous codons. The variation in relative synonymous codon usage (RSCU) values not only indicated the different frequency of occurrence of each codon for a given amino acid in the different albumin superfamily proteins but also revealed the preference for either A + U or G + C codon usage, as listed in Table 3. The results of the RSCU analysis (Table 3) are summarized in Table 4. Preferential codon usage in the albumin superfamily indicates that codons with A or U at the third position are preferred over G- or C-ending codons. Table 4 also shows that the total score of A + U and G + C codon usage in the proteins of the albumin superfamily is not equal to 20. This is because some amino acid residues are encoded at equal frequencies by both A- or U-ending and G- or C-ending codons and hence are excluded from the analysis. The tendency of the albumin superfamily to use high A + U and low G + C indicated that mutational bias does not play a significant role in synonymous codon usage.

4. Conclusions

The members of the albumin superfamily, namely, ALB, AFP, AFM, and VDBP, exhibit sequence and structural similarities. The proteins possess three homologous folding domains as a result of the conserved pattern of cysteine residues in the members of the albumin superfamily [41, 42]. Our study on codon usage bias in the members of the albumin gene family revealed that they are also similar in terms of their low GC content, low GC3, and high ENC values. In addition, they show little bias in the usage of synonymous codons and are highly expressible genes. Furthermore, the low GC and GC3 values revealed that mutational bias and translational selection do not play a significant role in shaping the codon usage pattern in the albumin superfamily.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Director General of the Ministry of Health, Malaysia, for granting permission to publish this paper. This research is co-supported by the High Impact Research Grant UM-MOHE UM.C/625/1/HIR/MOHE/30 from the Ministry of Higher Education, Malaysia, and the University of Malaya Research Grant (UMRG), Grant no. RP004C-BAFR.