Abstract

Microarray datasets are widely used resources to predict and characterize functional entities of the whole genomics. The study initiated here aims to identify overexpressed stress responsive genes using microarray datasets applying in silico approaches. The target also extended to build a protein-protein interaction model of regulatory genes with their upstream and downstream connection in Arabidopsis thaliana. Four microarray datasets generated treating abiotic stresses like salinity, cold, drought, and abscisic acid (ABA) were chosen. Retrieved datasets were firstly filtered based on their expression comparing to control. Filtered datasets were then used to create an expression hub. Extensive literature mining helped to identify the regulatory molecules from the expression hub. The study brought out 42 genes/TF/enzymes as the role player during abiotic stress response. Further bioinformatics study and also literature mining revealed that thirty genes from those forty-two were highly correlated in all four datasets and only eight from those thirty genes were determined as highly responsive to the above abiotic stresses. Later their protein-protein interaction (PPI), conserved sequences, protein domains, and GO biasness were studied. Some web based tools and software like String database, Gene Ontology, InterProScan, NCBI BLASTn suite, etc. helped to extend the study arena.

1. Introduction

Plant stresses are the reasons for food insecurity and thus are a major threat to mankind [1]. Environmental stress is one of the biggest problems which has already been counted as a responsible phenomenon for reducing crop yields [2]. The effects of climate change like an increase in global temperatures may lead to drought, and increase in humidity is likely to increase plant susceptibility to pathogens. This has been recorded to be a major source of crop spoilage all over the world [3]. These factors are conspiring to greatly endanger food security, leading to social instability and increased poverty, particularly in developing countries. Clearly, this is not just a problem for the developing world but is a global problem affecting the entire population [4]. It is an utter necessity to understand the mechanisms by which plants adapt to environmental stresses to maintain world food supplies duly.

Plants respond to environmental stresses at both cellular and molecular level by altering the expression of many genes via different types of complex molecular signaling networks [5]. The knowledge of these pathways including identification of the regulatory codes would provide opportunities developing stress tolerance plant production through genetic manipulation. In the postgenomic era, understanding of these cellular systems is becoming increasingly important for biologists. Several methodologies are available [6, 7] but the data generated from these methods covers just a few complexes or pathways and is limited to a handful of model organisms. As a result, computational bioinformatics methods have been developed to integrate this data and to extrapolate from it to provide predictions for proteins and organisms not yet experimentally well characterized [810].

Various abiotic stresses, such as drought, high salinity, and variable temperature, negatively impact plant growth and productivity. Plants have adapted to respond to these stresses at the molecular, cellular, physiological, and biochemical level, enabling them to survive. Various adverse environmental stresses induce the expression of a variety of genes in many plant species [11, 12]. Numerous stress-induced genes have been identified using microarray experiments [13, 14]. The products of these genes are thought to promote stress tolerance and to regulate gene expression through signal transduction pathways [11].

In current study, in silico approaches focused on finding the connection between upregulated genes in different abiotic stress conditions. Arabidopsis is one of the model organisms for studying plant genetics and development. The genome of Arabidopsis is the first to be sequenced among higher plants and is believed to comprise at least 30,700 genes. Of these genes, the function of approximately one-third (9194) remains unknown according to the functional Gene Ontology (GO) category listed by the Arabidopsis Information Resource (TAIR) [15]. Of the remainder, a large proportion lack complete or adequate functional annotation. We in the present study aimed at constructing a genome-wide functional network of Arabidopsis by integrating relations extracted from diverse data sources. So, the aim was to construct a genome-wide functional network of Arabidopsis by integrating relations extracted from diverse in silico data sources. The experiment initiated aiming at upregulated genes, TF, and enzymes during four abiotic stresses, like, cold, drought, salinity and responses to abscisic acid.

2. Materials and Methods

2.1. Work Plan

The whole working procedure is demonstrated in Figure 1(a). Here, only processed datasets were taken for the study.

2.2. Microarray Datasets Retrieval

In the current study, microarray datasets were retrieved from ArrayExpress database [16] from EMBL-EBI (https://www.ebi.ac.uk/arrayexpress/). The retrieval was random and the only specification was four different stress signals like salinity, drought, cold, and response to ABA. Only processed datasets were downloaded. The retrieved datasets were generated using at least three replicates. The processed datasets were initiated with at ~15,000 transcripts. A brief description of the collected data with their ArrayExpress accession has been mentioned in Table 1.

2.3. Data Filtering

All collected datasets were downloaded and copied in an excel sheet. The target was to create a scatter plot based on the value (log value) generated by the microarray expression. A simple layout was created from the value corresponding to the samples represented in the study.

2.4. Common Pattern Study

Cytoscape version 3.1.0 (http://www.cytoscape.org/) networking software was used to create an expression hub using the filtered dataset. This program was run with default parameters to import and export data in a variety of formats, from simple delimited text formats to XML and other sophisticated formats for sharing data with other programs [17]. The collected datasets were renamed to have the extension “.pvals” to be recognized by Cytoscape. The data were organized as a matrix, with each row representing the expression results for one gene/protein in the network. The first row provided column labels. The first column holds the gene/protein identifier, while the second column contains expression value. The data also contained values, so each experiment was represented with two columns: the first is assumed to contain the expression measure, the second contains the significance measure, and the two columns were exactly having the same label.

2.5. Retrieving Common Upregulated Genes under Four Stresses

All expressed genes from the physically interacting nodes were put in Venny to find the common genes/proteins, and a Venn diagram was produced as an output file (http://bioinfogp.cnb.csic.es/tools/venny/). The insertions of the data were selective and the datasets were named according to the microarray experiments.

2.6. Protein-Protein Interaction Study

Those commonly upregulated genes were taken for further analysis. Protein-protein interaction was identified using String 9.05 (http://string-db.org/) database [18] and only the validated interactions were considered for later analysis. The interaction network for the target proteins were further validated using Advanced Network Merge plug-in integrated in Cytoscape platform. The integration here was based on functional resembles between the nodes.

2.7. Selection of the Regulatory Proteins/TF/Enzymes

Through extensive text mining and literature search the initial stress related gene pool for stress response was generated. For searching literatures associated with Arabidopsis, the most important collection of scientific publications used was PubMed (http://www.ncbi.nlm.nih.gov/pubmed). Other data-mining tools used for literature search were Highwire Topicmap and Alibaba (http://alibaba.informatik.huberlin.de) (Figure 1(b)).

2.8. Target Gene Homolog Identification

Target genes were blasted in BlASTn suite from NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastnPAGETYPE=BlastSearch&LINK_LOC=blasthome). The BLAST hit was optimized for highly similar sequences (Megablast). The best homolog of the target gene and function were collected for further analysis.

2.9. Gene Ontology (GO) Analysis

Gene ontology program [19] from NCBI was used to find out functional similarities and to calculate the branching pathways by putting each individual as separate input. The output was considered from all three categories (cellular, molecular, and biological process) converted into graphical representation.

2.10. Protein Domain Retrieval

Targeted genes functions were determined by finding actual protein domains using InterProScan [20] from EBI (http://www.ebi.ac.uk/interpro/). The conserved domains were analyzed to predict functional variability before or after stress condition or finding the actual mechanism behind expressing these common genes in different abiotic stresses.

2.11. Data Analysis

The robustness of the network was analyzed by checking several different parameters of the network. The statistical probability was counted following the methods mentioned by Zaman et al. [21]. The data was modified as a suitable input file for Gene Cluster 3.0 functional clustering tool. No log transformation was required for the data as they were deposited as log transformed file in the database. Hierarchical cluster was formed using complete linkage using uncentered correlation matrix. SOM (self-organizing matrix) with 100,000 iterations was used for the cluster analysis. The clustering was finally finished performing a PCA (principle component analysis).

3. Results and Discussion

3.1. Arrayed Data Analysis

Primarily four microarray datasets from ArrayExpress database were taken. Samples data were collected and arranged in an excel sheet. The samples expression based on log value was justified and only the values of more than 5 were considered in this study. Figure 2 represents the overall expressed data processed from the microarray datasets.

3.2. Selection of Target Genes

The collected data were further screened based on the expression mentioned in Figure 2. Only upregulated genes that are grouped in 15–20 range in Figure 2 were sorted out. A good number of overexpressed genes, TF and enzymes could be selected following the range mentioned and the exact numbers were shown in Table 2.

3.3. Expression Hub with the Selected Transcripts

The selected transcripts ID were then merged into Cytoscape software and the aim was to create an expression hub based on the physical interaction value as well as maximum expression parameters. The hub generated (Figure 3) here demonstrated the interaction and distance among the targeted transcripts. Here the connection among transcripts made three clusters (Figure 3). Only the cluster at the middle connected with the maximum number of the transcripts and these clusters of transcripts (Figure 3(b)) were taken for further screening.

3.4. Common Genes Found in Different Stress Signals

The commonly expressing (i.e., upregulated) and physically connected genes found in different stress signals were then sorted by using the Van Diagram technique to create a Venn diagram, so that common transcripts could easily be isolated from the transcript chunks. The Venn diagram demonstrated that only 42 commonly upregulated genes were found in the selected datasets (Figure 4).

3.5. Protein-Protein Interaction Study of the Forty-Two Upregulated Transcripts

The 42 commonly upregulated genes (Supplementary File 1, see Supplementary Material available online at http://dx.doi.org/10.1155/2014/896513) were then taken for further studies to see their interaction among themselves in terms of physical interaction, coexpression, literature mining, and so forth. The interaction was generated in String Database (version 9.05) and it was found that almost every upregulated gene came in contact with the others and showed a strong corelation. About 30 genes were directly connected while others remain distant in connection (Figure 5). Directly connecting proteins were then brought together to see the interaction (Figure 6) and a strong corelation between transcription factors and antiporter genes and enzymes was observed.

It was revealed that (Figure 5) forty-two upregulated genes are not in a single protein interaction. The database (String 9.05) provided the prediction and the validation of the submitted samples query on the basis of high throughput technique, coexpression and literature mining, and so forth. The forty genes interactions in terms of proteins were checked and only validated interaction was counted for further studies. Only thirty proteins were selected based on validated physical interaction data. Later thirty proteins were submitted into string to get the connection (Figure 6).

3.6. Connectome Genes and Transcription Factor

In the light of the above result eight genes TF and enzymes were bridging among each other and brought their downstream targets in the expression hub. These strongly corelated connectomes in abiotic stress tolerance were short-listed (Table 3) for further studies. It was revealed that only eight major genes, those selected eight major genes (Table 3), were then further analyzed.

In the next section of the result, all possible characters of the targeted eight molecules were revealed depending on their amino acid, protein domains, individual interactomes, and gene ontology to get the whole pictorial view of the genes in three different sectors of life system, biological, molecular, and cellular, respectively. Available free tools mentioned in Section 2 have been extensively applied to get the results to make individual interpretation.

3.7. Conservancy Exploration of the Targeted Genes

Selected eight genes’ amino acid sequences were collected from NCBI database and BLASTed in NCBI BLASTp suite using the protein-protein blast algorithm. The conserved domains were retrieved to understand functional entities of the target proteins. Only the homologs close to the search molecule were considered to find out protein superfamily conservancy (Table 4).

All targeted genes are highly conserved among other species. The mentioned superfamilies (Table 4) are specified in the NCBI database and demonstrated the characterized functions and properties. All of molecules and found superfamily domains have been characterized as a major player on abiotic stress tolerance.

3.8. Functional Domains Retrieval

The protein domains (Table 5) are significantly diverse and bear the importance of each individual protein in stress response in different abiotic stress simultaneously. Only an exchanger domain () was found to be overrepresented in both NHX1 and SOS1 which demonstrated its functional integrity toward salinity stress. Some kinases are present; their roles have already been explored in upregulating stress responsive genes during stress(es). Interestingly, all these different domains are highly up-regulated in all selected abiotic stresses like cold, salinity, drought and ABA. This correlation helps to hypothesize those single candidate genes, TF and enzymes have different roles in different stress signals.

3.9. Protein-Protein Interaction of the Targeted Eight Genes

Protein-protein interaction was observed using string 9.05 database. Each target molecule based PPI was checked. The PPI interaction and stringency to each other was calculated (Figure 7). All of the proteins interact mostly with stress responsive genes and TF. The calculations showed (Supplementary File 2) that each targeted protein brings more stress responsive molecules into a single string so that they can provide tolerance. Moreover, very specific protein like NHX1 also connects with some cold responsive and drought response elements which clarified its functional activity during targeted abiotic stress response.

3.10. Parameters for the Integrated Network

The parameters revealed the robustness of the network. Targeted all up-regulated proteins could build up a protein-protein interaction network among themselves (Figure 8). Clustering coefficient of thenetwork is 0.62. Network diameter was 14 units. Network centralization valued 0.094. Shortest path covered 86% of the proteins. Average number of neighbors for the nodes was 10.35. Network heterogeneity was 0.868. The red line fits the line whereas the blue line shows the power law distribution.

3.11. Gene Ontology Study

Gene ontology study helped to get all correlated real and hypothetical functions of the target protein molecules. It depicted (Table 6) basic and stress inductive functions.

3.12. Final Integration of the Targeted Genes in a Single Correlation Image

A correlation image was drawn based on the calculation and analysis provided in Supplementary File 2 to address an internal correlation of these eight protein molecules in Arabidopsis (Figure 9).

In model plant, Arabidopsis thaliana, thousands of genes have been identified to play different roles in different responses (tolerance and susceptibility). Naika and his colleagues [22] did a similar study with Arabidopsis on stress responses. They compiled a dataset on abiotic stress responses and used functional enrichment analyses like GO (Gene Ontology) and PO (Plant Ontology) annotation to understand plant specific features associated with differential upregulation of genes for individual signals. They have also used STRING to derive interactome and to predicate protein-protein interaction like the present study using stress responsive transcription factors database (STIFDB), PubMed, and Gene Expression Omnibus (GEO). They found 3091 genes differentially upregulated during 14 different abiotic stresses, namely, abscisic acid, aluminum, cold, cold-drought-salt, dehydration, drought, heat, iron, light, NaCl, osmotic stress, oxidative stress, UV-B, and wounding. In the present study, only four out of the 14 abiotic stress signals, Slat, Cold, Drought, and ABA, were studied. However, Naika and his colleagues [22] kept the upregulated gene group under an ID and did not go further to gene level to find out the specific genes and their specific interaction [22]. But in the present study such interconnectome was done in Arabidopsis thaliana.

The aim of the study was initially to find out commonly upregulated genes in different abiotic stress and by doing that the ultimate goal was to hypothesize a gene regulatory network. In current study, four abiotic stress dependent gene expression counts were taken. The expression hub creations led to the finding out of the most common genes/proteins that are upregulated in all targeted abiotic stress conditions. Then the sorting was done based on the connectome data and the only bridging molecules were taken for further studies. Then, extensive bioinformatics tools and databases were used to characterize all individuals in terms of similarities, conservancy, protein domain, GO, and individual interaction. It turned out that all individuals are highly correlated in functions and diverse in mechanism at the same time. Most of the GO annotation referred to functional entities in common patterns which helped to create a regulatory network that depict that these targeted genes/proteins are the most common and role player in plants during stress and maintain some uniqueness. Moreover, most of the functions of these targeted genes/proteins have DNA binding properties which can be a major basis of saying that these molecules are most competent for initiating stress tolerance response as they bring about more TF, enzymes, and/or other regulatory genes in the same string during stress tolerance.

4. Conclusion

Bioinformatics study based on online tools and database using freely available microarray datasets show that there are some common genes upregulated during various environmental stresses. The proposed protein-protein interaction network may solve the mystery relating abiotic stress tolerance mechanism, so further validation by wet lab experiments are required to resolve the secret. So, in future attempts need to be taken in the wet bench to analyze their activity in total to have an in-depth idea of their actual activity under stress condition so that it could bring some answers to the farmers in the crop sector as well as in the nature.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to thank BAS-USDA PALS program for funding the project.

Supplementary Materials

Commonly up-regulated genes, TF and enzymes in different abiotic stresses were selected and forty two (42) candidate genetic materials were found and further investigated. The supplementary file 1 describes all initial targeted 42 candidates with their functions and their database ID.

  1. Supplementary Material