Identification of 170 New Long Noncoding RNAs in Schistosoma mansoni
Long noncoding RNAs (lncRNAs) are transcripts generally longer than 200 nucleotides with no or poor protein coding potential, and most of their functions are also poorly characterized. Recently, an increasing number of studies have shown that lncRNAs can be involved in various critical biological processes such as organism development or cancer progression. Little, however, is known about their effects in helminths parasites, such as Schistosoma mansoni. Here, we present a computational pipeline to identify and characterize lncRNAs from RNA-seq data with high confidence from S. mansoni adult worms. Through the utilization of different criteria such as genome localization, exon number, gene length, and stability, we identified 170 new putative lncRNAs. All novel S. mansoni lncRNAs have no conserved synteny including human and mouse. These closest protein coding genes were enriched in 10 significant Gene Ontology terms related to metabolism, transport, and biosynthesis. Fifteen putative lncRNAs showed differential expression, and three displayed sex-specific differential expressions in praziquantel sensitive and resistant adult worm couples. Together, our method can predict a set of novel lncRNAs from the RNA-seq data. Some lncRNAs are shown to be differentially expressed suggesting that those novel lncRNAs can be given high priority in further functional studies focused on praziquantel resistance.
The trematode Schistosoma mansoni is the primary parasite species responsible for schistosomiasis, a chronic debilitating disease. It is considered one of the most devastating tropical diseases in the world with at least 258 million people infected. Furthermore, 800 million people were living in endemic areas at risk of infection with more than 200,000 deaths each year [1, 2]. Its transmission has been reported in more than 78 countries, especially in tropical and subtropical areas such as Central and South America, Africa, and Southeast Asia [3, 4].
Although, in the last few decades, several drugs have been used for the treatment of schistosomiasis, with praziquantel (PZQ) representing the only widely effective agent used [5, 6]. It is effective against all species of schistosomes that infect humans and is relatively cheap and easy to use, but PZQ does not provide a cure, since young schistosomula are mostly resistant to its anthelmintic effects . This drug provides some relief to treated patients. However, young parasites, due to their intrinsic resistance to PZQ, escape elimination during treatment, grow to maturation, and begin to release eggs . This mechanism of resistance is worrying, because, under this ineffective pressure, drug resistance may also arise in humans, as in murine models [8, 9].
The S. mansoni genome is structured in 7 pairs of autosomes and one pair of sex chromosomes (female = ZW, male = ZZ). Chromosomes range in size from 18 to 73 MB and can be distinguished by size, shape, and the C-banding technique . According to the latest annotation, the S. mansoni genome is still considered as a draft with 380 Mb and 885 scaffolds. Despite this, about 81% of the bases are organized in these chromosomes. More than 45% of the predicted genes were modified, and the total number was reduced from 11,807 to 10,852 .
This parasite has a complex life cycle that involves many larval stages, an intermediate snail, and a final mammalian host. It is believed that the difference and developmental complexity observed between the different evolutionary stages and environments depend on the regulation of gene expression . Several molecules are responsible for gene expression regulation, especially long noncoding RNAs (lncRNAs). They are defined as transcripts longer than 200 nucleotides and do not encode proteins presenting several regulatory functions. They can interact with DNA, RNA molecules, and transcription factors, participating in various biological processes, mainly gene regulation .
While our knowledge of the mechanisms and scope of lncRNA-mediated regulation is growing, our understanding of how lncRNAs themselves are regulated is still quite limited . Regulating lncRNA expression would be expected to be an important cellular consideration given that lncRNAs have been implicated in regulating a variety of processes in eukaryotes including imprinting, dosage compensation, cell cycle regulation, pluripotency, retrotransposon silencing, meiotic entry, and telomere length [15–18]. They can also play important roles in numerous disease and physiological metabolism processes, such as X-chromosome inactivation, embryonic development, and pluripotency maintenance [16, 19].
These findings deeply changed disease pathobiology comprehension and led to the emergence of new biological concepts about human diseases, including the parasitic disease. Several methodologies were created to characterize and identify this RNA subtype. Noteworthy, S. mansoni researchers used these techniques and described a great picture of these lncRNAs and their participation in the disease processes. To date two studies have been published predicting lncRNAs molecules in schistosomes [20, 21] and only one  describes possible lncRNAs’ functions of 181 sequences from 7431 total predicted lncRNAs in 5 life cycle stages in this parasite, including canonically spliced putative lincRNA and spliced lncRNAs that are antisense to protein coding genes. These functions were predicted considering that lncRNAs may act by regulating their flanking protein coding gene neighbors in many processes to the rapid adaptation of the parasite to several environments.
In this study, we aimed to predict novel lncRNAs in S. mansoni via a computational pipeline and investigate features including possible functions of these sequences. Here we describe a complete new set of 170 lncRNAs in adult worms, from which we selected 15 for expression analysis in male, female, and also, for the first time, praziquantel-resistant worms which are known to have a differential expression profile in several important genes in relation to susceptible worms [8, 22, 23]. Our results show a differential expression profile of lncRNAs and reinforce the importance of this RNA subtype in schistosome biology.
2. Material and Methods
2.1. Ethics Statement
All experiments that involve animals were authorized by the Ethical Committee for Animal Care of the Federal University of Ouro Preto (CEUA-UFOP protocol 2011/55). These procedures were conducted in accordance with the accepted national and international regulations for laboratory animal use and care.
The S. mansoni LE strain was maintained by routine passage through Biomphalaria glabrata snails and BALB/c mice. The infected snails were induced to shed cercariae under light exposure for 2 hours. Adult worm parasites were obtained by liver perfusion of mice after 50 days of infection. The S. mansoni LE praziquantel-resistant (LE-PZQ) strains were obtained following a described method for inducing resistance to PZQ using infected B. glabrata snails . Infected snails were treated 3 times with 100 mg/kg PZQ for five consecutive days with a one-week interval between them. Then, after this treatment the cercariae both from treated snails (LE-PZQ strains) and from nontreated snails (LE strains susceptible) were used to infect two groups of mice. These mice were treated 45 days after infection with 200, 400, or 800 mg/kg PZQ with three PZQ treatments, each treatment administered on 5 consecutive days, with 1-week interval, for selection of less susceptible parasites to PZQ following the method developed by Couto et al. . Then, the LE-PZQ adult worms were obtained by mice liver perfusion, washed in RPMI 1640 (Sigma Chemical Co.), quick-frozen in liquid nitrogen, and stored at -80°C until use.
2.3. Data Set Download
The latest annotation data for S. mansoni (Genome Assembly release v5.2) were downloaded from GeneDB database (http://www.genedb.org) . In order to perform prediction analysis in this pipeline, a RNA-seq library from 7-week-old mixed sex adult worms was selected and downloaded from ArrayExpress database  in the FASTQ format (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-451.
2.4. Initial Transcriptome Assembly for S. mansoni
S. mansoni genome FASTA file and the annotation data in the GTF format were downloaded from GeneDB. One paired RNA-seq library was downloaded in the FASTQ format provided by sequencing using the Illumina Genome Analyzer IIx platform . The quality was then analyzed using FastQC 0.11.4 . The main parameters analyzed were the quality of the scores on the bases (quality of the Phred set as greater than 20 where it considers 1 error per 100 bases), per sequence quality scores, per sequence GC content, and removal of possible adapters used in the sequencing. After that, these reads were trimmed using Trimmomatic , and the following parameters were used: HEADCROP:15, LEADING:20, TRAILING:20, SLIDINGWINDOW:5:20, and MINLEN:50. Next filtering the quality of the files in the FastQC and removing the adapters, all the reads were mapped using the S. mansoni reference genome with the STAR aligner . In order to use this program, it was necessary to index the genome in the format compatible with STAR. Once the indexing was performed, the next step was to perform the mapping of the reads with the S. mansoni genome as reference using STAR. The output SAM files of STAR were then converted to its compressed BAM format, and then indexed and sorted with SAMtools for further analysis . Subsequently, the BAM files were assembled by Cufflinks 2.2.1  using de novo mode to assemble transcripts for 3 S. mansoni stage samples. Finally, assembly of transcripts was performed via Cufflinks, and the final file was submitted to the following pipeline filters for prediction of lncRNAs in S. mansoni. These steps were written in and performed with the use of custom Perl and Python scripts.
2.5. Identification of Novel lncRNAs Candidates
Cuffcompare was the first step used to compare the results generated by the ab initio assembly with the known annotations present in GeneDB in GTF format. The consensuses of novel transcripts were then used for further analysis. As a result of Cuffcompare, all the assemblies that were detected as new transcripts were categorized into 12 different categories according to their location compared with the S. mansoni reference genes . We kept the three following classes: unknown intergenic transcript (u), a transfrag falling entirely within a reference intron (i), and exonic overlap with reference on the opposite strand (x). This final file was submitted to the following filters written in the programming languages Python and Perl.
In the second step a filter was used to extract the sequences equal or greater than 200 nucleotides. The third step was done using CPAT (Coding Potential Assessment Tool), an alignment-free method to predict RNA coding potential using four sequence features . Only the transcripts classified by CPAT as noncoding transcripts were added to a new FASTA file. For our proposed filter, we used a prebuilt logit fly model as the classifier with optimum cutoff (CP) 0.39 (CP >=0.39 indicates coding sequence; CP < 0.39 indicates noncoding sequence) . The fourth step consisted of the extraction of transcripts that presented putative ORF (Open Reading Frame) smaller than 300 nt using OrfPredictor server . Transcripts with protein coding potential generally have ORF greater than 300 nt in size, generating proteins greater than or equal to 100 aa (amino acids) [33, 34]. In the fifth step we used an FPKM cutoff based on the distribution of the lncRNA expression level. Only FPKM values ≥2 were included in the final analysis. The sixth step was performed manually using the NCBI database as a reference and lncRNAs sequences from other works . At this step, we manually removed other RNA types, transposons, and predicted protein coding genes that have homology in other Schistosoma species. The seventh and final step was performed to remove all the known lncRNAs in S. mansoni using BLAST version 2.7.1 [35, 36].
2.6. lncRNAs Features
The novel adult lncRNAs predicted were characterized in terms of genomic localization, exon number, transcript length, and log2 FPKM using R and R Studio with ggplot2 packaged .
2.7. Gene Ontology Enrichment Analyses
The possible lncRNAs' functions were hypothesized with Gene Ontology (GO) enrichment analysis approach (http://geneontology.org/). In this analysis, all neighboring genes up to 100,000 bases upstream and downstream of the lncRNAs were selected and then their functions were evaluated . In this way, lncRNAs may act by binding to these protein coding genes by regulating their translation. The list obtained with all GO terms was created and plotted with R version 3.4.2, and R Studio version 1.0.136. The analysis method was based on Fisher's exact test and the -log10 (P value) was used to denote the significance of the GO term enrichment.
2.8. LncRNAs Expression Analysis
Approximately 100 mg adult worms were used for total RNA extraction performed according to the manufacturer’s specifications (SV Total RNA Isolation System). Expression levels of 15 lncRNAs were quantified by RT-qPCR with Applied Biosystems ABI 7300 by using SYBR-Green PCR Master Mix (Roche®). We designed specific primers for each lncRNA, endogenous control, and positive control using Gene Runner version 6.5.46 (Supplementary Table S1). For the investigated transcripts, three biological replicates were performed and normalized to the endogenous control with specific primers for S. mansoni EIF4E . A long terminal repeat (LTR) retrotransposon with 58% cover and 95% identity with S. mansoni Saci-4 LTR retrotransposon was used as a comparison parameter because it is a transcript found expressed in all chromosomes. Expression levels were calculated according to the method  using the Applied Biosystems 7300 software. All these experiments were performed following MIQE guidelines .
2.9. Statistical Analysis
Statistical analysis for RT-qPCR was performed using GraphPad Prism, version 7.0 (San Diego, CA, USA). One-way and two-way analysis of variance (ANOVA), following by Tukey multiple comparisons, were performed to investigate significant differential expression of transcripts throughout the sexes and treatments. In all cases, the differences were considered significant when P values were <0.05.
3.1. Initially Assembled Transcripts
A stranded RNA-seq data set of the whole transcriptome was used to assemble transcripts (Figure 1(a)). We first trimmed more than 10 million raw reads (FASTQ reads) obtaining a total of 6 million clean reads. More than 5 million (82.18%) of them were mapped with STAR to the S. mansoni genome (v5.2) and calculation of summary statistics (Supplementary Table S2). De novo assembly from the aligned fragments was performed using Cufflinks with all the ab initio default parameters to generate 15,776 transcripts, which were then processed through the described pipeline.
3.2. Computational Pipeline to Predict Novel lncRNAs
Our computational pipeline applied multiple filters on these transcripts (Figure 1(b)) to predict novel lncRNAs in S. mansoni adult worm. First, Cuffcompare removed the assemblies overlapping with transcripts annotated in the reference genome. This includes reconstructed protein coding transcripts or annotated known noncoding transcripts. The selection of classes u, i, and x led to the removal of almost 60% of assembled transcripts. The remaining 6426 transcripts were submitted to a length filter (≥ 200) removing 1487. After passing this length filter, 4525 transcripts were classified by CPAT as noncoding transcripts following the ORF filter (≤ 300 nt) with 3329 transcripts remaining. For both categories, transcripts with exon number ≥ 2 and FPKM ≥ 0.5, only 644 transcripts were obtained. Since this study focuses on noncoding transcripts, a manual step was performed removing other RNAs, transposons, and predicted protein coding genes in other Schistosoma species. This filter led to 256 transcripts that were submitted to the last step removing all the known lncRNAs in S. mansoni by BLAST. From a total of 256 predicted lncRNAs, 170 (66.4%) were considered potentially novel lncRNAs candidates against 86 (33.6 %) that presented a homology (at least >70%) with other lncRNAs in S. mansoni (Supplementary Table S3).
3.3. lncRNAs Features
To determine S. mansoni lncRNAs features, the genomic localization, exon number, transcript length, and log2 FPKM (Figure 2) were analyzed. In this data set we found that the majority of lncRNAs were found on chromosome 1, sexual ZW, and the scaffolds. The majority of the intronic lncRNAs were found on the sexual ZW and the unfinished scaffolds (Figure 2(a)). Moreover, lncRNAs had few exons per transcript (2-3) (Figure 2(b)) with most of them having transcript lengths between 200bp and 2000bp (Figure 2(c)). Finally, many of these lncRNAs could be confidently detected, with FPKM expression level between 2 and 9 (Figure 2(d)).
3.4. Overview of lncRNA RT-qPCR Validation.
Based on nearby encoding genes, we selected 15 lncRNAs from a set of 170 putative molecules and the LTR retrotransposon to verify RT-qPCR expressions (Figure 3(a)). This selection was performed based on the enriched GO terms of the analyzed neighboring which were relevant genes for S. mansoni biology, for example, Smp_174670 (ubiquitin conjugating enzyme E2 J1), Smp_030780 (putative ubiquitination factor E4a), and Smp_199890 (venom allergen-like 7 protein).
Our relative expression results showed that Sm-lncRNA 1, Sm-lncRNA 2, Sm-lncRNA 3, Sm-lncRNA 4, Sm-lncRNA 7, Sm-lncRNA 8, Sm-lncRNA 11, Sm-lncRNA 13, Sm-lncRNA 14, and Sm-lncRNA 15 had a low expression without statistical significance. Besides, 4 lncRNAs, Sm-lncRNA 6, Sm-lncRNA 9, Sm-lncRNA 10, and Sm-lncRNA 12 had a higher expression than this lower group. The most highly expressed lncRNA was Sm-lncRNA 5. It was even more highly expressed than the LRT retrotransposon used as a comparison parameter.
We also selected the first three lncRNAs and the LTR retrotransposon to verify the expression between sexes into four groups: Control-Male, Control-Female, PZQ-Male, and PZQ-Female (PZQ-Male and PZQ-Female are related to S. mansoni LE praziquantel-resistant strains) (Figure 3(b)). All the 4 Control-Female expressions were higher than Control-Male and PZQ-Female.
3.5. Target Gene Prediction
GO enrichment analyses for the neighborhood target genes were made on three different aspects, namely, biological process (BP), molecular function (MF), and cellular component (CC). The 10 significantly overrepresented GO terms included 389 genes involved in BP, 555 genes involved in MF, and 768 genes involved in CC (Figure 4).
The largely enriched and meaningful BP terms were related to metabolism, transport, and biosynthesis. The enriched MF terms were predominantly related to binding including catalytic activity and nucleotide binding. As for CC, the most enriched terms were cell, intracellular, and cytoplasm. A detailed table of all 15 expressed lncRNAs was made including the neighboring coding genes and their respective GO entries (Supplementary Table S4).
Several molecular and biochemical experiments revealed that lncRNAs may play diverse roles and functions in cellular biology. However, given the high abundance of lncRNAs and the poor-genetic conservation between species, the study of these molecules is extremely intriguing because they can be key regulators of species-specific biological processes. Notably, there are great prospects for a better understanding of S. mansoni biology, but much work will be needed to elucidate the specific role of lncRNAs in this parasite lifestyle. Thus, pipeline development for the S. mansoni genome is of extreme importance due to the lack of adapted and specific methodologies for the genus Schistosoma.
Current methodologies for predicting lncRNA genes are species-specific following the genome intrinsic characteristics of each species which increase data generation sensitivity and specificity. Each methodology used follows a flow chart that best adapts the steps and programs used [42–45]. This is a very important point because S. mansoni has particular characteristics since some of the repetitive fractions of DNA consist of tandemly repeated ribosomal genes of which there are 500-1000 copies per genome that represent 1.8-3.6% of the total DNA and nonribosomal repetitive sequences that comprise at least a further 2.0% of the total DNA .
In this pipeline, we introduced a manual step to remove other RNAs, transposons, and predicted protein coding genes in other Schistosoma species. In Figure 1, 39% of the initially putative lncRNAs showed significant homology with retrotransposons. This finding indicates an important difference of our pipeline. In addition, 33.6% of the putative lncRNAs identified were identical to those recently described for S. mansoni , reinforcing the hypothesis that the steps which were described in both studies are ideal for the identification of lncRNA in the genus Schistosoma.
In this work, we selected some lncRNAs to initiate expression studies, using lncRNA assumptions as criteria that are located in the proximity of genes involved with important pathways to S. mansoni survival, like the ubiquitin proteasome system, ribosomal protein (Supplementary Table S4), and others. In total, 100% of novel lncRNAs analyzed were expressed in adult worm couples (Figure 3). This is consistent with data from Vasconcelos et al.  who found similar percentages of expressed lncRNAs in S. mansoni adult worms.
The act of transcribing lncRNAs can have profound consequences on the ability of nearby genes to be expressed. For example, transcription of a lncRNA across the promoter region of a downstream protein coding gene directly interferes with transcription factor binding and thus prevents the protein coding gene from being expressed [47, 48]. We detected high levels of Sm-lncRNA 5 expression related to proteasome B1 subunit, responsible for peptidylglutamyl activity. Curiously, it is the lowest activity of the proteasome of S mansoni . The proteasomes form a pivotal component for the ubiquitin proteasome system (UPS). It is implicated in protein ubiquitination, proteolysis, and degradation and was set as essential to S. mansoni biology . Moreover, the Sm-lncRNA 12 is differentially expressed in S. mansoni and occupies the neighborhood of E2 gene , leading us to hypothesize this participation on ubiquitin proteasome regulation. These findings reinforce themselves because differentially expressed lncRNAs, identified from human placentas, may regulate their associated mRNAs through several mechanisms and connect the UPS with infection-inflammation pathways .
Another set of putative lncRNAs, neighbor genes related to protein synthesis described in this work (Supplementary Table S4), are related to protein synthesis. Recently, lncRNAs have emerged as key players in the stress responses in plants  and eukaryotic cellular senescence . The stress response is a feature of the S. mansoni lifestyle in both vertebrate and invertebrates hosts .
Furthermore, lncRNA levels dynamically change in response to various drugs. These alterations affect gene expression involved in several cells function such as cycle arrest, inhibition of apoptosis, and DNA damage repair . S. mansoni exposure to a drug, particularly PZQ, which is used in a single dose or repeatedly in reinfection, may induce drug resistance or reduced susceptibility over time . The difference in gene expression between male and female in PZQ resistance worms suggest that lncRNAs can also be involved in drug resistance mechanisms in S. mansoni.
Several studies have provided insights into how genomic neighborhoods could influence gene expression levels, with important consequences for evolution, development, and disease .
In conclusion, we have identified a novel set of 170 lncRNAs in S. mansoni expressed in male, female, and PZQ resistant adult worms. These observations suggest that lncRNAs may be significant in parasite biology and be useful therapeutic targets. Further studies are required to dissect the function and mechanism of action of these RNA subtypes in normal biology and life cycle progression.
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the following Brazilian research agencies: FAPEMIG (Fundação de Amparo à Pesquisa do Estado de Minas Gerais, CBB APQ-01880-13), NuBio UFOP (Núcleo de Bioinformática da Universidade Federal de Ouro Preto), and CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico). The authors thank the following transcriptome initiatives: the São Paulo Transcriptome Consortium, the Minas Gerais Genome Network, and the Wellcome Trust Genome Initiative (UK).
Supplementary Table S1: primers pairs sequences selected for RT-qPCR experiments. Supplementary Table S2: detailed data obtained and mapping conditions for adult worm sample in RNA-seq. Supplementary Table S3: set of 170 putative lncRNAs identified by our computational pipeline. Supplementary Table S4: set of 15 expressed lncRNAs with the neighboring coding genes and their respective GO entries. (Supplementary Materials)
WHO, “Schistosomiasis. Fact sheet No. 115, January 2012,” http://www.who.int/mediacentre/factsheets/fs115/en/index.html, Accessed on Nov. 19, 2013.View at: Google Scholar
M. C. Sanchez, K. V. Krasnec, A. S. Parra et al., “Effect of praziquantel on the differential expression of mouse hepatic genes and parasite ATP binding cassette transporter gene family members during Schistosoma mansoni infection,” PLOS Neglected Tropical Diseases, vol. 11, no. 6, Article ID e0005691, 2017.View at: Publisher Site | Google Scholar
S. Andrews, “FastQC: a quality control tool for high throughput sequence data,” 2010, http://www.bioinformatics.babraham.ac.uk/projects/fastqc.View at: Google Scholar
H. Wickham, ggplot2: elegant graphics for data analysis, Springer Publishing Company, Incorporated, 2009.
J. Wang, X. Meng, O. B. Dobrovolskaya, Y. L. Orlov, and M. Chen, “Non-coding RNAs and their roles in stress response in plants,” Genomics, Proteomics & Bioinformatics, vol. 15, no. 5, pp. 301–312, 2017.View at: Google Scholar
M. Knight, W. Ittiprasert, H. D. Arican-Goktas, and J. M. Bridger, “Epigenetic modulation, stress and plasticity in susceptibility of the snail host, Biomphalaria glabrata, to Schistosoma mansoni infection,” International Journal for Parasitology, vol. 46, no. 7, pp. 389–394, 2016.View at: Publisher Site | Google Scholar
L. Lipovich, R. Johnson, and C.-Y. Lin, “MacroRNA underdogs in a microRNA world: evolutionary, regulatory, and biomedical significance of mammalian long non-protein-coding RNA,” Biochimica et Biophysica Acta (BBA)—Gene Regulatory Mechanisms, vol. 1799, no. 9, pp. 597–615, 2010.View at: Publisher Site | Google Scholar