ISRN Bioinformatics
The latest articles from Hindawi Publishing Corporation
© 2014, Hindawi Publishing Corporation. All rights reserved.

Hierarchical Ensemble Methods for Protein Function Prediction
Mon, 05 May 2014 06:16:31 +0000
Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high-dimensional biomolecular data, the imbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods, which have shown significantly better performance than hierarchy-unaware “flat” prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term, and the resulting predictions are then assembled into a “consensus” ensemble decision that takes into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Finally, open problems in this exciting research area of computational biology are considered, outlining novel perspectives for future research.
Giorgio Valentini
Copyright © 2014 Giorgio Valentini. All rights reserved.

Comparison of Merging and Meta-Analysis as Alternative Approaches for Integrative Gene Expression Analysis
Sun, 12 Jan 2014 11:43:32 +0000
An increasing number of microarray gene expression data sets are available through public repositories.
Their huge potential for yielding new findings has yet to be unlocked by making them available for large-scale analysis. To do so, it is essential that independent studies designed for similar biological problems can be integrated, so that new insights can be obtained. These insights would remain undiscovered when analyzing the individual data sets, because it is well known that the small number of biological samples used per experiment is a bottleneck in genomic analysis. Increasing the number of samples increases the statistical power, so that more general and reliable conclusions can be drawn. In this work, two different approaches for conducting large-scale analysis of microarray gene expression data (meta-analysis and data merging) are compared in the context of the identification of cancer-related biomarkers, by analyzing six independent lung cancer studies. Within this study, we investigate the hypothesis that analyzing large cohorts of samples, obtained by merging independent data sets designed to study the same biological problem, results in lower false discovery rates than analyzing the same data sets within a more conservative meta-analysis approach.
Jonatan Taminau, Cosmin Lazar, Stijn Meganck, and Ann Nowé
Copyright © 2014 Jonatan Taminau et al. All rights reserved.

NucVoter: A Voting Algorithm for Reliable Nucleosome Prediction Using Next-Generation Sequencing Data
Thu, 07 Nov 2013 14:41:56 +0000
Nucleosomes, which consist of DNA wrapped around histone octamers, are dynamic, and their structure, including their location, size, and occupancy, can change. Nucleosomes can regulate gene expression by controlling the accessibility of DNA to proteins. Using next-generation sequencing techniques along with laboratory methods such as micrococcal nuclease digestion, it is possible to predict the genomic locations of nucleosomes.
However, the true locations of nucleosomes are unknown, and it is difficult to determine their exact locations using next-generation sequencing data. This paper proposes a novel voting algorithm, NucVoter, for the reliable prediction of nucleosome locations. Multiple models verify the consensus areas in which nucleosomes are placed by the model with the highest priority. NucVoter significantly improves the performance of nucleosome prediction.
Boseon Byeon
Copyright © 2013 Boseon Byeon. All rights reserved.

Discovery of YopE Inhibitors by Pharmacophore-Based Virtual Screening and Docking
Mon, 21 Oct 2013 15:10:32 +0000
Gram-negative Yersinia bacteria secrete virulence factors that invade eukaryotic cells via the type III secretion system. One particular virulence factor, Yersinia outer protein E (YopE), targets the Rho family of small GTPases by mimicking the activity of the regulator GAP protein, and its secretion mainly induces cytoskeletal disruption and depolymerization of actin stress fibers within the host cell. In this work, potent drug-like inhibitors of YopE are investigated with virtual screening approaches. More than 500,000 unique small molecules from the ZINC database were screened with a five-point pharmacophore comprising three hydrogen acceptors, one hydrogen donor, and one ring, derived from different salicylidene acylhydrazides. Binding modes and features of these molecules were investigated with a multistep molecular docking approach using Glide software. Virtual screening hits were further analyzed based on their docking score, chemical similarity, pharmacokinetic properties, and the key Arg144 interaction along with other active site residue interactions with the receptor. As a final outcome, a diverse set of ligands with inhibitory potential was proposed.
Gizem Ozbuyukkaya, Elif Ozkirimli Olmez, and Kutlu O. Ulgen
Copyright © 2013 Gizem Ozbuyukkaya et al. All rights reserved.

Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies
Thu, 12 Sep 2013 09:11:19 +0000
RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow was tested by applying it to the analysis of 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, open-source-based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of the box to process Illumina RNA-Seq datasets.
Shanrong Zhao, Kurt Prenger, and Lance Smith
Copyright © 2013 Shanrong Zhao et al. All rights reserved.

Modern Computational Techniques for the HMMER Sequence Analysis
Tue, 03 Sep 2013 14:58:31 +0000
This paper focuses on the latest research and critical reviews on modern computing architectures and on software- and hardware-accelerated algorithms for bioinformatics data analysis, with an emphasis on one of the most important sequence analysis applications: hidden Markov models (HMM). We show a detailed performance comparison of sequence analysis tools on various computing platforms recently developed in the bioinformatics community.
The characteristics of sequence analysis, such as its data- and compute-intensive nature, make it very attractive to optimize and parallelize using both traditional software approaches and innovative hardware acceleration technologies.
Xiandong Meng and Yanqing Ji
Copyright © 2013 Xiandong Meng and Yanqing Ji. All rights reserved.

Construction and Analysis of the Cell Surface’s Protein Network for Human Sperm-Egg Interaction
Mon, 12 Aug 2013 09:23:46 +0000
Sperm-egg interaction is one of the most impressive processes in sexual reproduction, and understanding its molecular mechanism is crucial to solving problems in infertility and failed in vitro fertilization. The main purpose of this study is to map the sperm-egg interaction network between cell-surface proteins and to perform an interaction analysis on this new network. We built the first protein interaction network of human sperm-egg binding and fusion proteins, which consists of 84 protein nodes and 112 interactions. The gene ontology analysis identified a number of functional clusters that may be involved in the sperm-egg interaction. These include the G-protein coupled receptor protein signaling pathway, cellular membrane fusion, and single fertilization. The PPI network was highly interconnected and identified a set of candidate proteins (ADAM-ZP3, ZP3-CLGN, IZUMO1-CD9, and ADAM2-IZUMO1) that may have an important role in sperm-egg interaction. The results showed that ADAM2 may mediate the interaction between two essential factors, CD9 and IZUMO1. The KEGG analysis showed 12 statistically significant pathways with 10 proteins associated with cancer, suggesting a common pathway between tumor fusion and sperm-egg fusion. We believe that the availability of this map will assist future research on the fertilization mechanism and will also facilitate biological interpretation of sperm-egg interaction.
Soudabeh Sabetian Fard Jahromi and Mohd Shahir Shamsir
Copyright © 2013 Soudabeh Sabetian Fard Jahromi and Mohd Shahir Shamsir. All rights reserved.

A Computational Approach towards the Understanding of Plasmodium falciparum Multidrug Resistance Protein 1
Thu, 01 Aug 2013 15:12:16 +0000
The emergence of drug resistance in Plasmodium falciparum has tremendously affected chemotherapy worldwide, while the wide distribution of chloroquine-resistant strains in most of the endemic areas has further complicated the treatment of malaria. The situation has been worsened by the lack of a known molecular mechanism explaining the resistance conferred by Plasmodia species. Recent studies have suggested the association of antimalarial resistance with P. falciparum multidrug resistance protein 1 (PfMDR1), an ATP-binding cassette (ABC) transporter and a homologue of human P-glycoprotein 1 (P-gp1). The present study deals with the development of a PfMDR1 computational model and a model of substrate transport across PfMDR1, with insights derived from conformations relative to inward- and outward-facing topologies that switch the transportation system on and off. The ATP docked positions and their structural motif binding properties were found to be similar to those of other ATPases, thereby contributing to dimerization of the NBD domains, a unique structural agreement noticed in Mus musculus Pgp and the Escherichia coli MDR transporter homolog (MsbA). The interaction of leading antimalarials and phytochemicals within the active pocket of both wild-type and mutant-type PfMDR1 demonstrated the mode of binding and provided insights into the lower binding affinity that contributes to the parasite’s resistance mechanism.
Saumya K. Patel, Linz-Buoy George, Sivakumar Prasanth Kumar, Hyacinth N. Highland, Yogesh T. Jasrai, Himanshu A. Pandya, and Ketaki R. Desai
Copyright © 2013 Saumya K. Patel et al. All rights reserved.

SUMOhunt: Combining Spatial Staging between Lysine and SUMO with Random Forests to Predict SUMOylation
Mon, 17 Jun 2013 13:43:58 +0000
Modification with SUMO protein has many key roles in eukaryotic systems, which renders the identification of its target proteins and sites of considerable importance. Information regarding the SUMOylation of a protein may tell us about its subcellular localization, function, and spatial orientation. This modification occurs at particular, but not all, lysine residues in a given protein. In competition with biochemical means of modified-site recognition, computational methods are strong contenders in the prediction of SUMOylation-undergoing sites on proteins. In this research, physicochemical properties of amino acids retrieved from AAIndex, especially those involved in docking of modifier and target proteins and optimal presentation of the target lysine, were used in combination with sequence information and a random forest-based classifier provided in WEKA to develop a prediction model, SUMOhunt, with statistics significantly better than those of all previous predictors. The model achieved 97.56% accuracy, 100% sensitivity, 94% specificity, and 0.95 MCC, which shows that the proposed amino acid properties have a significant role in SUMO attachment. SUMOhunt will hence bring great reliability and efficiency to SUMOylation prediction.
Amna Ijaz
Copyright © 2013 Amna Ijaz. All rights reserved.

Exploiting Identifiability and Intergene Correlation for Improved Detection of Differential Expression
Mon, 03 Jun 2013 13:47:08 +0000
Accurate differential analysis of microarray data strongly depends on effective treatment of intergene correlation. Such dependence is ordinarily accounted for in terms of its effect on significance cutoffs. In this paper, it is shown that correlation can, in fact, be exploited to share information across tests and reorder expression differentials for increased statistical power, regardless of the threshold.
Significantly improved differential analysis is the result of two simple measures: (i) adjusting test statistics to exploit information from identifiable genes (the large subset of genes represented on a microarray that can be classified a priori as nondifferential with very high confidence), but (ii) doing so in a way that accounts for linear dependencies among identifiable and nonidentifiable genes. A method is developed that builds upon the widely used two-sample t-statistic approach and uses analysis in Hilbert space to decompose the nonidentified gene vector into two components that are correlated and uncorrelated with the identified set. In the application to data derived from a widely studied prostate cancer database, the proposed method outperforms some of the most highly regarded approaches published to date. Algorithms in MATLAB and in R are available for public download.
J. R. Deller Jr., Hayder Radha, and J. Justin McCormick
Copyright © 2013 J. R. Deller et al. All rights reserved.

Transcriptome Analysis of Spermophilus lateralis and Spermophilus tridecemlineatus Liver Does Not Suggest the Presence of Spermophilus-Liver-Specific Reference Genes
Sun, 26 May 2013 10:15:22 +0000
The expression of reference genes used in gene expression studies is assumed to be stable under most circumstances. However, studies have demonstrated that genes assumed to be stably expressed in one species are not necessarily stably expressed in other organisms. This study aims to evaluate the likelihood of genus-specific reference genes for liver using comparable microarray datasets from Spermophilus lateralis and Spermophilus tridecemlineatus. The coefficient of variation (CV) of each probe was calculated, and there were 178 probes common between the lowest 10% CV of both datasets. All 3 lists were analysed by NormFinder. Our results suggest that the most invariant probe for S. tridecemlineatus was 02n12, while that for S. lateralis was 24j21.
However, our results showed that Probes 02n12 and 24j21 are ranked 8644 and 926 in terms of invariancy for S. lateralis and S. tridecemlineatus, respectively. This suggests the lack of common liver-specific reference probes for both S. lateralis and S. tridecemlineatus. Given that S. lateralis and S. tridecemlineatus are closely related species and the datasets are comparable, our results do not support the presence of genus-specific reference genes.
Bryan M. H. Keng, Oliver Y. W. Chan, Sean S. J. Heng, and Maurice H. T. Ling
Copyright © 2013 Bryan M. H. Keng et al. All rights reserved.

IsoPlotter+: A Tool for Studying the Compositional Architecture of Genomes
Thu, 18 Apr 2013 11:14:21 +0000
Eukaryotic genomes, particularly animal genomes, have a complex, nonuniform, and nonrandom internal compositional organization. The compositional organization of animal genomes can be described as a mosaic of discrete genomic regions, called “compositional domains,” each with a distinct GC content that significantly differs from those of its upstream and downstream neighboring domains. A typical animal genome consists of a mixture of compositionally homogeneous and nonhomogeneous domains of varying lengths and nucleotide compositions that are interspersed with one another. We have devised IsoPlotter, an unbiased segmentation algorithm for inferring the compositional organization of genomes. IsoPlotter has become an indispensable tool for describing genomic composition and has been used in the analysis of more than a dozen genomes. Applications include describing new genomes, correlating domain composition with gene composition and density, studying the evolution of genomes, testing phylogenomic hypotheses, and detecting regions of potential interbreeding between human and extinct hominines.
To extend the use of IsoPlotter, we designed a completely automated pipeline, called IsoPlotter+, to carry out all segmentation analyses, including graphical display, and built a repository for compositional domain maps of all fully sequenced vertebrate and invertebrate genomes. The IsoPlotter+ pipeline and repository offer a comprehensive solution to the study of genome compositional architecture. Here, we demonstrate IsoPlotter+ by applying it to human and insect genomes. The computational tools and data repository are available online.
Eran Elhaik and Dan Graur
Copyright © 2013 Eran Elhaik and Dan Graur. All rights reserved.

HMEC: A Heuristic Algorithm for Individual Haplotyping with Minimum Error Correction
Mon, 28 Jan 2013 12:34:43 +0000
A haplotype is a pattern of single nucleotide polymorphisms (SNPs) on a single chromosome. Constructing a pair of haplotypes from aligned and overlapping but intermixed and erroneous fragments of the chromosomal sequences is a nontrivial problem. The minimum error correction approach aims to minimize the number of errors to be corrected so that the pair of haplotypes can be constructed through consensus of the fragments. We give a heuristic algorithm (HMEC) that searches through alternative solutions using a gain measure and stops whenever no better solution can be achieved. The time complexity of each iteration is expressed in terms of the dimensions of the SNP matrix, namely the number of fragments (rows) and the number of SNP sites (columns). An alternative gain measure is also given to reduce running time. We have compared our algorithm with other methods in terms of accuracy and running time on both simulated and real data, and our extensive experimental results indicate the superiority of our algorithm over others.
Md. Shamsuzzoha Bayzid, Md. Maksudul Alam, Abdullah Mueen, and Md. Saidur Rahman
Copyright © 2013 Md. Shamsuzzoha Bayzid et al. All rights reserved.
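
The minimum error correction (MEC) objective mentioned in the HMEC abstract can be illustrated with a small scoring routine. This is only a sketch of the objective, not the authors' heuristic search: it assumes fragments and haplotypes are strings over '0' and '1', with '-' marking positions a fragment does not cover, and the function names `mismatches` and `mec_score` are hypothetical.

```python
def mismatches(fragment, haplotype):
    """Count covered positions where a fragment disagrees with a haplotype.

    Positions marked '-' in the fragment are gaps and are ignored.
    """
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)


def mec_score(fragments, h1, h2):
    """Total number of corrections needed for a candidate haplotype pair.

    Each fragment is assigned to whichever haplotype it disagrees with
    least; the MEC score is the sum of those minimum disagreements.
    """
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


# Toy SNP matrix: 5 fragments (rows) over 5 SNP sites (columns).
fragments = ["0101-", "01-10", "1010-", "-0101", "10001"]
h1, h2 = "01010", "10101"

# Only the last fragment needs a correction (site 3) to match h2.
print(mec_score(fragments, h1, h2))  # prints 1
```

A heuristic such as the one described in the abstract would search over candidate haplotype pairs to lower this score; the scoring function itself stays the same.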

CallSim: Evaluation of Base Calls Using Sequencing Simulation
Wed, 12 Dec 2012 11:52:59 +0000
Accurate base calls generated from sequencing data are required for downstream biological interpretation, particularly in the case of rare variants. CallSim is a software application that provides evidence for the validity of base calls believed to be sequencing errors, and it is applicable to Ion Torrent and 454 data. The algorithm processes a single read using a Monte Carlo approach to sequencing simulation and does not depend upon information from any other read in the data set. Three examples from general read correction, as well as from error-or-variant classification, demonstrate its effectiveness as a robust low-volume read-processing base corrector. Specifically, correction of errors in Ion Torrent reads from a study involving mutations in multidrug-resistant Staphylococcus aureus illustrates an ability to classify an erroneous homopolymer call. In addition, support for a rare variant in 454 data for a mixed viral population demonstrates “base rescue” capabilities. CallSim provides evidence regarding the validity of base calls in sequences produced by 454 or Ion Torrent systems and is intended for hands-on downstream processing analysis. These downstream efforts, although time consuming, are necessary steps for accurate identification of rare variants.
Jarrett D. Morrow and Brandon W. Higgs
Copyright © 2012 Jarrett D. Morrow and Brandon W. Higgs. All rights reserved.

Electric LAMP: Virtual Loop-Mediated Isothermal AMPlification
Wed, 21 Nov 2012 14:59:38 +0000
We present eLAMP, a PERL script with a Tk graphical interface that electronically simulates Loop-mediated AMPlification (LAMP), allowing users to efficiently test putative LAMP primers on a set of target sequences. eLAMP can match primers to templates using either exact matching (via built-in PERL regular expressions) or approximate matching (via the tre-agrep library).
Performance was tested on 40 whole genome sequences of Staphylococcus. eLAMP correctly predicted that the two tested primer sets would amplify from S. aureus genomes and not amplify from other Staphylococcus species. Open source (GNU Public License) PERL scripts are available for download from the New York Botanical Garden's website.
Nelson R. Salinas and Damon P. Little
Copyright © 2012 Nelson R. Salinas and Damon P. Little. All rights reserved.

Classifying Multigraph Models of Secondary RNA Structure Using Graph-Theoretic Descriptors
Sun, 11 Nov 2012 11:48:05 +0000
The prediction of secondary RNA folds from primary sequences continues to be an important area of research, given the significance of RNA molecules in biological processes such as gene regulation. To facilitate this effort, graph models of secondary structure have been developed to quantify and thereby characterize the topological properties of the secondary folds. In this work, we utilize a multigraph representation of a secondary RNA structure to examine the ability of existing graph-theoretic descriptors to classify all possible topologies as either RNA-like or not RNA-like. We use more than one hundred descriptors and several different machine learning approaches, including nearest neighbor algorithms, one-class classifiers, and several clustering techniques. We predict that many more topologies will be identified as representing RNA secondary structures than are currently predicted in the RAG (RNA-As-Graphs) database. The results also suggest which descriptors and which algorithms are more informative in classifying and exploring secondary RNA structures.
Debra Knisley, Jeff Knisley, Chelsea Ross, and Alissa Rockney
Copyright © 2012 Debra Knisley et al. All rights reserved.

A Robust Topology-Based Algorithm for Gene Expression Profiling
Sun, 11 Nov 2012 11:31:54 +0000
Early and accurate diagnoses of cancer can significantly improve the design of personalized therapy and enhance the success of therapeutic interventions. Histopathological approaches, which rely on microscopic examinations of malignant tissue, are not conducive to timely diagnoses. High-throughput genomics offers a possible new classification of cancer subtypes. Unfortunately, most clustering algorithms have not been proven sufficiently robust. We propose a novel approach that relies on the use of statistical invariants and persistent homology, one of the most exciting recent developments in topology. It identifies a sufficient but compact set of genes for the analysis, as well as a core group of tightly correlated patient samples for each subtype. Partitioning occurs hierarchically and allows for the identification of genetically similar subtypes. We analyzed the gene expression profiles of 202 tumors of the brain cancer glioblastoma multiforme (GBM) available at the Cancer Genome Atlas (TCGA) site. We identify core patient groups associated with the classical, mesenchymal, and proneural subtypes of GBM. In our analysis, the neural subtype consists of several small groups rather than a single component. A subtype prediction model is introduced that partitions tumors in a manner consistent with clustering algorithms but requires the genetic signature of only 59 genes.
Lars Seemann, Jason Shulman, and Gemunu H. Gunaratne
Copyright © 2012 Lars Seemann et al. All rights reserved.

Hybrid-Controlled Neurofuzzy Networks Analysis Resulting in Genetic Regulatory Networks Reconstruction
Thu, 01 Nov 2012 07:55:08 +0000
Reverse engineering of gene regulatory networks (GRNs) is the process of estimating genetic interactions of a cellular system from gene expression data.
In this paper, we propose a novel hybrid systematic algorithm based on a neurofuzzy network for reconstructing GRNs from observational gene expression data when only a medium-small number of measurements are available. The approach uses fuzzy logic to transform gene expression values into qualitative descriptors that can be evaluated using a set of defined rules. The algorithm uses a neurofuzzy network to model the effects of genes on other genes, followed by four stages of decision making to extract gene interactions. One of the main features of the proposed algorithm is that an optimal number of fuzzy rules can be easily and rapidly extracted without overparameterizing. Data analysis and simulation are conducted on microarray expression profiles of the S. cerevisiae cell cycle and demonstrate that the proposed algorithm not only selects the patterns of the time series gene expression data accurately, but also provides models with better reconstruction accuracy when compared with four published algorithms: DBNs, VBEM, time delay ARACNE, and PF subjected to LASSO. The accuracy of the proposed approach is evaluated in terms of recall and F-score for the network reconstruction task.
Roozbeh Manshaei, Pooya Sobhe Bidari, Mahdi Aliyari Shoorehdeli, Amir Feizi, Tahmineh Lohrasebi, Mohammad Ali Malboobi, Matthew Kyan, and Javad Alirezaie
Copyright © 2012 Roozbeh Manshaei et al. All rights reserved.

Dynamic Clustering of Gene Expression
Tue, 16 Oct 2012 15:41:02 +0000
It is well accepted that genes are simultaneously involved in multiple biological processes and that genes are coordinated over the duration of such events. Unfortunately, clustering methodologies that group genes for the purpose of novel gene discovery fail to acknowledge the dynamic nature of biological processes and provide static clusters, even when the expression of genes is assessed across time or developmental stages.
By taking advantage of techniques and theories from time-frequency analysis, periodic gene expression profiles are dynamically clustered based on the assumption that different spectral frequencies characterize different biological processes. A two-step cluster validation approach is proposed both to statistically estimate the optimal number of clusters and to distinguish significant clusters from noise. The resulting clusters reveal coordinated coexpressed genes. This novel dynamic clustering approach has broad applicability to a vast range of sequential data scenarios where the order of the series is of interest.
Lingling An and R. W. Doerge
Copyright © 2012 Lingling An and R. W. Doerge. All rights reserved.

Differential Expression Analysis for RNA-Seq Data
Thu, 20 Sep 2012 18:02:43 +0000
RNA-Seq is increasingly being used for gene expression profiling. In this approach, next-generation sequencing (NGS) platforms are used for sequencing. Due to their highly parallel nature, millions of reads are generated in a short time and at low cost. Analysis of the data is therefore a major challenge, and the development of statistical and computational methods is essential for drawing meaningful conclusions from such large data sets. Here, we assessed three different types of normalization (transcript parts per million, trimmed mean of M values, and quantile normalization) and evaluated whether normalized data reduce technical variability across replicates. In addition, we proposed two novel methods for detecting differentially expressed genes between two biological conditions: (i) a likelihood ratio method and (ii) a Bayesian method. Our proposed methods for finding differentially expressed genes were tested on three real datasets. Our methods performed at least as well as, and often better than, the existing methods for analysis of differential expression.
Rashi Gupta, Isha Dewan, Richa Bharti, and Alok Bhattacharya
Copyright © 2012 Rashi Gupta et al. All rights reserved.
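
Of the three normalizations named in the RNA-Seq abstract above, quantile normalization is the simplest to show in a few lines. The sketch below is a generic textbook version, not the authors' implementation: it assumes every sample has the same number of genes, breaks ties by input order rather than averaging tied ranks, and uses the hypothetical function name `quantile_normalize`.

```python
def quantile_normalize(columns):
    """Force every sample to share the same empirical distribution.

    `columns` is a list of samples, each a list of expression values
    with one entry per gene (all samples must have equal length).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]

    normalized = []
    for col in columns:
        # Gene indices ordered from lowest to highest expression in this sample.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy samples measured over three genes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(samples))  # prints [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both samples contain the same set of values (1.5, 3.5, 5.5), and only the assignment of those values to genes differs, which is what makes replicates directly comparable.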

A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm
Tue, 04 Sep 2012 13:58:35 +0000
A design of a systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for the Basic Local Alignment Search Tool (BLAST) algorithm is proposed. BLAST is a heuristic biological sequence alignment algorithm that is used by bioinformatics experts. In contrast to other designs that detect at most one hit in one clock cycle, our design applies a Multiple Hits Detection Module, a pipelined systolic array, to search for multiple hits in a single clock cycle. Further, we designed a Hits Combination Block that combines overlapping hits from the systolic array into one hit. These implementations complete the first and second steps of the BLAST architecture and achieve significant speedup compared with previously published architectures.
Xinyu Guo, Hong Wang, and Vijay Devabhaktuni
Copyright © 2012 Xinyu Guo et al. All rights reserved.

Enhancing De Novo Transcriptome Assembly by Incorporating Multiple Overlap Sizes
Mon, 23 Apr 2012 10:49:09 +0000
Background. The emergence of next-generation sequencing platforms has given rise to a new generation of assembly algorithms. Compared with Sanger sequencing data, next-generation sequence data present shorter reads, higher coverage depth, and different error profiles. These features bring new challenges for de novo transcriptome assembly. Methodology. To explore the influence of these features on assembly algorithms, we studied the relationship between read overlap size, coverage depth, and error rate using simulated data. Based on this relationship, we propose a de novo transcriptome assembly procedure, called Euler-mix, and demonstrate its performance on a real transcriptome dataset of mice. The simulation tool and evaluation tool are freely available as open source. Significance.
Euler-mix is a straightforward pipeline; it focuses on dealing with the variation in coverage depth of short-read datasets. The experimental results showed that Euler-mix improves the performance of de novo transcriptome assembly.
Chien-Chih Chen, Wen-Dar Lin, Yu-Jung Chang, Chuen-Liang Chen, and Jan-Ming Ho
Copyright © 2012 Chien-Chih Chen et al. All rights reserved.

Nonlinear Dependence in the Discovery of Differentially Expressed Genes
Thu, 12 Apr 2012 12:17:04 +0000
Microarray data are used to determine which genes are active in response to a changing cell environment. Genes are “discovered” when they are significantly differentially expressed in the microarray data collected under the differing conditions. In one prevalent approach, all genes are assumed to satisfy a null hypothesis, ℍ0, of no difference in expression. A false discovery (type 1 error) occurs when ℍ0 is incorrectly rejected. The quality of a detection algorithm is assessed by estimating its number of false discoveries, 𝔉. Work involving the second-moment modeling of the z-value histogram (representing gene expression differentials) has shown significantly deleterious effects of intergene expression correlation on the estimate of 𝔉. This paper suggests that nonlinear dependencies could likewise be important. With an applied emphasis, this paper extends the “moment framework” by including third-moment skewness corrections in an estimator of 𝔉. This estimator combines observed correlation (corrected for sampling fluctuations) with the information from easily identifiable null cases. Nonlinear-dependence modeling reduces the estimation error relative to that of linear estimation. Third-moment calculations involve empirical densities of 3×3 covariance matrices estimated using very few samples. The principle of entropy maximization is employed to connect estimated moments to 𝔉 inference. Model results are tested with BRCA and HIV data sets and with carefully constructed simulations.
J. R.
Deller Jr., Hayder Radha, J. Justin McCormick, and Huiyan Wang Copyright © 2012 J. R. Deller et al. All rights reserved.

Chemical Entity Recognition and Resolution to ChEBI Wed, 15 Feb 2012 10:04:11 +0000 Chemical entities are ubiquitous throughout the biomedical literature, and text-mining systems that can efficiently identify those entities are required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can now be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity resolution task, and 15% for the combined entity recognition and resolution task. Tiago Grego, Catia Pesquita, Hugo P. Bastos, and Francisco M. Couto Copyright © 2012 Tiago Grego et al. All rights reserved.

Signal Peptidase Complex Subunit 1 and Hydroxyacyl-CoA Dehydrogenase Beta Subunit Are Suitable Reference Genes in Human Lungs Wed, 28 Dec 2011 11:32:56 +0000 Lung cancer is a common cancer, and expression profiling can provide accurate indications to guide medical intervention. However, this requires the availability of stably expressed genes as references. Recent studies have shown that genes that are stably expressed in one tissue may not be stably expressed in other tissues, suggesting the need to identify stably expressed genes in each tissue for use as reference genes. DNA microarray analysis has been used to identify reference genes with low fluctuation. Fourteen datasets with different lung conditions were employed in our study.
The coefficient of variation, followed by NormFinder, was used to identify stably expressed genes. Our results showed that classical reference genes such as GAPDH and HPRT1 were highly variable; thus, they are unsuitable as reference genes. Signal peptidase complex subunit 1 (SPCS1) and hydroxyacyl-CoA dehydrogenase beta subunit (HADHB), which are involved in fundamental biochemical processes, demonstrated high expression stability, suggesting their suitability for human lung cell profiling. Issac H. K. Too and Maurice H. T. Ling Copyright © 2012 Issac H. K. Too and Maurice H. T. Ling. All rights reserved.

Construction of a Drug Safety Assurance Information System Based on Clinical Genotyping Tue, 29 Nov 2011 14:24:55 +0000 To capitalize on the vast potential of patient genetic information to aid in assuring drug safety, a substantial effort is needed in both the training of healthcare professionals and the operational enablement of clinical environments. Our research aims to satisfy these needs through the development of a drug safety assurance information system (GeneScription) based on clinical genotyping that utilizes patient-specific genetic information to predict and prevent adverse drug responses. In this paper, we present the motivations for this work, the algorithms at the heart of GeneScription, and a discussion of our system and its uses. We also describe our efforts to validate GeneScription through its evaluation by practicing pharmacists and pharmacy professors and its repeated use in training pharmacists. The positive assessment of the GeneScription software tool by these domain experts provides strong validation of the importance, accuracy, and effectiveness of GeneScription. John A. Springer, Nicholas V. Iannotti, Jon E. Sprague, and Michael D. Kane Copyright © 2012 John A. Springer et al. All rights reserved.
Bio301: A Web-Based EST Annotation Pipeline That Facilitates Functional Comparison Studies Tue, 22 Nov 2011 15:49:45 +0000 In this postgenomic era, a huge volume of information derived from expressed sequence tags (ESTs) has been accumulated for the functional description of gene expression profiles. Comparative studies have become increasingly important to biology researchers. To facilitate these comparative studies, we have constructed a user-friendly EST annotation pipeline with comparison tools on an integrated EST service website, Bio301. Bio301 includes regular EST preprocessing, BLAST similarity search, gene ontology (GO) annotation, statistics reporting, a graphical GO browsing interface, and microarray probe selection tools. In addition, Bio301 is equipped with statistical library comparison functions that use multiple EST libraries and their GO annotations to mine meaningful biological information. Yen-Chen Chen, Yun-Ching Chen, Wen-Dar Lin, Chung-Der Hsiao, Hung-Wen Chiu, and Jan-Ming Ho Copyright © 2012 Yen-Chen Chen et al. All rights reserved.
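
The Bio301 abstract mentions statistical comparison of EST libraries based on GO annotations but does not say which test is used. As an illustration only, the sketch below assumes a one-sided Fisher exact test (computed from the hypergeometric tail) on per-term EST counts. The function names `enrichment_p_value` and `compare_libraries`, and the example counts, are hypothetical and are not taken from Bio301.

```python
from math import comb


def enrichment_p_value(hits_a, total_a, hits_b, total_b):
    """One-sided Fisher exact p-value that a GO term is over-represented
    in library A relative to library B (assumed test, not Bio301's)."""
    population = total_a + total_b   # all ESTs in both libraries
    annotated = hits_a + hits_b      # ESTs carrying the GO term
    upper = min(total_a, annotated)  # largest possible count in A
    # Sum the hypergeometric probabilities of seeing hits_a or more in A.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, total_a - i)
        for i in range(hits_a, upper + 1)
    )
    return favourable / comb(population, total_a)


def compare_libraries(counts_a, counts_b, total_a, total_b):
    """Return {GO term: p-value} for every term seen in either library."""
    terms = sorted(set(counts_a) | set(counts_b))
    return {
        term: enrichment_p_value(
            counts_a.get(term, 0), total_a, counts_b.get(term, 0), total_b
        )
        for term in terms
    }


if __name__ == "__main__":
    # Hypothetical per-term EST counts for two libraries of 200 ESTs each.
    library_a = {"GO:0006412": 30, "GO:0008152": 50}
    library_b = {"GO:0006412": 10, "GO:0008152": 55}
    for term, p in compare_libraries(library_a, library_b, 200, 200).items():
        print(f"{term}\tp = {p:.4g}")
```

In practice, testing many GO terms at once would call for a multiple-testing correction, which this sketch leaves out.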