Review Article

A Review of Recent Advancement in Integrating Omics Data with Literature Mining towards Biomedical Discoveries

Table 2: Standard corpora for the omics domain.

| Corpus | Text mining evaluation task | Brief introduction |
| --- | --- | --- |
| JNLPBA (Joint Workshop on NLP in Biomedicine and Its Applications) [18] | Gene/protein concept extraction | The corpus consists of 2,000 PubMed abstracts as training data and 404 PubMed abstracts as test data. |
| BioCreAtivE 2004 Task 1A dataset [19] | Gene/protein concept extraction | The corpus consists of 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data. |
| BioCreAtivE 2 Gene Mention (GM) dataset [20] | Gene/protein concept extraction | The corpus consists of 15,000 PubMed sentences as training data and 5,000 PubMed sentences as test data. |
| AIMED [21] | Protein-protein interaction | The corpus consists of 225 PubMed abstracts that contain 1,987 sentences with 4,075 protein mentions. |
| HPRD50 (Human Protein Reference Database) [22] | Protein-protein interaction | The corpus consists of sentences annotated with protein-protein interactions from 50 PubMed abstracts. |
| BioInfer (Bio Information Extraction Resource) [23] | Protein, gene, and RNA relationships | The corpus consists of 1,100 sentences annotated with concept names, relationships, and syntactic dependencies. |
| IEPA (Interaction Extraction Performance Assessment) [24] | Protein-protein interaction | The corpus consists of more than 200 PubMed sentences annotated with protein-protein interactions. |
| BioCreAtivE 2.5 Elsevier Corpus [25] | Protein-protein interaction | The corpus consists of 61 PubMed articles as training data and 62 PubMed articles as test data. |
| BC4GO Corpus [26] | Gene ontology | The corpus consists of 1,356 distinct GO terms from 200 PubMed articles. |
| GREC Corpus [27] | Gene regulation and gene expression events | The corpus consists of 240 PubMed abstracts with annotations on gene regulation and gene expression events. |
| GETM [28] | Gene expression events | The corpus consists of 150 PubMed abstracts with annotations for gene expression events. |
| AnEM [29] | Tissue, cell, developing anatomical structure, cellular component | The corpus consists of 500 PubMed sentences with annotations on a variety of biomedical concepts. |
| CellFinder Corpus [30] | Anatomical parts, cell lines, cell types, species, and cell components | The corpus consists of annotations from 10 full-text PubMed articles. |