Research Article

Exploiting Semantic Annotations and Q-Learning for Constructing an Efficient Hierarchy/Graph Texts Organization

Table 3

Datasets details for the experimental setup.

Dataset | Experiment | Dataset name | Description
DS1 | Experiment 1: content-based evaluation | Miller and Charles (MC)¹ | RG consists of 65 pairs of nouns extracted from WordNet, rated by multiple human annotators.
DS2 | Experiment 1: content-based evaluation | Microsoft Research Paraphrase Corpus (MRPC)² | The corpus consists of 5,801 sentence pairs collected from newswire articles, 3,900 of which were labeled as related by human annotators. The whole set is divided into a training subset (4,076 pairs, of which 2,753 are true) and a test subset (1,725 pairs, of which 1,147 are true).
DS3 | Experiment 2: coselection-based evaluation (closest-synonym detection) | British National Corpus (BNC)³ | BNC is a carefully selected collection of 4,124 contemporary written and spoken English texts, forming a 100-million-word corpus that includes near-synonym collocations.
DS4 | Experiment 2: coselection-based evaluation (closest-synonym detection) | SN (semantic neighbors)⁴ | SN relates 462 target terms (nouns) to 5,910 relatum terms with 14,682 semantic relations (7,341 are meaningful and 7,341 are random). SN contains synonyms from three sources: WordNet 3.0, Roget's thesaurus, and a synonyms database.
DS5 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | BLESS⁶ | BLESS relates 200 target terms (100 animate and 100 inanimate nouns) to 8,625 relatum terms with 26,554 semantic relations (14,400 are meaningful (correct) and 12,154 are random). Every relation has one of the following types: hypernymy, cohypernymy, meronymy, attribute, event, or random.
DS6 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | TREC⁵ | TREC includes 1,437 sentences annotated with entities and relations, each containing at least one relation. There are three types of entities: person (1,685), location (1,968), and organization (978); in addition, a fourth type, others (705), indicates that the candidate entity is none of the three types. There are five types of relations: located in (406) indicates that one location is located inside another location, work for (394) indicates that a person works for an organization, OrgBased in (451) indicates that an organization is based in a location, live in (521) indicates that a person lives at a location, and kill (268) indicates that a person killed another person. There are 17,007 pairs of entities that are not related by any of the five relations and hence have the NR relation between them, which therefore significantly outnumbers the other relations.
DS7 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | IJCNLP 2011-New York Times (NYT)⁶ | NYT contains 150 business articles from the New York Times. There are 536 instances (208 positive, 328 negative) with 140 distinct descriptors in the NYT dataset.
DS8 | Experiment 2: coselection-based evaluation (semantic relationships exploration) | IJCNLP 2011-Wikipedia⁸ | The Wikipedia personal/social relation dataset was previously used by Culotta et al. [6]. There are 700 instances (122 positive, 578 negative) with 70 distinct descriptors in the Wikipedia dataset.
DS9 | Experiment 3: task-based evaluation | Reuters-21,578⁷ | Reuters-21,578 contains 21,578 documents (12,902 are used) categorized into 10 categories.
DS10 | Experiment 3: task-based evaluation | 20 Newsgroups⁸ | The 20 Newsgroups dataset contains 20,000 documents (18,846 are used) categorized into 20 categories.

¹ Available at http://www.cs.cmu.edu/~mfaruqui/suite.html.
² Available at http://research.microsoft.com/en-us/downloads/.
³ Available at http://corpus.byu.edu/bnc/.
⁴ Available at http://cental.fltr.ucl.ac.be/team/~panchenko/sre-eval/sn.csv.
⁵ Available at http://l2r.cs.uiuc.edu/~cogcomp/Data/ER/conll04.corp.
⁶ Available at http://www.mysmu.edu/faculty/jingjiang/data/IJCNLP2011.zip.
⁷ Available at http://mlr.cs.umass.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection.
⁸ Available at http://www.csmining.org/index.php/id-20-newsgroups.html.
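
Several of the counts quoted in Table 3 can be cross-checked with simple arithmetic, and the class imbalance noted for TREC can be quantified. The following is a minimal sketch in Python, not part of the original experimental setup: the numbers are copied from the table, while the variable and function names (`mrpc`, `split_is_consistent`, `class_shares`, and so on) are illustrative assumptions.

```python
# Counts copied from Table 3; all names below are illustrative, not from the paper.
mrpc = {
    "total": 5801,          # sentence pairs in the whole corpus
    "positive": 3900,       # pairs labeled as related
    "train": (4076, 2753),  # (size, positives) of the training subset
    "test": (1725, 1147),   # (size, positives) of the test subset
}

trec_relations = {
    "located in": 406,
    "work for": 394,
    "OrgBased in": 451,
    "live in": 521,
    "kill": 268,
}
TREC_NR = 17007  # entity pairs with none of the five relations

binary_sets = {
    "NYT": (208, 328),        # (positive, negative) instances
    "Wikipedia": (122, 578),
}


def split_is_consistent(d):
    """Return True when train + test sizes and positives match the stated totals."""
    (n_train, p_train), (n_test, p_test) = d["train"], d["test"]
    return (n_train + n_test == d["total"]
            and p_train + p_test == d["positive"])


def class_shares(counts):
    """Return each class's fraction of all labeled items."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    print("MRPC split consistent:", split_is_consistent(mrpc))
    shares = class_shares({**trec_relations, "NR": TREC_NR})
    for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"TREC {label:12s} {share:.3f}")
    for name, (pos, neg) in binary_sets.items():
        print(f"{name}: {pos + neg} instances, positive rate {pos / (pos + neg):.3f}")
```

Under these figures the NR label accounts for roughly nine out of ten TREC entity pairs, which suggests that accuracy alone would be a weak measure on this dataset and that per-relation scores are more informative.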