Research Article

A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification

Table 2

Statistics of datasets (the vocabulary in datasets has gone through data cleaning, excluding single characters and punctuations as well as retaining only the lemmatized words).

Dataset Yelp 2016 Amazon Reviews (Electronics)

# classes 5 5
# documents 4,153,150 1,689,188
# average sentences/document8.116.88
# average words/sentence 17.02 7.65
# average words/document 138.02 136.97
# maximal sentences in document 166 416
# maximal words in document1,4317,488
# words in vocabulary 155,498 66,551