Research Article

The Effects of Feature Optimization on High-Dimensional Essay Data

Table 1

List of features used to train the AES model.

Category Types of features

Basic (i) Number of characters, words, vocabularies, and sentences
(ii) Number of characters, words, and vocabularies excluding stop words
(iii) Number of vocabularies with more than characters
   (a) e.g.,
(iv) Number of vocabularies with more than characters and below characters
   (a) e.g.,
(v) Number of vocabularies per frequency of word   
(vi) Frequency of the most frequent words excluding stop words
(vii) Square of the number of vocabularies
(viii) Average length of word and sentence
(ix) Variance of sentence length
(x) Average distance between the same words and lemmas
(xi) Number of POS types
(xii) Average number of POS types per sentence
(xiii) Average frequency of word per POS type
(xiv) Maximum frequency of word per POS type
(xv) Ratio of each POS type (by word and character)
(xvi) Number of words and vocabularies per POS type

Dictionary (i) Ratio of words and vocabularies in each dictionary
   (a) Elementary, middle, and GRE dictionary
(ii) Ratio of advanced words and vocabularies
   (a) Number of words in elementary and middle dictionary/number of words in GRE dictionary

n-gram (i) Number of n-gram types (word bigram, POS trigram)
(ii) Maximum frequency of n-gram
(iii) Average frequency of n-gram types
(iv) Average frequency, ratio of n-gram type appeared over times
(v) Ratio of n-gram type appeared over times
   (a) e.g.,
(vi) Ratio of n-gram type appeared over times and below times
   (a) e.g.,
(vii) Average, maximum, minimum, and variance of perplexity of word or POS sequence
(viii) Difference between maximum and minimum perplexity of word or POS sequence
(ix) Number of sentences below perplexity threshold

Advanced NLP (i) Average number, maximum frequency, and ratio of compound nouns, noun phrases, and named entities per sentence
(ii) Frequency of discourse marker
(iii) Weighted sum of discourse marker
(iv) Number of mechanical grammatical errors
(v) Number of pattern grammatical errors
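
To make a few of the table entries concrete, the following is a minimal sketch of how some Basic and n-gram features might be computed. It is an illustration under stated assumptions, not the extraction pipeline used in the study: "vocabularies" is read here as distinct word types, the regex tokenizer and sentence splitter are deliberately naive, and `STOP_WORDS`, `basic_features`, and `bigram_features` are hypothetical names introduced only for this example.

```python
import re
from collections import Counter
from statistics import mean, pvariance

# Hypothetical, tiny stop-word list for illustration only (an assumption).
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "and", "to", "in"})


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    # Lowercased alphabetic tokens; apostrophes are kept inside words.
    return re.findall(r"[a-z']+", text.lower())


def basic_features(text):
    """Compute a subset of the 'Basic' features, items (i), (ii), (vii)-(ix)."""
    sentences = split_sentences(text)
    words = tokenize(text)
    content = [w for w in words if w not in STOP_WORDS]
    sent_lengths = [len(tokenize(s)) for s in sentences]
    vocab = set(words)  # assumption: "vocabularies" means distinct word types
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_vocab": len(vocab),
        "n_sentences": len(sentences),
        "n_words_no_stop": len(content),
        "n_vocab_no_stop": len(set(content)),
        "vocab_squared": len(vocab) ** 2,
        "avg_word_len": mean(len(w) for w in words) if words else 0.0,
        "avg_sent_len": mean(sent_lengths) if sent_lengths else 0.0,
        "var_sent_len": pvariance(sent_lengths) if sent_lengths else 0.0,
    }


def bigram_features(text):
    """Compute word-bigram versions of the 'n-gram' features, items (i)-(iii)."""
    words = tokenize(text)
    counts = Counter(zip(words, words[1:]))
    if not counts:
        return {"n_bigram_types": 0, "max_bigram_freq": 0, "avg_bigram_freq": 0.0}
    return {
        "n_bigram_types": len(counts),
        "max_bigram_freq": max(counts.values()),
        "avg_bigram_freq": sum(counts.values()) / len(counts),
    }
```

Calling `basic_features(essay)` and `bigram_features(essay)` on an essay string returns plain dictionaries that can be merged into a single feature vector. The POS-based, dictionary-based, perplexity, and Advanced NLP features would additionally need a tagger, word lists, and a language model, none of which are specified in the table, so they are omitted here.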