Abstract

Key n-gram extraction can be seen as extracting n-grams that can distinguish different registers. Keyword extraction models (when n = 1, the 1-gram is a keyword) are generally designed from two aspects: feature extraction and model design. By summarizing the advantages and disadvantages of existing models, we propose a novel key n-gram extraction model, the "attentive n-gram network" (ANN), based on the attention mechanism and a multilayer perceptron, in which the attention mechanism scores each n-gram in a sentence by mining the internal semantic relationships between words, and the importance of each n-gram is given by its score. Experimental results on a real corpus show that the key n-grams extracted by our model distinguish a novel, news, and a textbook very well; the accuracy of our model is significantly higher than that of the baseline model. We also conduct clustering experiments on the key n-grams extracted from these registers, which turn out to be well clustered. Furthermore, we make some statistical analyses of the key n-gram extraction results and find that the key n-grams extracted by our model are highly explanatory in linguistic terms.

1. Introduction

A register refers to the common vocabulary, sentence structure, rhetorical means, and other characteristics of the language people use when they communicate in various social activities, with different persons, and in different environments [1]. With the development of the information age, a large amount of register information is produced in production and daily life, and in various Internet applications the register plays a pivotal role. To better realize automatic text processing, we need to distinguish different registers. As components of texts, the words in a sentence contain rich semantic information and play very important roles in distinguishing registers. However, previous studies have demonstrated that n-gram-based features give better results than single words in register classification tasks [2].

Key n-gram extraction can thus be thought of as extracting the n-grams that distinguish different registers.

The existing models are mainly based on single words, and there are few studies on the extraction of key n-grams. Many keyword extraction models have been put forward and have achieved significant effect, owing to the development of deep learning models and attention mechanisms [3–6]. To some extent, each feature extraction model has its own advantages and disadvantages.

In terms of keyword extraction, previous scholars mainly proceeded from two aspects: one is feature extraction and the other is model design. The features extracted are mainly the word frequency, term frequency-inverse document frequency (TF-IDF), latent Dirichlet allocation (LDA), synonym sets, NP phrases, syntactic information, word2vec, or other domain-specific knowledge, such as tags and candidate keywords [7]. The model designs built on these features mainly fall into three categories: statistical language models, graph-based models, and machine learning models.

1.1. Statistical Language Models

These models combine linguistic knowledge with statistical methods to extract keywords. Such keyword extraction models are based on the word frequency, POS, lexical chains, n-grams, etc. [6, 8, 9]. The advantages of these methods are their simple implementation and effective extraction of keywords. Unfortunately, the features chosen by these methods are often based on frequency or countability, without considering the semantic relationship between words and sentences.

These methods lead to several problems. High-frequency words are sometimes not keywords; for example, many stop words in linguistics appear many times (in Chinese, most stop words are auxiliary words), but they are not important words for registers. Even though some models select high-frequency words after removing stop words, they are still not accurate in the semantic expression of the registers. This is especially true in a novel with many dialogues in it: we know that in conversation many words are often omitted according to the context, and if the stop words are removed, the meaning of the sentence may change completely. Similar problems exist for features based on TF-IDF methods.

1.2. Graph-Based Models

Compared with statistical language models, these models map linguistic features to graph structures, in which words in sentences are represented by nodes in graphs and the relationship between words is represented by edges. Then, the linguistic problems are transformed into graphical problems and the graphical algorithms are applied to feature extraction. In recent years, researchers have tried to use the graphical model to mine keywords in texts. Biswas et al. proposed a method of using collective node weight based on a graph to extract keywords. Their model determined the importance of keywords by calculating the impact parameters, such as centrality, location, and neighborhood strength, and then chose the most important words as the final keywords [10]. Zhai et al. proposed a method to extract keywords, which constructed a bilingual word set and took it as a vertex, using the attributes of Chinese-Vietnamese sentences and bilingual words to construct a hypergraph [11].

These graph-based models transform abstract sentences into intuitive graphs for processing and use graph algorithms to extract keywords. The disadvantage is that these algorithms rest on substantial graph theory, which requires researchers to have strong knowledge of both linguistics and graph theory; only then can the two theories be well connected. Besides, the graphs built from texts usually have thousands or even millions of nodes and relations, which brings efficiency problems to the graph algorithms.

1.3. Machine Learning Models

With the development of the Internet and the growing size of corpora, there are more and more corpus-based studies [12, 13]. It is an inevitable trend to use machine learning models to mine their internal regularities, and many scholars have employed machine learning models to extract keywords. Uzun proposed a method based on the Bayesian algorithm to extract keywords according to the frequency and position of words in the training set [14]. Zhang et al. proposed extracting keywords from global and local context information with the SVM algorithm [15]. Compared with statistical language models, these early machine learning algorithms based on word frequency, location information, and global and local context information made significant improvements in feature extraction. In fact, judging from the features selected by these models, scholars have tried to consider feature selection from more aspects; it is just that these features need to be extracted manually [14–17]. With the development of computer hardware and neural networks, more complex and efficient models emerged, namely, deep learning models, along with various feature representation methods such as word2vec and doc2vec. Many scholars began to use deep learning models to extract keywords. Wang and Zhang proposed a method based on a complex combination model, a bidirectional long short-term memory (LSTM) recurrent neural network (RNN), which achieved outstanding results [3–5]. It can be said that keyword extraction based on deep learning models not only improved the accuracy of keyword extraction significantly but also enriched the corresponding feature representations. The disadvantage is that models like the LSTM have high requirements on computer hardware and generally need a long time to train.

The attention mechanism was proposed by Bahdanau et al. in 2014 [18]. Models with attention mechanisms are widely used in various fields for their transparency and good performance in aggregating bunches of features. Bahdanau et al. applied the attention mechanism to a machine translation system, which improved its accuracy significantly; in this process, the attention mechanism was used to extract the important words in sentences [18]. Pappas and Popescu-Belis proposed a document classification method that applied the attention mechanism to extract the words distinguishing different documents, and the classification accuracy was greatly improved [19]. The significantly improved classification accuracy implies that the words extracted by the attention mechanism can distinguish different documents well. Similarly, applications of the attention mechanism in other fields also prove this point [20].

By analyzing and summarizing the advantages and disadvantages of these models, we propose a simple and efficient model based on the attention mechanism and a multilayer perceptron (MLP) to extract key n-grams that can distinguish different registers. We call this model the "attentive n-gram network" (ANN) for short; its structure is shown in Figure 1. The ANN model consists of eight parts: the input layer, embedding layer, n-gram vectors, attention layer, n-gram sentence vector, concatenation, classification, and output. In other words, the input layer is the sentence we want to classify, the embedding layer vectorizes the words in the sentence, and the n-gram vectors convert the word vectors into the corresponding n-gram representations. The attention layer scores the n-grams in the sentence. The n-gram sentence vector is a weighted sum of the n-gram vectors, using the output of the attention layer as weights. Concatenation concatenates the sentence vectors from n-grams with different n as the input to the classifier. Classification is the classifier, and the output layer includes three parts: the category of the sentence, the n-grams, and the scores corresponding to the n-grams. Figure 1 illustrates this with an example. Experimental results show that our ANN model achieves significant and consistent improvement compared with the baseline model. In particular, our work contributes in the following aspects:
(1) We use the attention mechanism to extract key n-grams that can distinguish different registers.
(2) Compared with machine learning methods such as SVM and Bayesian classifiers, the classification accuracy is significantly improved by ANN, which exploits semantic information.
(3) During training, the attention mechanism of ANN assigns low scores to stop words, so it can filter stop words automatically.

2. Methodology

2.1. Attentive n-Gram Network

In computer science, deep learning has become a popular method and has shown its powerful modeling ability in many areas, such as computer vision and natural language processing [21]. Among the basic neural network design patterns, there is a structure called attention mechanism which can automatically analyze the importance of different information. In the field of natural language processing, such as in machine translation, people use the attention mechanism to calculate the source keywords [18].

In our case, the task is to analyze which keywords or 2-gram phrases carry the key information for differentiating registers. We first conduct a classification task on texts of different registers and apply the attention mechanism to the words. The attention mechanism calculates the importance of the words for identifying the registers: words that help the classification task are assigned higher weights. Words with higher weights are more important to the register, in contrast to those that appear in every register, e.g., stop words.

Formally, suppose we have a word dictionary $D$ and a set of sentences $S$. Each sentence $s_i \in S$ consists of a sequence of words $s_i = \{w_1, w_2, \ldots, w_{L_i}\}$, where $L_i$ is the length of the sentence $s_i$. Here, we highlight vectors in bold. For example, $\mathbf{w}_j$ and $\mathbf{s}_i$ are the vector of word $w_j$ and the sentence vector of sentence $s_i$, respectively. Word vectors in our model can be randomly initialized or pretrained with word2vec, which corresponds to the embedding layer in Figure 1.

2.1.1. Attention Mechanism on n-Grams

The attention mechanism in our model takes n-gram vectors as inputs and returns sentence vectors as outputs. In particular, the vector of an n-gram is formed by the concatenation of word vectors. For example, the sentence $s_i$ can also be represented as n-grams $s_i = \{g_1, g_2, \ldots, g_{L_i - n + 1}\}$, where $g_j$ is the $j$th n-gram of the sentence and its vector is $\mathbf{g}_j = \mathbf{w}_j \,|\, \mathbf{w}_{j+1} \,|\, \cdots \,|\, \mathbf{w}_{j+n-1}$ ("|" here means concatenation); these are the n-gram vectors in Figure 1. Then, the attention network first uses a fully connected layer to calculate the latent attention vector $\mathbf{u}_j$ of each n-gram $g_j$:
$$\mathbf{u}_j = \sigma(\mathbf{W}_a \mathbf{g}_j + \mathbf{b}_a), \qquad (1)$$
where $\mathbf{W}_a \in \mathbb{R}^{d_a \times nd}$ and $\mathbf{b}_a \in \mathbb{R}^{d_a}$ are the parameters of the attention network, $d_a$ denotes the hidden layer size of the attention network, and $d$ is the size of the word vectors ($nd$ is the size of the n-gram vectors). $\sigma(\cdot)$ is the activation function, which introduces nonpolynomial factors to the neurons [22]. It has been proved that multilayer feedforward networks with a nonpolynomial activation function can approximate any function [23]; therefore, it is common to use a nonpolynomial activation function after a fully connected layer. The vector $\mathbf{u}_j$ is the hidden attention vector, which contains the information about word importance. Then, a weighted sum is conducted over the latent attention vector:
$$e_j = \mathbf{v}_a^{\top} \mathbf{u}_j, \qquad (2)$$
where $\mathbf{v}_a \in \mathbb{R}^{d_a}$ contains the weights over the dimensions of $\mathbf{u}_j$ and is a parameter of the attention network. The result $e_j$ is the score the attention mechanism gives to the n-gram $g_j$. Note that $e_j$ is unnormalized; if the scores of different n-grams were directly used to do a weighted sum over the n-gram vectors to form the sentence vector, the length and scale of the sentence vectors would be out of control. To normalize the weights, a softmax function is conducted over all $e_j$ (in mathematics, the softmax function, also known as softargmax [24] or the normalized exponential function [25], is a function that takes as input a vector of real numbers and normalizes it into a probability distribution consisting of probabilities):
$$\alpha_j = \frac{\exp(e_j)}{\sum_{k} \exp(e_k)}, \qquad (3)$$
where $\alpha_j$ is the attention weight of the n-gram $g_j$ in sentence $s_i$. Note that $\sum_j \alpha_j = 1$. Each of the n-grams in the sentence can be scored by equations (1), (2), and (3). For example, for $s_i$ = {这是我妹妹。} in Figure 1, when $n = 1$, each word in sentence $s_i$ receives a score; similarly, when $n = 2$, each 2-gram in $s_i$ receives a score. In other words, equations (1), (2), and (3) belong to the attention layer, through which the n-grams in a sentence are scored, corresponding to the attention layer in Figure 1.

The n-gram sentence vector is formed as follows:
$$\mathbf{s}_i^{(n)} = \sum_{j} \alpha_j \mathbf{g}_j. \qquad (4)$$

In general, the sentence vector under the n-gram setting comes from a weighted sum of the n-gram vectors $\mathbf{g}_j$, but the weights are dynamically generated through the attention network. Different n-grams have different weights in different sentences. The attention network learns how to evaluate their importance and returns the weights during the training process.

Specifically, when $n = 1$, the sentence vector is a weighted sum of the word vectors $\mathbf{w}_j$. To take different n-grams into consideration, we concatenate the sentence vectors for different $n$. For example, in our further experiments, when considering both words and 2-grams, e.g., ANN (1,2-gram), the final sentence representation is as follows:
$$\mathbf{s}_i = \mathbf{s}_i^{(1)} \,|\, \mathbf{s}_i^{(2)}. \qquad (5)$$

This part corresponds to concatenation of Figure 1. The final representations of sentence vectors are then fed into higher layers for language register classification.
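As a minimal illustration (not the exact implementation used in our experiments), the following PyTorch sketch realizes equations (1)–(5); the module name NgramAttention, the choice of tanh as the nonpolynomial activation, and all dimensions are placeholder assumptions.

```python
# A minimal PyTorch sketch of the attention layer over n-gram vectors
# (equations (1)-(5)); names and sizes are illustrative only.
import torch
import torch.nn as nn


class NgramAttention(nn.Module):
    """Scores each n-gram in a sentence and returns a weighted sentence vector."""

    def __init__(self, word_dim: int, n: int, attn_dim: int):
        super().__init__()
        self.n = n
        self.proj = nn.Linear(n * word_dim, attn_dim)    # W_a, b_a in eq. (1)
        self.score = nn.Linear(attn_dim, 1, bias=False)  # v_a in eq. (2)

    def forward(self, word_vecs: torch.Tensor):
        # word_vecs: (sentence_length, word_dim)
        # Build n-gram vectors by concatenating n consecutive word vectors.
        grams = torch.cat(
            [word_vecs[i : len(word_vecs) - self.n + 1 + i] for i in range(self.n)],
            dim=-1,
        )                                                # (L - n + 1, n * word_dim)
        u = torch.tanh(self.proj(grams))                 # eq. (1), tanh as the activation
        e = self.score(u).squeeze(-1)                    # eq. (2), one score per n-gram
        alpha = torch.softmax(e, dim=0)                  # eq. (3), normalized weights
        sent_vec = (alpha.unsqueeze(-1) * grams).sum(0)  # eq. (4), weighted sum
        return sent_vec, alpha


# Example: concatenate the 1-gram and 2-gram sentence vectors (eq. (5)).
emb = nn.Embedding(5000, 32)                 # vocabulary and word vector sizes are placeholders
sentence = torch.tensor([17, 256, 3, 981])   # word indices of one toy sentence
vecs = emb(sentence)
s1, a1 = NgramAttention(32, n=1, attn_dim=64)(vecs)
s2, a2 = NgramAttention(32, n=2, attn_dim=64)(vecs)
sentence_vec = torch.cat([s1, s2], dim=-1)   # input to the classification module
```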

2.1.2. Register Classification

We utilize a multilayer perceptron (MLP) [21] to classify the registers. An MLP is a kind of feedforward artificial neural network, which consists of at least three layers of nodes: the input layer, the hidden layer, and the output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. The MLP uses the supervised learning technique called back propagation to train and distinguish different registers. In our paper, this module takes the sentence vector as input and returns the probabilities of the sentence coming from the different registers as output; it corresponds to the classification module in Figure 1, whose structure is shown in the red part of Figure 1.

Although the task of our model is the classification of registers, what we really focus on are the different keywords used in different registers, represented by the attention weights. It is not necessary to design a complex or elaborate classification module, because what we want is actually a powerful attention network, as mentioned in Section 2.1.1. Suppose that $C$ is the set of all different registers and $|C|$ is the number of classes. In an efficient and effective way, we use two fully connected layers to build the classification module:
$$\mathbf{o} = \mathbf{W}_2\, \sigma(\mathbf{W}_1 \mathbf{s}_i + \mathbf{b}_1) + \mathbf{b}_2, \qquad (6)$$
where $\mathbf{W}_1 \in \mathbb{R}^{d_h \times d_s}$, $\mathbf{b}_1 \in \mathbb{R}^{d_h}$, $\mathbf{W}_2 \in \mathbb{R}^{|C| \times d_h}$, and $\mathbf{b}_2 \in \mathbb{R}^{|C|}$ are the model parameters, $d_s$ is the size of the sentence vector $\mathbf{s}_i$, and $d_h$ is the size of the hidden layer. Then, $\mathbf{o}$ has size $|C|$ and contains the unnormalized probabilities of belonging to the different registers. To normalize the probabilities, a softmax layer is applied to $\mathbf{o}$:
$$p_c = \frac{\exp(o_c)}{\sum_{c'} \exp(o_{c'})}, \qquad (7)$$
where $o_c$ is the $c$th value of $\mathbf{o}$ and $p_c$ is the probability of the sentence belonging to class $c$. To give the prediction, we let the class corresponding to the maximum $p_c$, i.e., $\hat{c} = \arg\max_c p_c$, be the predicted class.
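A corresponding sketch of the two-layer classification module of equations (6) and (7) is given below; the layer sizes and class ordering are placeholders rather than the actual configuration.

```python
# A sketch of the two-layer MLP classification module (equations (6)-(7));
# layer sizes are placeholders, not the paper's actual settings.
import torch
import torch.nn as nn


class RegisterClassifier(nn.Module):
    def __init__(self, sent_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sent_dim, hidden_dim),    # first fully connected layer
            nn.Tanh(),                          # nonlinear activation
            nn.Linear(hidden_dim, num_classes)  # unnormalized class scores o
        )

    def forward(self, sentence_vec: torch.Tensor) -> torch.Tensor:
        logits = self.net(sentence_vec)
        return torch.softmax(logits, dim=-1)    # eq. (7): probabilities over registers


clf = RegisterClassifier(sent_dim=96, hidden_dim=64, num_classes=3)  # novel, news, textbook
probs = clf(torch.randn(96))
predicted_register = int(torch.argmax(probs))   # class with the maximum probability
```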

The model is trained with the cross-entropy loss [26], a well-known classification loss in machine learning:
$$\mathcal{L} = -\sum_{s_i \in S} \log p_{y_i}, \qquad (8)$$
where $y_i$ is the label (real class) of the sentence $s_i$, and the loss function is used to optimize the model. Usually, the closer the loss is to 0, the better the model will be. In our work, we use the Adam algorithm as the optimizer [27].
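The following sketch shows one training step with the cross-entropy loss of equation (8) and the Adam optimizer, reusing the sketches above; the learning rate, the toy batch, and the labels are illustrative assumptions.

```python
# A sketch of one training step with the cross-entropy loss (eq. (8)) and Adam;
# NgramAttention and RegisterClassifier are the sketches above, and the batch
# here is synthetic.
import torch
import torch.nn as nn

emb = nn.Embedding(5000, 32)
attn1 = NgramAttention(32, n=1, attn_dim=64)
attn2 = NgramAttention(32, n=2, attn_dim=64)
clf = RegisterClassifier(sent_dim=96, hidden_dim=64, num_classes=3)
params = (list(emb.parameters()) + list(attn1.parameters())
          + list(attn2.parameters()) + list(clf.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)   # learning rate is a placeholder
loss_fn = nn.NLLLoss()                          # cross entropy on log-probabilities

sentences = [torch.tensor([17, 256, 3, 981]), torch.tensor([5, 44, 8])]
labels = torch.tensor([0, 2])                   # 0 = novel, 1 = news, 2 = textbook

optimizer.zero_grad()
log_probs = []
for sent in sentences:
    vecs = emb(sent)
    s = torch.cat([attn1(vecs)[0], attn2(vecs)[0]], dim=-1)
    log_probs.append(torch.log(clf(s) + 1e-12))  # log of the softmax probabilities
loss = loss_fn(torch.stack(log_probs), labels)   # eq. (8)
loss.backward()
optimizer.step()
```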

2.2. Key n-Gram Extraction

After training, ANN gives low weights to most of the n-grams in a sentence, because these n-grams are short, common items that frequently occur in many sentences and are therefore not representative. Suppose that a feature $f$ occurs in $m$ documents and its weights in these documents are $\alpha_{f,1}, \ldots, \alpha_{f,m}$; then, the feature importance can be taken as a weighted average:
$$I_f = \frac{1}{m} \sum_{k=1}^{m} \alpha_{f,k}\, N_k, \qquad (9)$$
where $N_k$ is the number of input features of document $k$ (e.g., the number of words when the input features are words). It normalizes the importance, because getting a high weight in a long document is more difficult. The features can then be sorted according to their importance, and the features with importance higher than a predefined threshold of 1.0 are selected.
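A small sketch of this selection procedure follows, under the assumption that the attention weights have already been collected into one dictionary per document; the toy weights are invented for illustration.

```python
# A sketch of the key n-gram selection in equation (9): importance is the
# average over documents of (attention weight x number of features in the
# document), and n-grams above a threshold of 1.0 are kept.
from collections import defaultdict

def key_ngram_importance(doc_weights, threshold=1.0):
    """doc_weights: list of dicts, one per document, mapping n-gram -> attention weight."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for weights in doc_weights:
        n_features = len(weights)               # normalizer N_k for this document
        for gram, alpha in weights.items():
            sums[gram] += alpha * n_features
            counts[gram] += 1
    importance = {g: sums[g] / counts[g] for g in sums}
    return {g: imp for g, imp in importance.items() if imp > threshold}

# Toy usage with made-up weights:
docs = [{"经济": 0.4, "市场": 0.3, "的": 0.05},
        {"经济": 0.5, "进行": 0.1, "的": 0.02}]
print(key_ngram_importance(docs))   # only "经济" passes the 1.0 threshold here
```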

2.3. Type/Token Ratios (TTR)

TTR is the ratio of types to tokens in the corpus. The so-called type count refers to how many different words are in the text, and the token count refers to how many words are in the text in total. To some extent, the type/token ratio reflects the richness of the words in the text. However, the TTR calculated in this way is influenced by the length of the text, so here we use a modified method to calculate TTR, Herdan's log TTR [28, 29]. The formula is as follows:
$$\mathrm{TTR} = \frac{\log V}{\log N}, \qquad (10)$$
where $V$ is the number of types and $N$ is the number of tokens.

In our further experiments, we need to calculate the log TTR of the different registers to measure their lexical richness, namely, $\mathrm{TTR}_{novel}$, $\mathrm{TTR}_{news}$, and $\mathrm{TTR}_{textbook}$.
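For illustration, a short sketch of Herdan's log TTR on a pre-tokenized text follows; the toy token list is not from the real corpus.

```python
# A sketch of Herdan's log TTR (eq. (10)) on a pre-tokenized register.
import math

def log_ttr(tokens):
    types = set(tokens)                          # distinct word forms
    return math.log(len(types)) / math.log(len(tokens))

novel_tokens = ["他", "笑", "了", "他", "走", "了"]  # toy example, not the real corpus
print(round(log_ttr(novel_tokens), 3))           # 4 types over 6 tokens -> about 0.774
```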

3. Experiment

3.1. Datasets

Experiments and further linguistic analyses are conducted on three corpus datasets. The Novel contains 20 texts, among which 12 are novels written by Mo Yan and 8 novels by Yu Hua. The novels by Mo Yan are Cotton Fleece, Breast and Buttocks, Red Sorghum, Mangrove, Dionysian, Life and death Fatigue, Thirteen Steps, Herbivorous Family, 41 Guns, Sandalwood Penalty, Paradise Garlic Song, and Frog. Those by Yu Hua are Seventh Day, Classical Love, To Live, Reality, Brothers, Brothers-2, Xu Sanguan Sells Blood, and Shouting in Drizzle. In general, Mo Yan and Yu Hua’s novels are mainly based on depicting character stories, further revealing the shortcomings behind the social background and the fate of the characters themselves.

The news is a public dataset (https://www.sogou.com/labs/resource/list_yuliao.php), which covers ten topics including domestic, international, sports, social, stock, hot spots, education, health, finance, and real estate. This corpus is publicly available and has been used by many scholars in their research.

The textbook mainly includes LuXun's novels "KongYiji", "Hometown", "AQ True Biography", and "Blessing", as well as Lao She's "Camel Xiangzi", Shakespeare's play "Romeo and Juliet", Gorky's "Sea Swallow", "How Steel Is Made", "Honest Children", Zhu Ziqing's "Prose", "Back", "Hurry", "The Analects", etc. It can be seen that the textbook is a collection of registers, mainly educational articles selected for students to learn.

The statistics of these datasets are shown in Table 1. Here, the training data and test data are 80% and 20% of the novel, news, and textbook corpora, which we use to train and test the models, respectively. Moreover, to train the model in a better way, we divided the datasets into different proportions; that is, the ratios of the training set to the test set were 0.7 : 0.3, 0.8 : 0.2, and 0.9 : 0.1. We found that the model achieves the accuracy shown in Table 2 when the ratio of the training set to the test set is 0.8 : 0.2.

3.2. Research Procedures

Our experiments are divided into the following steps, as shown in Figure 2. Next, we describe each part of the flow chart in Figure 2 in detail.
(1) Corpus preprocessing includes the corpus set, preprocessing, and corpus vectorization. Preprocessing uses toolkits to clean the data and segment words and sentences; the main toolkits are Python 3.6 (https://www.python.org/downloads/release/python-360/) and Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/). Corpus vectorization translates each sentence into the corresponding sequence of word indices (a small sketch of this step is given after this list). The result corresponds to the "input layer" shown in Figure 1.
(2) The model consists of two parts, the attention mechanism and the MLP classifier. The attention mechanism scores every n-gram in every sentence using equations (1), (2), and (3), as described in Section 2.1. The MLP classifier performs the register classification. These two parts correspond to the attention layer and the classification module in Figure 1, and their working process is described in Figure 1.
(3) Key n-gram extraction averages the scores of each n-gram in each register, whether it appears once or many times. When an n-gram appears in all three registers at the same time, it is regarded as a key n-gram of the register in which it receives the highest score.
(4) Key n-gram analyses consist of key 1,2-gram clustering and key n-gram analyses, in which key 1,2-gram clustering clusters the key n-grams extracted in the previous step, and the key n-gram analyses carry out linguistic analyses on the clustering results as well as statistical analyses of the extracted key n-grams.
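As referenced in step (1), the following is a minimal sketch of the corpus vectorization step; the vocabulary-building scheme and toy sentences are illustrative assumptions, not the exact preprocessing pipeline.

```python
# A minimal sketch of corpus vectorization: each segmented sentence is mapped
# to a sequence of word indices (the "input layer" of Figure 1).
def build_vocab(sentences):
    vocab = {"<unk>": 0}                 # reserve an index for unknown words
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(sentence, vocab):
    return [vocab.get(word, vocab["<unk>"]) for word in sentence]

segmented = [["这", "是", "我", "妹妹", "。"], ["市场", "经济", "发展", "。"]]  # toy sentences
vocab = build_vocab(segmented)
indexed = [vectorize(s, vocab) for s in segmented]
```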

In addition, the most important task is to find out the key -grams of each register.

3.3. Experimental Settings

To train our model, we employ grid search to select the best combination of parameters; these parameters include the learning rate and the batch size. Also, since our inputs are sentence vectors, we need to set the length of each sentence. According to Figure 3(a), the first quartile of the sentence length is 10, the average sentence length is 20, the third quartile is 40, and the longest sentence has 128 words. Since our corpus is composed of three registers, we also calculate the average sentence length of each register, which serves as a reference value for choosing the sentence length parameter. From Figure 3(b), the average sentence lengths of the novel and the textbook are close to 20 and the average sentence length of the news is close to 30. Hence, we choose the candidate sentence lengths from these reference values. The dictionary size is the total number of words in these three registers, and the word vector size is 32. The best combination of parameters is shown in bold. For the other parameters, which have less impact on our model, we adopt the default values.
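The sketch below illustrates such a grid search; the candidate values and the train_and_evaluate helper are hypothetical placeholders, not the grids actually searched.

```python
# A sketch of the grid search over learning rate, batch size, and sentence
# length; all candidate values are illustrative only.
import random
from itertools import product

def train_and_evaluate(lr, batch_size, max_len):
    # Placeholder standing in for training ANN with these settings and
    # returning the test-set accuracy.
    return random.random()

learning_rates = [1e-2, 1e-3, 1e-4]
batch_sizes = [32, 64, 128]
sentence_lengths = [20, 30, 40]      # guided by the average lengths in Figure 3(b)

best = None
for lr, bs, max_len in product(learning_rates, batch_sizes, sentence_lengths):
    acc = train_and_evaluate(lr, bs, max_len)
    if best is None or acc > best[0]:
        best = (acc, lr, bs, max_len)
print("best accuracy %.3f with lr=%g, batch=%d, len=%d" % best)
```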

To reduce the impact of the different sizes of the corpora, we adopt random sampling and take corpora of equal size as the training set and test set for our model. We adopt accuracy to evaluate our model; accuracy measures the proportion of sentences whose register is correctly predicted. Besides, when n = 1, which corresponds to keyword extraction, we use keywords instead of 1-grams; this corresponds to the green part of Figure 1. When n = 2, we use 2-grams, which corresponds to the blue part of Figure 1.

3.4. Result and Result Analyses

In Figures 4(a) and 4(b), "number" refers to the cumulative number of features in a certain range and "density" means the number of features in a certain interval. From Figure 4(a), we find that the distribution of keyword scores is mainly concentrated in the interval [0.1, 0.5]. In Figure 4(b), the scores of 2-grams are distributed in the interval [0.0, 0.4]. In Figure 4(c), in the interval [0.3, 1.0], the scores of keywords are greater than those of 2-grams.

Combined with Table 3, we find that in the interval [0.0, 0.1], keywords account for 0.0079 and 2-grams account for 0.2543 of the features, which indicates that about a quarter of the 2-grams have little effect on classification. In the interval [0.0, 0.3], keywords stand at 0.3741 and 2-grams account for 0.6396; this shows that more than half of the 2-grams play only a small role in register classification. In the score interval [0.3, 1.0], the proportions of keywords and 2-grams are 0.6259 and 0.3604, respectively, which further shows that keywords play a more prominent role in classification than 2-grams.

Furthermore, in Figure 4(c), we find that the 2-gram distribution has a peak in the vicinity of the point 1.0, which indicates that some 2-grams with higher scores play a good role in register classification.

Our experiments are divided into three groups, namely, ANN (keywords), ANN (2-grams), and ANN (1,2-grams), whose structures consist of the green and the blue parts of Figure 1. Through these models, we can extract the keywords and 2-grams for the novel, news, and textbook. The experimental results are compared with the baseline, and the results on the training and test sets are shown in Table 2.

3.5. Linguistic Analyses

We mainly analyze the differences among three registers from four aspects, corpus content analyses, lexical richness analyses, keyword analyses, and key 2-gram analyses.

3.5.1. Corpus Content Analyses

To better understand the key n-grams of the different registers, we analyze the content characteristics of the novel, news, and textbook. Their specific statistics are shown in Table 1.
(i) Novel. We choose Mo Yan's and Yu Hua's novels as our collection of novels. Mo Yan is the 2012 Nobel Prize winner, whose novels often use bold and colorful words. The sentences in his novels are rich in style, containing long sentences, compound sentences, and simple sentences, and the author describes things in an unconstrained way. The language of Yu Hua's novels is profoundly influenced by Western philosophical language, and its traits are simplicity, vividness, fluidity, and dynamism.
(ii) News. As a register for recording and reporting facts, the news usually has several characteristics, such as authenticity, timeliness, and correctness. Authenticity means that the content must be accurate. Timeliness means that the content is time limited. Correctness means that the reporting of time, place, and characters must be consistent with the facts.
(iii) Text Book. As a kind of register used to impart knowledge to students, the textbook focuses on training the listening, speaking, reading, and writing skills of students, with the aim of broadening their vision and scope of knowledge. Therefore, there are many kinds of articles in a textbook for students to learn, mainly prose, novels, inspiration, patriotism, ideological and moral education, and other stories.

3.5.2. Lexical Richness Analyses

According to equation (10), on the same-size corpus, the higher the TTR value is, the richer the words are. We calculate the TTR of the novel, news, and text book, respectively:

Since $\mathrm{TTR}_{novel}$ is greater than $\mathrm{TTR}_{textbook}$, which in turn is greater than $\mathrm{TTR}_{news}$, comparatively speaking, the novel has the richest vocabulary, followed by the textbook and then the news.

3.5.3. Keyword Analyses

We analyze the differences among the novel, news, and textbook in terms of the proportions of POS and syllables. The statistical distribution of POS is shown in Figure 5(a), and the distribution of syllable proportions is shown in Figure 5(b). The data in Figures 5(a) and 5(b) are based on the training and test sets. From Figure 5(a), we can rank the POS in each register from high to low as follows:
(i) Novel. NN, VV, NR, AD, JJ, VA, SP, CD, OD, and CS.
(ii) News. NN, VV, CD, NR, AD, JJ, VA, NT, M, and OD.
(iii) Text Book. NN, VV, NR, AD, JJ, VA, CD, and SP.

In Figure 5, we find that the number of nouns (NN) in each register is the highest. To better analyze, we subdivide nouns (NN) into small parts according to their semantic information shown in Table 4.

The specific meanings of the abbreviations in Tables 5–7 are given in Tables 4 and 8. The abbreviations in Table 4 are designed by ourselves, and the contents of Table 8 are from the Penn Chinese Treebank tag set [30]. We analyze the distribution of each POS in the novel, news, and textbook, taking the following POS as examples.
(1) POS-NN. In Figure 5(a), we find that the proportion of nouns (NN) ranks highest in the novel, then in the textbook, and lowest in the news. Combined with Table 5, we find that there are more than 12 kinds of nouns (NN) in the novel, such as NPE, RN, PA, GN, TN, BN, and EN. These nouns (NN) refer to characters, events, time, descriptions, etc., which correspond to the content characteristics described in Section 3.5.1. In the news, these nouns (NN) mainly include NT, NPE, PSC, RN, OR, CN, etc., which are names of major organizations, domain nouns, occupations, time, group organizations, etc., as shown in Table 6. Therefore, we find that the news focuses on a wide range of groups, not individuals. The nouns (NN) in the textbook are names, time, plants, animals, events, natural phenomena, etc.; specific examples of these abbreviations are shown in Table 7. Hence, we find that the textbook focuses on describing people, things, etc.
(2) POS-VV. In Figure 5(a), verbs (VV) are the most frequent in the textbook, followed by the novel, and last by the news. Combining Tables 5–7, we find that the verbs (VV) in the novel are mainly body-related verbs, such as "笑" (laugh), "哭" (cry), "跑" (run), "走" (walk), "跳" (jump), and "唱" (sing). Among them, "说" (say) and "问" (ask) are related to the mouth, "走" (walk) and "跑" (run) are related to the feet, and "抱" (embrace) to the hands; this is related to the characteristics of the novel. In the news, the verbs (VV) are mainly dummy verbs and continuous verbs; for example, "进行" (do) is a dummy verb and "上涨" (go up) is a continuous verb. The news uses these verbs to express its formal and solemn tone. The textbook includes not only body verbs but also personalized verbs; the latter are rich in the textbook because of the wide range of registers selected for it.
(3) POS-CD. Also, in Figure 5, we find that CD is the most frequent in the news, followed by the textbook, and then the novel. As demonstrated in Table 6, there are a lot of numerals in the news. It can be said that precise figures are used in the news to express what is mentioned, rather than vague words such as approximate grade words, e.g., "一半" (half) and "大量" (lots of), which can often be found in the novel and the textbook. In addition, the correctness of the news is also reflected in its use of a large number of numerals.

In Figure 5(b), we find that the distribution of syllabic words in the different registers, ranked from high to low, is as follows:
(i) Novel. 2 syllables, 4 syllables, 3 syllables, monosyllable, multisyllable.
(ii) News. 2 syllables, 4 syllables, monosyllable, 3 syllables, multisyllable.
(iii) Text Book. 2 syllables, 4 syllables, 3 syllables, monosyllable, multisyllable.

We analyze the distribution of syllables in each register, taking the following syllable types as examples.
(1) Monosyllable. In Figure 5(b), monosyllabic words are the most frequent in the novel, followed by the textbook and then the news. As shown in Tables 5–7, most of the monosyllabic words in the novel are body-related words; these verbs are related to specific parts of the body. According to the content of the novel described in Section 3.5.1, this is consistent with the characteristic of the novel that it mainly depicts the specific actions of the characters. Because the textbook contains many novels, there are also more monosyllables in the textbook. With the simplification of Chinese phonetics, homonyms have significantly increased; if monosyllabic words were still widely used in the news, it would inevitably lead to misunderstanding, which hinders the role of language as a tool. Therefore, more precise polysyllabic words are used in the news.
(2) 2 Syllables. In Figure 5(b), we find that disyllabic words are the most frequent in the news, followed by the textbook, and last by the novel. Combined with Tables 5–7, we find that the news uses disyllabic words such as "表决" (vote) and "申明" (instruction), instead of "说" (say) in the novel and textbook, to express a formal and solemn tone. In addition, there are more disyllabic verbs in the news, whereas in the novel and the textbook disyllabic words are mostly nouns (NN), such as "鼻子" (nose) and "眼窝" (eye socket).

3.5.4. Key 2-Gram Analyses

In Figure 6, we can see that the main 2-gram structures of each register, from high to low, are as follows:
(i) Novel. , , , , , , , , and PROVERB (http://library.umac.mo/ebooks/b26028347.pdf).
(ii) News. , , , , , , , , , and .
(iii) Text Book. , , , , , , , and .

Here, denotes a sentence or a phrase ending with SP. Combining Tables 9–11, we analyze the distribution of each 2-gram structure in the different registers, mainly taking the following 2-gram structures as examples.
(1) In Figure 6, we find that the proportion of this structure is the highest in the novel and textbook and the lowest in the news. Examples of the structure in the novel are shown in Table 9. Combined with Section 3.5.1, we find that the novel contains many dialogues; since some novels were selected for the textbook, there are also many conversations in it. Referring to Tables 9 and 11, the structure can be regarded as a description of the action and behavior of an NN or NR, etc. From Section 3.5.3, we know that the verbs (VV) in this structure are body-related verbs, which is consistent with the characteristic of the novel that it mainly describes the actions of the characters. In contrast, in the news there are many dummy verbs, such as "进行" (do), shown in Table 10. The reason for using such dummy verbs in the news is that these verbs are consistent with its serious register. For the textbook, as there are a lot of novels in it, the structure behaves the same as in the novel.
(2) In Figure 6, we find that this structure is the most frequent in the news, then in the novel, and then in the textbook. In conjunction with Table 10, we find that examples of this structure are composed of two disyllabic words, such as "市场经济" (market economy) and "试点阶段" (pilot stage). Wang and Zhang once pointed out that such a structure of "disyllabic words" has pan-temporal characteristics; that is, it is widely used in the news because it can describe things in a more accurate way from a higher angle [31]. This is in line with the characteristics of the news, so a large number of such two-syllable structures are used in it.
(3) In Figure 6, this structure is the most frequent in the textbook, followed by the novel, and last by the news. From the overall content of Table 11, compared with the news, the textbook descriptions are more meticulous, such as "脸色惨白" (pale face), "旧衬衣" (old shirt), "严格的" (strict), and "哪个混蛋" (which bastard), shown in Table 11. Combined with the contents of the textbook described in Section 3.5.1, this meticulous kind of description is a trait of the textbook, which can better help students learn and improve their writing ability. Besides, as a formal written language, the news is simple and serious, while the language of the novel and textbook is more casual and flexible; therefore, in the dialogues of the novel and the textbook, this kind of structure often appears and the noun of the structure is frequently omitted.

3.6. Cluster Verification

To verify the effect of the extracted keywords and 2-grams, we use the t-SNE [32] method to cluster them. The input to t-SNE is the n-gram vectors trained by the attention network, which are high-dimensional vectors. Compared with other methods, t-SNE distinguishes high-dimensional data very well and has a good visualization effect. The clustering result is shown in Figure 7; here, we only show several main keywords and 2-grams.
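This step can be sketched as follows with scikit-learn's t-SNE; the random vectors and labels are stand-ins for the n-gram vectors and registers learned by the attention network.

```python
# A sketch of projecting the learned n-gram vectors to 2D with t-SNE;
# the data here are placeholders for the trained vectors.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ngram_vectors = np.random.randn(300, 64)        # stand-in for trained n-gram vectors
registers = np.random.randint(0, 3, size=300)   # 0 = novel, 1 = news, 2 = textbook

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(ngram_vectors)
plt.scatter(coords[:, 0], coords[:, 1], c=registers, s=8)
plt.title("t-SNE of key n-gram vectors")
plt.show()
```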

From the results of keyword clustering in Figure 7(a) and 2-gram clustering in Figure 7(b), we find that the effect of keyword clustering is better than that of 2-gram clustering, which is consistent with our conclusion in Section 4. In Figure 7, we find that the news is more concentrated and the core is centered on “经济” (economy). The novel can be divided into two groups, because our novel corpus consists of works written by two authors, Mo Yan and Yu Hua, as indicated by the red circle in Figure 7(a); each red circle represents one class. The text book is more scattered. This is because the theme of the text book corpus is diversified, so the clustering of the text book contains several patches. Next, we analyze it from the left to the right and from the bottom to the top.

From the left side of Figure 7(a), the names of these protagonists are mainly found in Mo Yan’s novels. The reason is that Mo Yan’s novels focus on the anti-Japanese war and the Cultural Revolution. The right side of the novel shows the protagonists of Yu Hua’s works. Yu Hua’s works mainly focus on the era after the reform and opening up of China; therefore, Yu Hua’s work is closer to the news. For the text book, we found that the clustering algorithm groups “鲁迅” (LuXun)’s work together and “孔子” (Confucius) and “庄子” (Zhuangzi) are also grouped together, as well as works related to “罗密欧” (Romeo) and “朱丽叶” (Juliet). On the one hand, we can see that “余华” (Yu Hua) and “鲁迅” (LuXun)’s collections are relatively close, which shows that Yu Hua and “鲁迅” (LuXun)’s writing registers are relatively similar. We found that after graduating from high school in 1977, “余华” (Yu Hua) entered the Beijing “鲁迅” (LuXun) College of Literature for further study and he might, therefore, have been influenced by “鲁迅” (LuXun)’s writing register during his studies. On the other hand, the text book is similar to the news, especially “列宁” (Lenin), “梅兰芳” (Mei Lanfang) etc. We found that these keywords are related to the theme of patriotism and thus are closer to the news. From the clustering results above, word vectors trained by our model have a significant effect.

Unlike keyword clustering, the novel and text book are not well distinguished in 2-gram clustering and the clustering results of the novel and text book are relatively discrete. The reason is that there are dialogues in both the novel and text book, especially the structures such as “NN + VV” and “” shown in Figure 6, which leads to a poor distinction between fiction and textbooks. There are few dialogues in the news, and the structures of “NN + NN” and “VV + VV” shown in Figure 6 for the news are very significant.

In fact, in our paper, the attention network has two functions. One is to extract -gram keywords that can distinguish the novel, news, and text book; the other is to obtain the vectorization of each -gram by training the attention network.

4. Conclusion and Future Work

We propose an attentive n-gram network (ANN) model for key n-gram extraction. Our model makes full use of the spatial semantic information of words, and the attention mechanism scores each n-gram in the sentence. As the accuracy of the trained model increases, the attention mechanism scores each word more accurately. In the experiments, the classification accuracy of our model is significantly higher than the baseline accuracy. In particular, our model is not limited to 1,2-grams but is also applicable to n-grams with n = 3, 4, 5, and 6. In the future, we will conduct further explorations in the following two directions:
(i) We will further explore the factors that influence the attention mechanism, such as the length of sentences and the number of occurrences of keywords, to improve the analyses of the characteristics of each register.
(ii) We will also extend from phrase structures to sentences and paragraphs to explore registers. In this way, we can study registers more comprehensively, from keywords, phrases, and phrase structures to sentences and paragraphs.

Data Availability

Part of the data used in this article is publicly available, such as the news corpus (https://www.sogou.com/labs/resource/list_yuliao.php), while the novel and textbook data are protected.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Tsinghua University Humanities and Social Sciences Revitalization Project (2019THZWJC38), the Project of Baidu Netcom Technology Co. Ltd. Open Source Course and Case Construction Based on the Deep Learning Framework PaddlePaddle (20202000291), the Distributed Secure Estimation of Multi-sensor Systems Subject to Stealthy Attacks (62073284), the Multi-sensor-based Estimation Theory and Algorithms with Clustering Hierarchical Structures (61603331), and the National Key Research and Development Program of China (2019YFB1406300).