Abstract

Extracting information about academic activity transactions from unstructured documents is a key problem in analyzing the academic behavior of researchers. An academic activity transaction comprises five elements: person, activity, object, attribute, and time phrase. Traditional information extraction first extracts shallow text features and then recognizes advanced features from the text with supervision. Because the different levels of information are processed in separate steps, the errors generated at each step accumulate and reduce the accuracy of the final results. The Deep Belief Network (DBN) model, by contrast, can learn advanced features from shallow text features automatically and without supervision, so we employ it to extract academic activity transactions. In addition, we use character-based features to describe the raw features of academic activity named entities, so as to improve the accuracy of named entity recognition. We compare the extraction accuracy obtained with character-based feature vectors and with word-based feature vectors as the text representation, and we also compare the DBN model with traditional text information extraction based on Conditional Random Fields (CRF). The results show that the DBN model is more effective for extracting academic activity transaction information.

1. Introduction

Academic activities are an important part of a scholar's participation in social activities. Generally, a scholar's academic activity experience is reflected mainly in the following aspects: education experience, research experience, paper publication, and project cooperation. An academic relationship network can be established from these academic activities. They usually appear in information documents such as scholars' personal pages, applications for science and technology projects, and researchers' resumes, in which education experience and research experience are essential parts. In this paper, we use an efficient method to extract a scholar's academic activity records from such documents.

An academic activity record should have the following elements: a person name denotes the subject of the transaction; an activity or behavior describes the transaction action and its properties; a behavior object gives the place or organization of the transaction; and the record further includes an academic attribute and the time phrase of the transaction. In other words, extracting an academic transaction essentially means identifying the named entities of person, activity, time, place, and academic attribute in academic texts. We use these five named entities to describe an academic activity in structured form, so that records of academic activities can be extracted automatically. In early studies, the popular approach was to segment the original documents into Chinese words first and then apply different strategies to identify the named entities of each category step by step. This approach is inefficient, and it also amplifies the error introduced at each extraction step. Therefore, we propose using a DBN (Deep Belief Network) to recognize and extract the various elements of academic activities in a single pass.

DBN is a deep neural network learning model. Its hidden layers have excellent feature learning ability and can learn deep, abstract features from the shallow features of the raw data. Meanwhile, DBN adopts layer-by-layer initialization; that is to say, each layer learns its parameters independently, which reduces the difficulty of training the neural network [1]. A DBN is composed of multiple unsupervised RBM (Restricted Boltzmann Machine) layers and one BP (backpropagation) layer. High-level features are extracted from the raw text data through unsupervised learning by the hidden-layer neurons of the RBMs. Therefore, this paper proposes using the DBN method to extract records of academic activities: it can automatically recognize the five main elements of academic activities in the raw document and thus extract academic activity records efficiently and accurately.

The key to extracting academic activity information from unstructured text is to detect and recognize the elements in the academic activity text, which is an important research topic in NLP (Natural Language Processing) [2, 3]. At present, the traditional methods used for named entity recognition are mainly ME (Maximum Entropy) [4], CRF (Conditional Random Fields) [5-7], and kernel functions [8-10]. These methods require considerable manual effort in text feature extraction and need multistep processing. The error of each stage is passed on to the next stage, which reduces the accuracy of entity recognition.

The concept of deep learning was first proposed by Professor Hinton in 2006 [11]. Deep learning methods have achieved good results in image recognition and speech recognition [12-14]. In recent years, deep learning has been widely applied to NLP, where researchers hope to use its excellent learning ability to recognize abstract features from raw features and thereby improve the recognition and extraction of document information. In 2011, Collobert et al. designed a unified deep neural network model based on the principle that the features of a word can be learned from the features of its surrounding words [15]. The model uses unsupervised learning to obtain a word-based feature vector representation and has achieved good results in semantic role labeling and named entity recognition. In 2013, building on Collobert's study, Mikolov et al. proposed the Continuous Bag-of-Words (CBOW) model, which predicts the center word from the surrounding words, and the Skip-Gram model, which predicts the surrounding words from the center word [16]. These two models can effectively extract word vectors from text and represent semantic relations well.

At present, the deep learning models commonly used in NLP include DBN, Auto-Encoder, and LSTM (Long Short-Term Memory) [3]. The DBN model constructs an energy function and uses the hidden-layer structure of the RBM to learn advanced text features. In 2014, Chen used the DBN model for entity recognition on the ACE2004 corpus and achieved the best result in a comparison with CRF, SVM, and BP [17]. In 2016, Feng et al. used word vectors as the input of the DBN and achieved 89.58% in entity recognition on the People's Daily corpus [18]. Meanwhile, Jiang et al. applied word vector features to the DBN model for text categorization on Reuters-21578 and 20-Newsgroup and achieved better text classification than SVM and KNN [19].

The Auto-Encoder trains the parameters between input and output by minimizing the reconstruction error and learns deep text features by encoding and decoding the input information. Wang used an Auto-Encoder to identify person entities and place entities in the People's Daily corpus and obtained 97.55% accuracy [20]. In 2016, Leng and Jiang used an Auto-Encoder model to extract enterprise cooperation relationships, product relationships, and enterprise demand relationships [21]. LSTM is a recurrent neural network model that can memorize the sequence features of text and selects context features through its recurrent hidden layer. In 2016, Zheng et al. applied word vector features to an LSTM deep learning model and obtained 84.8% in recognition experiments on the SemEval-2010 Task 8 corpus [22].

Owing to the powerful feature expression of deep learning, many scholars are engaged in deep learning research in NLP. Character-based vectors and word-based vectors are the two text feature representations commonly used in text feature extraction, and their effect on the accuracy of named entity recognition differs across corpora and deep learning models. Whether character-based or word-based feature vectors should be used as the input of a deep learning model for a given text feature recognition task remains an active research question.

3. Category of Academic Activities Named Entity

Generally, academic activity transaction information is fully reflected in applications for science and technology projects. In these proposals, the applicant's resume is an essential part of the applicant information. We use descriptions of education experience and research experience as examples to illustrate the elements of academic activities that may appear in an applicant's resume.

Example 1. “张, 1992 年毕业于湖南医科大学医疗系 (Zhang graduated from the Medical Department of Hunan Medical University in 1992)”.
Person: “张 (Zhang)”
Activity or behavior: “毕业 (graduate)”
Temporal phrase: “1992 年 (1992)”
Behavioral object: “湖南医科大学 (Hunan Medical University)”
Academic attribute: “医疗系 (Medical Department)”

Example 2. “李某, 1994 年至今, 湖南医科大学肿瘤研究所工作 (Li works at the Institute of Oncology, Hunan Medical University from 1994 to present)”.
Person: “李某 (Li)”
Activity or behavior: “工作 (work)”
Temporal phrase: “1994 年至今 (1994 to present)”
Behavioral object: “湖南医科大学肿瘤研究所 (Institute of Oncology, Hunan Medical University)”

Based on the analysis of a large number of resumes in scientific and technological documents, we classified the five key elements in transaction extraction into five different types of academic activity named entities, that is, person, organization, academic activities, temporal phrase, and academic terms.
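As an illustration only, the five entity types can be held in a small record structure. The class name `AcademicActivityRecord` and its field names are hypothetical and do not come from the paper; the values are the English glosses of the two examples, and the academic term is optional because Example 2 does not contain one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcademicActivityRecord:
    """One extracted academic activity (hypothetical structure, for illustration)."""
    person: str                           # subject of the transaction
    activity: str                         # activity or behavior
    temporal_phrase: str                  # time phrase of the transaction
    organization: str                     # behavioral object (place or organization)
    academic_term: Optional[str] = None   # academic attribute, if present


# Example 1 and Example 2, using the English glosses given in the text.
example_1 = AcademicActivityRecord(
    person="Zhang",
    activity="graduate",
    temporal_phrase="1992",
    organization="Hunan Medical University",
    academic_term="Medical Department",
)
example_2 = AcademicActivityRecord(
    person="Li",
    activity="work",
    temporal_phrase="1994 to present",
    organization="Institute of Oncology, Hunan Medical University",
)
```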

4. Academic Activities Transaction Extraction

4.1. Transaction Extraction Process

The information extraction process of academic activities transaction is divided into the following steps: text preprocessing, character-based vector representation, and academic activity entity recognition; the specific process is shown in Figure 1.

First, we extract the academic activity paragraphs from scientific and technological documents and use ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) to segment the text into words. Then, character-based vector features are built from the segmented words, and some of them are manually labeled with the different types of academic activity named entities, which gives us the training sets and test sets for the DBN model. Finally, the trained DBN model extracts the advanced features through unsupervised learning and thereby recognizes the academic activity named entities.

4.2. Character-Based Feature Vector Representation

The character-based feature vector is a common representation of text features in NLP. Because the character is a finer-grained unit than the Chinese word, a character-based vector can better describe the raw features of named entities and can therefore effectively improve the accuracy of named entity recognition.

Character-based vector extraction consists of three steps. First, all entities in the training corpus are labeled to constitute the entity set $E = \{e_1, e_2, \ldots, e_n\}$, where $e_i$ represents the $i$th named entity in the training corpus and $n$ is the total number of entities. Then, taking the characters that appear in the entity set and removing the repeated ones, we obtain the character set $C = \{c_1, c_2, \ldots, c_m\}$, where $m$ is the total number of characters in $C$, that is, the length of the character-based feature vector. Finally, according to expression (1), each $e_i$ in the entity set can be transformed into a character-based vector $V_i$ with the same dimension as $C$:

$$V_i = (v_1, v_2, \ldots, v_m), \qquad v_j = \begin{cases} 1, & c_j \in e_i \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $v_j$ represents the $j$th element of the vector; when $c_j \in e_i$, the value is 1; otherwise the value is 0.

For example, suppose that the given named entity set is as follows: “张 (Zhang)”, “中南大学 (Central South University)”, “计算机科学与技术 (Computer Science and Technology)”, “中南大学铁道学院 (Central South University Railway Institute)”.

We can get the character set as follows:

“张中南大学计算机科学与技术铁道学院 (Zhang, Central South University, Computer Science and Technology, Railway Institute)”.

The length of the set is 18, and each of the entities “张”, “中南大学”, “计算机科学与技术”, and “中南大学铁道学院” can be transformed into an 18-dimensional character-based vector according to expression (1).

Through the above steps, each entity feature is transformed into a character-based feature vector with the same dimension as the character set.
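The three steps can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `build_character_set` and `to_character_vector` are hypothetical, the entity list is a reduced toy example, and the sketch removes every repeated character, so the dimension it produces depends on that choice of deduplication.

```python
def build_character_set(entities):
    """Collect the distinct characters of all entities in first-seen order."""
    character_set = []
    for entity in entities:
        for ch in entity:
            if ch not in character_set:
                character_set.append(ch)
    return character_set


def to_character_vector(entity, character_set):
    """Binary vector: element j is 1 if character j occurs in the entity."""
    present = set(entity)
    return [1 if ch in present else 0 for ch in character_set]


# Toy entity set (an assumption for illustration, smaller than the one in the text).
entities = ["中南大学", "铁道学院", "中南大学铁道学院"]
character_set = build_character_set(entities)
vectors = {e: to_character_vector(e, character_set) for e in entities}

for entity, vector in vectors.items():
    print(entity, vector)
```

Every vector has the same length as the character set, so entities of different lengths share one input dimension. The representation ignores character order, which is the trade-off of this binary encoding.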

4.3. Deep Belief Network Model
4.3.1. Restricted Boltzmann Machine

DBN is a deep network model with multiple RBM (Restricted Boltzmann Machine) layers and a BP layer. An RBM is a two-layer undirected graphical model that contains a visible layer $v$ and a hidden layer $h$. Every visible node is connected to every hidden node, and nodes within the same layer are not connected with each other. The visible layer is used for data input, and the hidden layer is employed to extract the hidden features. The network structure is shown in Figure 2.

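The parameters are described as follows.

$n$ and $m$ denote the number of neurons contained in the visible and hidden layers, respectively; $v = (v_1, v_2, \ldots, v_n)$ denotes the state vector of the visible layer; $h = (h_1, h_2, \ldots, h_m)$ is the state vector of the hidden layer; $a = (a_1, a_2, \ldots, a_n)$ is the bias vector of the visible layer; $b = (b_1, b_2, \ldots, b_m)$ is the bias vector of the hidden layer; and $W = (w_{ij})$ is the $n \times m$ weight matrix between the visible and hidden layers.

The RBM model is an energy-based model. For a given state $(v, h)$ of the visible and hidden layers, the energy function can be defined as

$$E(v, h) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i w_{ij} h_j. \tag{2}$$

In vector form, it can be written as

$$E(v, h) = -a^{T} v - b^{T} h - v^{T} W h. \tag{3}$$

Based on the energy function in (2), the joint probability distribution of the visible and hidden layers can be obtained:

$$P(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad Z = \sum_{v, h} e^{-E(v, h)}, \tag{4}$$

where $Z$ denotes the normalization factor.

The activation probability of the hidden layer node $h_j$ can be obtained from the known visible layer $v$:

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^{n} v_i w_{ij}\Big), \tag{5}$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic sigmoid function.

Similarly, since the RBM is a symmetric network, the values of the hidden layer nodes can be used to reconstruct the visible layer node $v_i$:

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j=1}^{m} w_{ij} h_j\Big). \tag{6}$$

The purpose of RBM training is to obtain the output of the hidden layer $h$ for a given visible layer $v$. Through training, we obtain the optimal parameters $\theta = \{W, a, b\}$ that maximize the joint probability $P(v, h)$. The hidden layer expresses the shallow text features as deep features, and it can be interpreted as a reconstruction of the visible layer in a different space. The aim of training is to reconstruct the visible layer according to expression (6) so that the error between the original visible layer and the reconstructed visible layer is minimized.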
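The training objective described above can be sketched with one-step contrastive divergence (CD-1), a common way to train an RBM. The paper does not state which update rule its Theano implementation uses, so CD-1, the class name `RBM`, the learning rate, the layer sizes, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    """Binary RBM trained with CD-1 (illustrative sketch, not the authors' code)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # weights
        self.a = np.zeros(n_visible)  # visible bias
        self.b = np.zeros(n_hidden)   # hidden bias

    def hidden_prob(self, v):
        # P(h_j = 1 | v): sigmoid of hidden bias plus weighted visible input.
        return sigmoid(self.b + v @ self.W)

    def visible_prob(self, h):
        # P(v_i = 1 | h): sigmoid of visible bias plus weighted hidden input.
        return sigmoid(self.a + h @ self.W.T)

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 update on a batch; returns the mean reconstruction error."""
        ph0 = self.hidden_prob(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)  # sample hidden
        v1 = self.visible_prob(h0)                             # reconstruction
        ph1 = self.hidden_prob(v1)
        batch = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        self.a += lr * (v0 - v1).mean(axis=0)
        self.b += lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))


rng = np.random.default_rng(0)
# Toy binary data: two patterns repeated ten times each.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
rbm = RBM(n_visible=6, n_hidden=4, rng=rng)
errors = [rbm.cd1_step(data) for _ in range(200)]
print("first and last reconstruction error:", errors[0], errors[-1])
```

The returned value is the squared difference between the input and its reconstruction, which corresponds to the reconstruction error that training aims to reduce.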

4.3.2. Deep Belief Network

The structure of the DBN model is shown in Figure 3. It is a deep learning model that contains multiple RBM layers and a BP layer. The model can learn advanced features from the shallow features of text on its own and has powerful classification ability for high-dimensional sparse feature vectors [11].

The training process of the DBN is divided into two steps. The first step is the unsupervised training of each RBM layer. The first RBM layer is composed of the raw input feature vector $v$ and the first hidden layer $h_1$; by training this layer we obtain its optimal parameters $\theta_1 = \{W_1, a_1, b_1\}$ (its weight matrix and bias vectors). After the first RBM layer is trained, its output is used as the input of the second RBM layer, and the remaining layers are trained in the same way until the whole RBM stack has been trained without supervision. The second step is the supervised training of the BP network: the error generated at the BP layer is backpropagated to all RBM layers to fine-tune the whole model and ultimately obtain the optimal parameters of the DBN.
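The first, unsupervised step can be sketched as greedy layer-wise pretraining, in which each trained layer's hidden activations become the next layer's input. The sketch below assumes CD-1 updates and uses hypothetical names (`train_rbm`, `pretrain_dbn`) and toy sizes; the supervised BP fine-tuning step is deliberately left out.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Train one RBM with CD-1 (an assumed update rule); returns W, a, b."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    a = np.zeros(n_visible)  # visible bias
    b = np.zeros(n_hidden)   # hidden bias
    for _ in range(epochs):
        ph0 = sigmoid(b + data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(a + h0 @ W.T)
        ph1 = sigmoid(b + v1 @ W)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / data.shape[0]
        a += lr * (data - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b


def pretrain_dbn(data, layer_sizes, rng):
    """Greedy layer-wise pretraining: train each RBM on the previous output."""
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, _a, b = train_rbm(layer_input, n_hidden, rng)
        layers.append((W, b))
        # Hidden activation probabilities feed the next RBM layer.
        layer_input = sigmoid(b + layer_input @ W)
    return layers, layer_input


rng = np.random.default_rng(0)
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
layers, features = pretrain_dbn(data, layer_sizes=[5, 3], rng=rng)
print([W.shape for W, _ in layers], features.shape)
```

In the second step, the pretrained weights would initialize a feed-forward network whose final BP layer is trained on the labeled entities, with the error propagated back through all layers.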

5. Experiment Analysis

In this paper, DBN is applied to extract transaction information from Chinese documents. We extract features as character-based vectors and as word-based vectors and compare how well these two representations describe the text features. Character-based vectors are then used as the input of the DBN model in tenfold cross validation experiments, and the results are compared with those of the CRF in [6]. The DBN algorithm is implemented with the Python deep learning framework Theano, and all code is written in Python and run under Windows 7. The training corpus is derived from the applicants' resumes included in research proposals; we can get 29515 entities after the word segmentation.

5.1. Named Entity Tagging

Because word segmentation software divides words at different granularities, a complete entity may be split into several parts after segmentation. For example, “中南大学铁道学院 (Central South University Railway Institute)” will be converted to “中南大学 (Central South University)/铁道 (Railway)/学院 (Institute).” In this paper, the output label is formed by combining a BMU (beginning, middle, and unit) position tag with the entity type X. For example, U_ORG indicates that the current word by itself is an ORG entity, B_ORG denotes that the current word is the beginning of an ORG entity, and M_ORG indicates that it is a middle part of an ORG entity, so the organization entity “中南大学/铁道/学院” is labeled B_ORG, M_ORG, and M_ORG. The tags for the other entity types are obtained in the same way, as shown in Table 1.
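The labeling rule can be sketched in a few lines. The function name `bmu_tags` is hypothetical, and PER is used here only as an illustrative type name for a single-word entity; the actual tag set is the one listed in Table 1.

```python
def bmu_tags(segments, entity_type):
    """Assign BMU position tags to the segmented words of one entity."""
    if not segments:
        return []
    if len(segments) == 1:
        # A single word that forms the whole entity gets the U tag.
        return [f"U_{entity_type}"]
    # First word is the beginning; every following word is a middle part.
    return [f"B_{entity_type}"] + [f"M_{entity_type}"] * (len(segments) - 1)


print(bmu_tags(["中南大学", "铁道", "学院"], "ORG"))
print(bmu_tags(["李某"], "PER"))
```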

5.2. Comparative Tests of Character-Based Vector and Word-Based Vector

In this paper, character-based vectors and word-based vectors are each used to represent the feature data of the training text. Following the procedure in Section 4.2, we implement the text feature extraction in Python and build the character set of all named entities, which contains 1467 characters. In other words, the dimension of the entity feature vector is 1467. For word-based feature extraction, we use word2vec, an open-source toolkit developed by Google in 2013. With the dimension of the word-based vector of each segmented entity set to 100 and the context window size set to 2, we obtain a word vector of 500 dimensions from the word2vec tool.
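The 500-dimensional figure is consistent with concatenating the 100-dimensional vector of the current word with those of its two neighbors on each side, since $(2 \times 2 + 1) \times 100 = 500$. The paper does not spell out this construction, so the sketch below is an assumption; the function name `window_features`, the zero padding at sentence boundaries, and the random stand-in vectors (in place of real word2vec output) are hypothetical.

```python
import numpy as np


def window_features(word_vectors, window=2):
    """Concatenate each word vector with its neighbors inside the window.

    word_vectors: array of shape (n_words, dim), one row per segmented word.
    Returns an array of shape (n_words, (2 * window + 1) * dim).
    """
    n_words, dim = word_vectors.shape
    pad = np.zeros((window, dim))  # assumed zero padding at the boundaries
    padded = np.vstack([pad, word_vectors, pad])
    # Slice k holds, for every word, the vector at relative offset k - window.
    return np.hstack([padded[k:k + n_words] for k in range(2 * window + 1)])


rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(7, 100))  # stand-in for 7 word2vec vectors
features = window_features(word_vectors, window=2)
print(features.shape)
```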

During the training of the DBN model, the pretraining learning rate, the fine-tuning learning rate, the number of pretraining iterations, and the number of training iterations are held fixed. Table 2 shows the results obtained with character-based vectors and with word-based vectors as the input of the DBN model.

The results in Table 2 show that the character-based vector performs better than the word-based vector in both the shallow and the deep DBN model. The character-based vector is built from a dictionary of the characters that occur in the entities. The text features are reflected in a high-dimensional vector, and the characteristics of each entity can be expressed through this high-dimensional character-based vector without introducing much noise. The word-based vector, in contrast, depends more on the adjacent words, because it is designed to reflect the similarity between words, and descriptions of academic experience contain a large number of stop words and irrelevant words. Feeding such noisy features into the DBN model reduces its named entity recognition accuracy; therefore, the word-based vector is less effective than the character-based vector.

5.3. DBN Academic Activity Named Entity Recognition

The 29915 entities with word segmentation tagging are divided into training sets and test sets. With the training parameters fixed, we use the character-based feature vector to carry out tenfold cross validation experiments and obtain the average accuracy $P$, recall rate $R$, and $F$ value shown in Table 3.
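The three measures can be computed per label as sketched below. The paper's exact definitions are not reproduced in this text, so the sketch assumes the standard token-level precision and recall and takes the $F$ value to be their harmonic mean (F1); the function name `precision_recall_f` and the toy label sequences are hypothetical.

```python
def precision_recall_f(gold, predicted, label):
    """Token-level precision, recall, and F1 for one label (assumed definitions)."""
    true_positive = sum(
        1 for g, p in zip(gold, predicted) if g == label and p == label
    )
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    precision = true_positive / predicted_count if predicted_count else 0.0
    recall = true_positive / gold_count if gold_count else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Toy sequences for illustration only; they are not data from the experiments.
gold = ["U_PER", "B_ORG", "M_ORG", "U_TIME", "U_PER"]
predicted = ["U_PER", "B_ORG", "U_ORG", "U_TIME", "B_ORG"]
p, r, f = precision_recall_f(gold, predicted, "B_ORG")
print(p, r, f)
```

In a tenfold setting these values would be computed on each held-out fold and then averaged.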

The results in Table 3 show that, with the character-based vector representing the raw entity features, the DBN model can learn the character features of entities and effectively reduce the interference of noisy data caused by Chinese segmentation errors. The model therefore achieves good accuracy for the various types of entities. Furthermore, because the model uses the character-based vector as input, proper nouns that are segmented inaccurately do not need to be preprocessed, which reduces the manual workload of word segmentation. To analyze stability, Figure 4 shows the accuracy curves of the various entity types across the tenfold cross validation experiments.

As shown in Figure 4, the accuracy for temporal phrase entities, academic activity entities, and academic term entities is relatively stable, owing to the simple syntactic structure of academic experience descriptions and the simple character features that make up these three types of entities. The DBN model learns the advanced text features of these entities relatively easily, which leads to better results. The accuracy for person entities fluctuates greatly, and the minimum accuracy, in Experiment 9, is 67%. Because the features of person entities are very rich and their combinations are relatively complex, the recognition accuracy for person entities is lower than for other entities. In addition, because most organization entities are composed of long words, the errors that the segmentation tool makes on organization entities of different granularity affect the accuracy; therefore, the accuracy curve shows a certain fluctuation. Table 4 presents the overall accuracy $P$, recall $R$, and $F$ values of named entity recognition for academic activities and compares them with the results of the CRF used in [6].

Table 4 shows that, compared with the CRF model, the DBN model obtains higher accuracy and a higher $F$ score in the recognition of academic activities. Because the CRF uses word-based vector features for sequence tagging, it is less effective than character-based vector features in academic entity feature extraction. In addition, the accuracy of the CRF model depends heavily on the design of the feature templates used to extract the advanced features of text. In most cases, the feature templates are prone to bias, which reduces the accuracy of entity recognition. In short, the DBN model combines the fine-grained feature representation of character-based vectors with hidden-layer neurons that extract hierarchical advanced text features; its accuracy is therefore higher, which verifies the effectiveness of the DBN model for extracting academic activity transaction information.

6. Conclusion

In this paper, DBN is employed to extract information about academic activities, a process that amounts to extracting advanced text feature information. In this process, manual feature extraction is greatly reduced, and the model automatically recognizes advanced features from the shallow features of text. Moreover, the model does not need a large number of preset dictionaries in the word segmentation stage and does not require special preprocessing such as regular expression matching. Where word segmentation is inaccurate, the DBN model can still recognize the entity boundary, so that all kinds of named entities are recognized in a single pass. Compared with the CRF, the method achieves better accuracy and performance.

The results also indicate that the method adapts well to the science and technology information document corpus and can be applied to large-scale text processing. In the DBN training process, however, the number of neurons and the learning parameters were set through exploratory experiments. How to select the optimal parameters so as to reduce the training time needs further study. The focus of our next study is to classify and recognize the relationships between academic entities.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by Science and Technology Plan of Hunan Province Project 2016JC2011 and National Natural Science Foundation of China Project 61073105.