Research Article

Research on Application of Intelligent Corpus Annotation of Entity Extraction with Construction of Knowledge Graph

Algorithm 1

Iterative entity extraction model combining bootstrapping with K-fold joint voting.
Input: manually annotated training set A; large preprocessed unlabeled corpus B; unlabeled subset B′ drawn from B in each round; prediction probability threshold Th; number of cross-validation folds K.
Process:
(1) Split training set A into training set A1 and test set S at a 9:1 ratio;
(2) Partition training set A1 into K folds d1, d2, d3, …, dK for K-fold cross-validation; in each pass, reserve one fold as the held-out test fold and train on the remaining K-1 folds, yielding models R1, R2, R3, …, RK;
(3) Compute the test errors ε1, ε2, ε3, …, εK of models R1, R2, R3, …, RK, respectively, and average them to obtain the cross-validation error ε; run the final quality evaluation on test set S to obtain the test error ε′; repeat for n rounds to ensure that both the cross-validation error ε and the test error ε′ are below the threshold;
(4) Randomly draw a small subset B′ from the large unlabeled corpus B;
(5) Combine the submodels R1, R2, R3, …, RK into an equally weighted joint-voting model and label the randomly selected data B′ by voting. When the agreement rate among the pseudo-label predictions of the K models is greater than 80%, the labels are treated as positive labels: the labeled results are added to training set A and deleted from B, which completes this round of iterative learning;
(6) Repeat steps (1)-(5) until the unlabeled set B is empty or the maximum number of iterations N is reached, at which point training is complete.
Output: the expanded training set A′ and the joint-voting named entity recognition model R.
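
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (k_fold_models, joint_vote, bootstrap), the batch_size parameter, and the pluggable train callback are all hypothetical, and any classifier that maps an input to a label can stand in for the entity recognition model. The agreement parameter corresponds to the 80% consistency rate in step (5). The retraining of step (3) against an error threshold is omitted; the sketch only records ε and ε′ for each round.

```python
import random
from collections import Counter


def k_fold_models(train_set, k, train):
    """Step (2): train one model per fold, holding that fold out for testing."""
    folds = [train_set[i::k] for i in range(k)]
    models, errors = [], []
    for i, held_out in enumerate(folds):
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(rest)
        wrong = sum(model(x) != y for x, y in held_out)
        models.append(model)
        errors.append(wrong / max(len(held_out), 1))
    return models, errors


def joint_vote(models, x):
    """Step (5): equal-weight vote; returns (majority label, agreement rate)."""
    votes = Counter(m(x) for m in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


def bootstrap(labeled, unlabeled, train, k=5, batch_size=10,
              agreement=0.8, max_iters=10, seed=0):
    """Grow the labeled set A with pseudo-labels drawn from the pool B."""
    rng = random.Random(seed)
    labeled, unlabeled = list(labeled), list(unlabeled)
    models, history = [], []
    for _ in range(max_iters):                    # step (6): at most N rounds
        rng.shuffle(labeled)
        cut = max(1, len(labeled) // 10)          # step (1): 9:1 split
        test_set, a1 = labeled[:cut], labeled[cut:]
        models, fold_errors = k_fold_models(a1, k, train)
        # Step (3): cross-validation error and error of the vote on S.
        cv_error = sum(fold_errors) / len(fold_errors)
        test_error = sum(joint_vote(models, x)[0] != y
                         for x, y in test_set) / len(test_set)
        history.append((cv_error, test_error))
        if not unlabeled:                         # step (6): B is empty
            break
        # Step (4): draw a small random subset B' from B.
        batch = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
        for x in batch:
            label, rate = joint_vote(models, x)
            if rate > agreement:                  # step (5): accept pseudo-label
                labeled.append((x, label))
                unlabeled.remove(x)
    return labeled, unlabeled, models, history
```

In this sketch, samples whose agreement rate does not exceed the threshold stay in B and may be drawn again in a later round, so the loop can end at the iteration limit with B still non-empty.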