Mathematical Problems in Engineering

Research Article

Research on Application of Intelligent Corpus Annotation of Entity Extraction with Construction of Knowledge Graph

Algorithm 1

Iterative entity extraction model combining bootstrapping and K-fold joint voting.

	Input: manually annotated training set A; large preprocessed unlabeled corpus B; unlabeled data B′ sampled in each round; prediction probability threshold Th; number of cross-validation folds K.
	Process:
(1)	Split training set A into training set A₁ and test set S in a 9 : 1 ratio;
(2)	Partition A₁ into K folds d₁, d₂, d₃,…,d_K for K-fold cross-validation; in each round, hold out one fold as the test set and train on the remaining K-1 folds, yielding models R₁, R₂, R₃,…,R_K;
(3)	Compute the test errors ε₁, ε₂, ε₃,…,ε_K of models R₁, R₂, R₃,…,R_K on their respective held-out folds and average them to obtain the cross-validation error ε; then evaluate on test set S to obtain the test error ε′, repeating for n rounds to ensure that both ε and ε′ fall below the threshold;
(4)	Randomly sample a small subset B′ from the large unlabeled corpus B;
(5)	Combine the submodels R₁, R₂, R₃,…,R_K into an equally weighted joint-voting model and tag the sampled data B′ by voting. When the agreement rate among the pseudolabel predictions of the K models exceeds 80%, the tags are accepted as positive tags, the labeled results are added to training set A and removed from B, and this round of iterative learning is complete.
(6)	Repeat steps (1)-(5) until the unlabeled set B is empty or the maximum number of iterations N is reached, at which point training is complete.
	Output: the continuously expanded training set A′ and the joint-voting named entity recognition model R.
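
The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: `train_toy_model` is a hypothetical stand-in for the actual entity recognition model (it only memorises the most frequent tag per token), and the names `k_fold_splits`, `error_rate`, `joint_vote`, `bootstrap`, `batch_size`, and `min_agreement` are assumptions introduced for illustration. The error-threshold check of step (3) and the probability threshold Th are omitted; only the 9 : 1 split, the K-fold training, and the strict "greater than 80%" agreement rule of step (5) are shown.

```python
import random
from collections import Counter

def k_fold_splits(data, k):
    """Yield (train, held_out) pairs for K-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def train_toy_model(train):
    """Hypothetical stand-in for an NER model: most frequent tag per token."""
    counts = {}
    for token, tag in train:
        counts.setdefault(token, Counter())[tag] += 1
    lookup = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    return lambda token: lookup.get(token, "O")

def error_rate(model, examples):
    """Fraction of examples whose predicted tag differs from the gold tag."""
    if not examples:
        return 0.0
    return sum(model(tok) != tag for tok, tag in examples) / len(examples)

def joint_vote(models, token, min_agreement=0.8):
    """Equal-weight vote; accept the majority tag only if agreement > 80%."""
    votes = Counter(m(token) for m in models)
    tag, n = votes.most_common(1)[0]
    return tag if n / len(models) > min_agreement else None

def bootstrap(A, B, k=5, batch_size=2, max_iters=10, seed=0):
    """Sketch of steps (1)-(6): grow A with pseudolabels voted on by K models."""
    rng = random.Random(seed)
    A, B = list(A), list(B)
    models = []
    for _ in range(max_iters):                    # step (6): at most N rounds
        if not B:                                 # step (6): stop when B is empty
            break
        shuffled = A[:]
        rng.shuffle(shuffled)
        cut = max(1, len(shuffled) // 10)         # step (1): 9 : 1 split
        S, A1 = shuffled[:cut], shuffled[cut:]
        models, fold_errors = [], []
        for train, held_out in k_fold_splits(A1, k):   # step (2)
            model = train_toy_model(train)
            models.append(model)
            fold_errors.append(error_rate(model, held_out))
        cv_error = sum(fold_errors) / k           # step (3): epsilon
        test_error = sum(error_rate(m, S) for m in models) / k  # epsilon prime
        # The threshold check on cv_error and test_error is omitted here.
        batch = rng.sample(B, min(batch_size, len(B)))  # step (4): draw B'
        for token in batch:                       # step (5): joint voting
            tag = joint_vote(models, token)
            if tag is not None:
                A.append((token, tag))
                B.remove(token)
    return A, B, models
```

With K = 5, an agreement of 4 out of 5 models equals exactly 80% and is therefore rejected under the strict inequality; in this sketch only unanimous predictions are accepted for that K. Tokens that fail the vote remain in B and may be drawn again in a later round.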