Training Method and Device of Chemical Industry Chinese Language Model Based on Knowledge Distillation
Algorithm 1
Distillation training of the student model, and application of the trained student model to a downstream text classification task.
Input: training data x with its corresponding labels, and the trained teacher model BERT
#Initialization
Randomly initialize the student model parameters
#Starting model distillation
While not converged do
For each batch of x
Compute the word-embedding-layer loss, intermediate-layer loss, and prediction-layer loss between the teacher model and the student model
The student model distills knowledge from the teacher model
Minimize the total loss function
Update the parameters according to the gradients
End for
End while
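The three-part distillation loss in the loop above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of mean-squared error for the embedding and intermediate layers, the temperature-scaled soft cross-entropy for the prediction layer, and the weights `alpha`/`beta`/`gamma` (the hyperparameters mentioned in the remark) are all assumptions consistent with common distillation practice.

```python
import math

def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(teacher_logits, student_logits, temperature=2.0):
    """Prediction-layer loss: cross-entropy of the student's soft labels
    against the teacher's soft labels (both softened by the temperature)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def distillation_loss(teacher_emb, student_emb,
                      teacher_hidden, student_hidden,
                      teacher_logits, student_logits,
                      alpha=1.0, beta=1.0, gamma=1.0):
    """Total loss = weighted sum of the embedding-layer, intermediate-layer,
    and prediction-layer losses; alpha/beta/gamma are hyperparameter weights
    (illustrative names, not from the patent)."""
    l_emb = mse(teacher_emb, student_emb)
    # One MSE term per matched (teacher layer, student layer h) pair.
    l_hid = sum(mse(t, s) for t, s in zip(teacher_hidden, student_hidden))
    l_pred = soft_cross_entropy(teacher_logits, student_logits)
    return alpha * l_emb + beta * l_hid + gamma * l_pred
```

Minimizing this combined loss with gradient descent, batch by batch, corresponds to the "minimize the loss function / update the parameters" steps of the loop.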
# Feed x into the student model BiLSTM to obtain its output as a sequence of word vectors, and feed that sequence into the network constructed for the downstream text classification task
While not converged do
For each batch of x
Take the maximum value along the class dimension as the predicted classification result
End for
End while
Output: prediction results
Remark: the loss-weighting coefficients are hyperparameters; h denotes the layer index in the student model
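The final prediction step, taking the maximum value along the class dimension, is an argmax over the classifier's scores. A minimal sketch (the names `argmax` and `predict_batch` are illustrative, not from the patent):

```python
def argmax(scores):
    """Index of the maximum score along the class dimension,
    i.e. the predicted class label for one example."""
    return max(range(len(scores)), key=lambda i: scores[i])

def predict_batch(batch_scores):
    """Apply argmax to each example's class scores to get the
    prediction results for a whole batch."""
    return [argmax(scores) for scores in batch_scores]
```

For example, `predict_batch([[0.1, 0.9], [0.8, 0.2]])` yields `[1, 0]`: the first example is assigned class 1 and the second class 0.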