Research Article

Training Method and Device of Chemical Industry Chinese Language Model Based on Knowledge Distillation

Algorithm 1

Train the student model by knowledge distillation and apply the trained student model to a downstream text classification task.
Input: training data x with its corresponding labels, and the trained teacher model BERT
# Initialization
Randomly initialize the student model parameters
# Start model distillation
While not converged do
 For each batch of x do
  Compute the word-embedding-layer loss, the intermediate-layer loss, and the prediction-layer loss with respect to the teacher model
  The student model distills the teacher model's knowledge through these losses
  Minimize the combined loss function
  Update the student parameters according to the gradients
 End for
End while
# Feed x into the student model BiLSTM to obtain its output as a sequence of word vectors, and pass the sequence to the constructed network for the downstream text classification task
While not converged do
 For each batch of x do
  Feed the batch through the BiLSTM and the classification network
  Take the maximum value along the class dimension of the output as the predicted classification result
 End for
End while
Output: prediction results
Remark: the weighting coefficients of the loss terms are hyperparameters; h denotes a layer of the student model
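
The distillation phase of Algorithm 1 combines a word-embedding-layer loss, an intermediate-layer loss, and a prediction-layer loss. The sketch below is a minimal PyTorch illustration of such a combined loss, assuming MSE for the embedding and hidden-state terms and temperature-softened KL divergence for the prediction term; the layer mapping, the coefficients alpha, beta, gamma, and the temperature T are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal PyTorch sketch of a three-part distillation loss as in Algorithm 1.
# The layer mapping, weighting coefficients, and temperature are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(teacher_emb, student_emb,
                      teacher_hidden, student_hidden,
                      teacher_logits, student_logits,
                      layer_map, alpha=1.0, beta=1.0, gamma=1.0, T=2.0):
    """Combine embedding-layer, intermediate-layer, and prediction-layer losses."""
    # Word-embedding-layer loss: match student embeddings to the teacher's
    # (a linear projection would be needed if the hidden sizes differ).
    emb_loss = F.mse_loss(student_emb, teacher_emb)

    # Intermediate-layer loss: each student layer h is aligned with a chosen
    # teacher layer layer_map[h] and compared with MSE.
    hidden_loss = sum(
        F.mse_loss(student_hidden[h], teacher_hidden[g])
        for h, g in layer_map.items()
    )

    # Prediction-layer loss: soften both distributions with temperature T and
    # minimize the KL divergence (standard soft-label distillation).
    pred_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Weighted sum of the three terms; the weights are hyperparameters.
    return alpha * emb_loss + beta * hidden_loss + gamma * pred_loss
```

Minimizing this combined loss with a standard optimizer and back-propagation corresponds to the "minimize the loss function, update the parameters according to the gradients" steps in the first loop.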
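For the downstream step, the distilled BiLSTM student encodes the input as a sequence of word vectors, a classification network scores the classes, and the maximum along the class dimension gives the prediction. The sketch below illustrates this under assumed dimensions (e.g., a BERT-style Chinese vocabulary size of 21128) and module names; it is not the authors' exact network.

```python
# Minimal PyTorch sketch of the downstream classification step in Algorithm 1.
# Dimensions, vocabulary size, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> sequence of word vectors from the BiLSTM
        embedded = self.embedding(token_ids)
        outputs, _ = self.bilstm(embedded)
        # Pool over the sequence (mean over time) and score each class.
        logits = self.classifier(outputs.mean(dim=1))
        return logits

model = BiLSTMClassifier(vocab_size=21128)
token_ids = torch.randint(0, 21128, (4, 32))   # a toy batch of 4 sentences
logits = model(token_ids)
# The index of the maximum value along the class dimension is the prediction.
predictions = logits.argmax(dim=-1)
```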