Training Method and Device of Chemical Industry Chinese Language Model Based on Knowledge Distillation
Algorithm 1
Distillation training of the student model, and application of the trained student model to a downstream text classification task.
Input: training data x with its corresponding labels, and the trained teacher model BERT
#Initialization
Randomly initialize the student model parameters
#Starting model distillation
While not converged do
For each batch of x
Compute the word-embedding-layer loss, intermediate-layer loss, and prediction-layer loss between the teacher model and the student model
The student model distills knowledge from the teacher model
Minimize the total loss function
Update the parameters according to the gradients
End for
End while
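The three-part distillation loss in the loop above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of mean-squared error for the embedding and intermediate layers, the temperature-scaled soft cross-entropy for the prediction layer, and the weights `alpha`/`beta`/`gamma` (the hyperparameters mentioned in the remark) are all assumptions consistent with common distillation practice.

```python
import math

def mse(a, b):
    """Mean-squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(teacher_logits, student_logits, temperature=2.0):
    """Prediction-layer loss: cross-entropy of the student's soft labels
    against the teacher's soft labels (both softened by the temperature)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def distillation_loss(teacher_emb, student_emb,
                      teacher_hidden, student_hidden,
                      teacher_logits, student_logits,
                      alpha=1.0, beta=1.0, gamma=1.0):
    """Total loss = weighted sum of the embedding-layer, intermediate-layer,
    and prediction-layer losses; alpha/beta/gamma are hyperparameter weights
    (illustrative names, not from the patent)."""
    l_emb = mse(teacher_emb, student_emb)
    # One MSE term per matched (teacher layer, student layer h) pair.
    l_hid = sum(mse(t, s) for t, s in zip(teacher_hidden, student_hidden))
    l_pred = soft_cross_entropy(teacher_logits, student_logits)
    return alpha * l_emb + beta * l_hid + gamma * l_pred
```

Minimizing this combined loss with gradient descent, batch by batch, corresponds to the "minimize the loss function / update the parameters" steps of the loop.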
# Feed x into the student model BiLSTM to obtain its output as a sequence of word vectors, and feed that sequence into the network constructed for the downstream text classification task
While not converged do
For each batch of x
Take the maximum value along the class dimension as the predicted classification result
End for
End while
Output: prediction results
Remark: the loss-weighting coefficients are hyperparameters; h denotes the layer index in the student model
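The final prediction step, taking the maximum value along the class dimension, is an argmax over the classifier's scores. A minimal sketch (the names `argmax` and `predict_batch` are illustrative, not from the patent):

```python
def argmax(scores):
    """Index of the maximum score along the class dimension,
    i.e. the predicted class label for one example."""
    return max(range(len(scores)), key=lambda i: scores[i])

def predict_batch(batch_scores):
    """Apply argmax to each example's class scores to get the
    prediction results for a whole batch."""
    return [argmax(scores) for scores in batch_scores]
```

For example, `predict_batch([[0.1, 0.9], [0.8, 0.2]])` yields `[1, 0]`: the first example is assigned class 1 and the second class 0.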