Abstract

Recent advances in pretrained language models have achieved state-of-the-art results on various natural language processing tasks. However, these huge pretrained language models are difficult to deploy in practical applications such as mobile and embedded devices. Moreover, there is no pretrained language model for the chemical industry domain. In this work, we propose a method to pretrain a smaller language representation model for the chemical industry domain. First, a large corpus of chemical industry text is used for pretraining, and a nontraditional knowledge distillation technique is used to build a compact model that learns the knowledge in the BERT model. By learning from the embedding layer, the intermediate layers, and the prediction layer at different stages, the compact model learns not only the probability distribution of the prediction layer but also the representations of the embedding and intermediate layers, thereby acquiring the learning ability of the BERT model. Finally, the model is applied to downstream tasks. Experiments show that, compared with current BERT distillation methods, our method makes full use of the rich feature knowledge in the intermediate layers of the teacher model while building a student model on a BiLSTM architecture, which avoids the excessive size of traditional transformer-based student models and improves the accuracy of the language model in the chemical domain.

1. Introduction

The past years have seen several major breakthroughs in pretrained language models (PLMs), such as BERT [1], XLNet [2], RoBERTa [3], SpanBERT [4], and ALBERT [5]. While the learning ability of current PLMs has improved significantly, they often have hundreds of millions of parameters and require substantial computational power, which makes them difficult to apply to real-world problems. Nevertheless, current research on PLMs shows that training a large and complex language model still brings great performance on many tasks.

The trend toward bigger models has become inevitable but has raised some concerns. A typical example is the BERT model, which caused a sensation in the NLP community when it was released and has about 340 million parameters in its large configuration. The BERT base model is trained on 4 cloud TPUs (16 TPU chips in total), and BERT large is trained on 16 cloud TPUs (64 TPU chips in total); each pretraining run lasts 4 days. It follows that training BERT places high demands on computing power and memory. While deploying these PLMs for real-time operation on devices may bring better services, their growing computational and memory requirements may hamper wide adoption.

Many studies have shown that domain-specific pretrained language models perform better on domain tasks, and that a large corpus and a reasonable model structure allow a model to improve its learning ability [6]. The chemical industry is part of the basic economy of China and one of its pillar industries, playing an important role in China's economic growth. On March 11, 2019, the International Council of Chemical Associations (ICCA) released a report analyzing the contribution of the chemical industry to the global economy. According to the report, the chemical industry is involved in almost all production industries, and its contribution to global GDP through direct, indirect, and induced impacts is estimated at 5.7 trillion US dollars (7% of global GDP). Therefore, it is important to create a pretrained language model for the chemical industry so that text problems in this domain can be solved more efficiently.

To create a language model for the chemical field, we have to compress the large model and train it on a large amount of chemical-domain text. For compression, we choose knowledge distillation, but in a form different from previous knowledge distillation techniques [7]: the student learns not only from the probability distribution of the teacher model's final output but also from the embedding layer, the intermediate layers, and the prediction layer. Based on this framework, we propose a training method and device for a Chemical Industry Chinese Language Model based on knowledge distillation.

The main contributions of this work are as follows. (1) Traditional knowledge distillation methods for BERT often fail to fully learn the representational capabilities of each layer of the teacher model, or they rely on transformer-based student models that still have a huge number of parameters. We therefore propose a multilayer BiLSTM architecture for the student model that fully learns the representational capabilities of the teacher model and significantly reduces the number of student parameters at the expense of only a small loss in performance. (2) Pretrained language models are becoming larger and larger, which makes them difficult to apply in real life. To address this, we construct a lightweight multilayer BiLSTM student model that learns the representational capabilities of the teacher model; the proposed approach combines reasonable performance with a lightweight design, which makes it better suited to real industry-specific tasks. (3) There is currently no pretrained language model dedicated to the chemical industry domain. Based on a large chemical industry corpus, we construct a distillation framework for pretraining a chemical-domain language model that can later be applied to specific tasks in the chemical industry.

2.1. Pretrained Language Models (PLMs)

Recently, in the field of natural language processing (NLP), language model pretraining has improved performance on several NLP tasks and has attracted wide attention. Previous research on language models mainly includes feature-based methods and fine-tuning methods [8, 9]. The details are shown in Table 1.

It can be seen from Table 1 that the feature-based methods are mainly divided into three types. The first type is context-independent word representations, mainly including word2vec [10], GloVe [11], and fastText [12]. The second type is sentence-level representations; for example, continual learning for sentence representations using conceptors was proposed by Liu et al. [13], a part-of-speech-based long short-term memory network for learning sentence representations was proposed by Zhu et al. [14], and learning sentence representations from explicit discourse relations was proposed by Nie et al. [15]. The third type is contextualized word representations, the most typical of which is the ELMo model. The main feature of this approach is that the representation of each word is a function of the entire input sentence. Specifically, a bidirectional LSTM is first trained on a large corpus with a language modeling objective, and this LSTM is then used to generate word representations.

The fine-tuning approach pretrains the language model on a large corpus with an unsupervised objective and then fits the model to subsequent applications using labeled in-domain data. The BERT model is one of the models that follow this paradigm. Although this training paradigm makes the model perform well, the resulting increase in the number of parameters and the long training time make the model difficult to apply to real business scenarios. To solve this problem, this article proposes an effective solution, and the method is also suitable for the recently proposed XLNet, RoBERTa, SpanBERT, ALBERT, and other models.

2.2. Knowledge Distillation

To effectively solve the problem of oversized models, we focus on model compression technology [16–18], which can make models more compact and easier to apply in real life.

The traditional understanding is that training a deep network requires a large number of connections (weights); however, such training leads to a high degree of parameter redundancy. Pruning the network [19–21], that is, removing network connections, is a common strategy for model compression. Another direction is weight quantization, in which the connection weights are restricted to a set of discrete values so that fewer bits are needed to represent each weight. However, most of these pruning and quantization techniques [22, 23] are designed for convolutional networks; only a few works target specific structures such as deep language models [9, 24–26].

The goal of knowledge distillation is to compress a network with a large number of parameters into a compact and fast model. This can be achieved by training the compact model to mimic the soft outputs of the larger model. Mirzadeh et al. [27] proposed a framework based on teacher assistant knowledge distillation. Liu et al. [28] proposed distilling structured knowledge from large networks into compact networks. For distillation-based compression of BERT, Sun et al. [29] first proposed a framework that distills the intermediate layers of the BERT model, taking full advantage of the rich information in the teacher's hidden layers and encouraging the student model to learn from and imitate the teacher through multilayer distillation. Jiao et al. [30] proposed a two-stage BERT knowledge distillation framework that applies such distillation in both the pretraining and fine-tuning stages, resulting in richer knowledge transfer. Xu et al. [31] proposed a compression approach that gradually replaces modules in BERT with smaller modules; it uses only a loss function and a single hyperparameter and does not rely on transformer-specific features, making it a general method. Fu et al. [32] introduced contrastive learning into the construction of the distillation loss function and improved model performance. Feng et al. [33] addressed poor distillation caused by a lack of data through cross-domain data augmentation. Chen et al. [34] proposed an extract-then-distill strategy that reuses the parameters of the teacher model; it can be used with student models of any size, primes the student with some knowledge before distillation begins, speeds up convergence, and improves task-agnostic distillation efficiency. We study the problem of compressing large-scale language models and propose a training method and device for a Chemical Industry Chinese Language Model based on knowledge distillation that effectively transfers the teacher's knowledge to the student model.

3. Method

In this section, we propose a training method and device for the Chemical Industry Chinese Language Model based on knowledge distillation. The flow chart of the proposed framework is shown in Figure 1.

The proposed framework consists of a teacher model, a student model, and, most critically, a loss function connecting the two models. First, the teacher model is trained: the raw corpus text is fed into the teacher model to obtain the trained teacher weights. Then, knowledge distillation is performed by distilling the word embedding layer loss, the intermediate layer loss, and the prediction layer loss of the teacher model, so that these three losses are distilled into the corresponding three-layer BiLSTM of the student model. When distillation ends, we obtain a student model that has learned the behavior of the teacher model. The detailed procedure is given in Algorithm 1.

Input: training data x, the trained teacher model BERT, and the corresponding labels
# Initialization
Randomly initialize the student parameters
# Start model distillation
While not converged do
 For each batch of x
  Extract the word embedding layer loss, the intermediate layer loss, and the prediction layer loss from the teacher model
  The student model distills these losses to learn the teacher model's knowledge
  Minimize the total distillation loss L = α·L_emb + β·L_ht + γ·L_pred
  Update the parameters according to the gradients
 End for
End while
# Feed x into the student BiLSTM to obtain a sequence of word vectors and pass it to the network constructed for the downstream text classification task
While not converged do
 For each batch of x
  Feed the word vector sequence into the classification network, minimize the classification loss, and update the parameters according to the gradients
  Return the maximum value along the class dimension as the predicted classification result
 End for
End while
Output: prediction results
Remark: α, β, and γ are hyperparameters weighting the three losses; h indexes the layers of the student model
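
To make Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of the two training stages (distillation, then downstream classification). It is an illustrative sketch rather than the authors' TensorFlow 1.12 implementation: teacher, student, distill_loss, head, and the data loaders are assumed placeholders, and the student is assumed to return a tuple (embeddings, hidden states, logits).

import torch
import torch.nn.functional as F

def distill_student(teacher, student, distill_loss, loader, epochs=3, lr=1e-4):
    """Stage 1 of Algorithm 1: the student imitates the frozen teacher."""
    # distill_loss is assumed to be an nn.Module that also holds the learned projection matrices
    opt = torch.optim.Adam(list(student.parameters()) + list(distill_loss.parameters()), lr=lr)
    teacher.eval()
    for _ in range(epochs):
        for tokens in loader:                       # unlabeled chemical-domain text
            with torch.no_grad():
                t_out = teacher(tokens)             # teacher embeddings, hidden states, logits
            s_out = student(tokens)
            loss = distill_loss(*s_out, *t_out)     # weighted sum of the three layer losses
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_classifier(student, head, loader, epochs=3, lr=1e-4):
    """Stage 2 of Algorithm 1: downstream text classification on the distilled student."""
    opt = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(epochs):
        for tokens, labels in loader:               # labeled task data
            _, hiddens, _ = student(tokens)
            logits = head(hiddens[-1])              # assumed: pooled sentence vector from the top BiLSTM layer
            loss = F.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # At inference time the predicted class is logits.argmax(dim=-1).
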
3.1. Teacher Model

In the teacher model BERT, the original text corpus T is first read and, after line segmentation, stored as T′ = {d0, d1, ..., di, ...}, where di is the i-th article and stores the collection of all sentences of that article: di = {l0, l1, ..., lj, ...}, where lj is the j-th sentence of di, and lj = {t0, t1, ..., tk, ...}, where tk is the k-th token of lj. Next, the order of the articles is shuffled and, with dupe_factor = 10, random masking is applied so that 10·len(di) samples are generated for each article. When a sampled sequence exceeds the maximum sentence length Lmax, tokens are deleted at random from either the beginning or the end of the longer sentence; BERT's next sentence prediction task is discarded.
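
For concreteness, the preprocessing described above can be sketched in Python roughly as follows. This is an illustrative sketch, not the authors' code: it assumes a plain-text corpus with one sentence per line and blank lines between articles, character-level tokenization for Chinese text, and a masking rate of 15% (the rate actually used is not stated here).

import random

MASK, DUPE_FACTOR, L_MAX = "[MASK]", 10, 128

def load_corpus(path):
    """Read the raw corpus into T' = {d_i}, d_i = {l_j}, l_j = {t_k}."""
    articles, sentences = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                       # blank line separates articles
                if sentences:
                    articles.append(sentences)
                    sentences = []
            else:
                sentences.append(list(line))   # character-level tokens for Chinese text
    if sentences:
        articles.append(sentences)
    return articles

def make_masked_samples(articles, mask_rate=0.15):
    """Shuffle articles and create dupe_factor randomly masked copies of each sentence."""
    random.shuffle(articles)
    samples = []
    for article in articles:
        for _ in range(DUPE_FACTOR):
            for sent in article:
                tokens = sent[:L_MAX]          # truncate to the maximum sentence length
                masked = [MASK if random.random() < mask_rate else t for t in tokens]
                samples.append((masked, tokens))
    return samples
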

Each token of each sentence in T′ is passed through BERT's token embedding, segment embedding, and position embedding layers to obtain the token encoding E_tok, the segment encoding E_seg, and the position encoding E_pos, respectively. The input vector is obtained by adding these three encodings of the same dimension: E = E_tok + E_seg + E_pos.

The 12 transformer layers of BERT are cut down to 6 layers, and the input vector E obtained above is fed into this truncated bidirectional transformer encoder. The teacher model outputs a probability distribution m_t over the vocabulary for each masked token; with m_s denoting the true (one-hot) representation of the masked token, the loss is computed as

L_t = −Σ m_s log m_t,

where L_t is the random mask (masked language modeling) task loss. The teacher model is then optimized by gradient descent.
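
For reference, the masked language modeling loss L_t can be computed as in the following PyTorch sketch (an illustration, not the authors' TensorFlow code); teacher_logits, token_ids, and mask_positions are assumed inputs.

import torch.nn.functional as F

def mlm_loss(teacher_logits, token_ids, mask_positions):
    """Cross-entropy between the predicted distribution m_t and the true tokens m_s,
    computed only at the masked positions."""
    # teacher_logits: (batch, seq_len, vocab); token_ids: (batch, seq_len); mask_positions: boolean mask
    logits = teacher_logits[mask_positions]    # (num_masked, vocab)
    targets = token_ids[mask_positions]        # (num_masked,)
    return F.cross_entropy(logits, targets)
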

3.2. Student Model

In the student model, a multilayer neural network, the original text corpus T is first preprocessed and embedded in the same way as for the teacher model, except that the word vector dimension is half that of the BERT model. The preprocessed data are then fed into the multilayer neural network, which consists of three bidirectional long short-term memory (BiLSTM) layers. During training, the student model is corrected by learning from the teacher model's embedding layer, intermediate hidden layers, and prediction layer. The network structure of the student model is shown in Figure 2.

In the embedding layer, the loss between the embedding outputs of the teacher model BERT and of the student's multilayer neural network is computed as

L_emb = MSE(S_e W_e, T_e),

where MSE is the mean squared error, and the matrices S_e ∈ ℝ^{l×d′} and T_e ∈ ℝ^{l×d} are the embeddings of the student model and the teacher model, respectively. l = 128 is the input text length, d = 768 is the hidden size of the teacher model, and d′ = 200 is the hidden size of the student model; the embedding matrices have the same shape as the corresponding hidden state matrices. The matrix W_e ∈ ℝ^{d′×d} is a learned linear transformation that projects the student's embeddings into the same state space as the teacher's.

In the intermediate hidden layers, the output of each hidden layer of the student's multilayer neural network and the output of the corresponding transformer layer of the teacher model BERT are compared with the mean squared error:

L_hid = MSE(S_h W_h, T_{2h-1}),

where S_h ∈ ℝ^{l×d′} and T_{2h-1} ∈ ℝ^{l×d} are the hidden layer outputs of the student and teacher networks, respectively, and W_h ∈ ℝ^{d′×d} is a learned linear transformation that maps the student's hidden states into the same state space as the teacher's.

In the prediction layer, the probability distribution output by the softmax layer of the teacher model BERT and the probability distribution output by the softmax layer of the student network are compared with the cross-entropy

L_pred = −softmax(t_p) · log softmax(s_p / Tem),

where s_p and t_p are the logits predicted by the student model and the teacher model, respectively (the inputs to the softmax layer), log softmax is the logarithmic likelihood, and Tem = 1 is the temperature value.

By combining the above three distillation objectives, the distillation losses of the corresponding layers of the teacher and student models can be unified as

L = α L_emb + β L_ht + γ L_pred,  with  L_ht = Σ_h MSE(S_h W_h, T_{2h-1}),

where L_ht is the total loss of the intermediate hidden layers, S_h and T_{2h-1} denote the h-th hidden layer of the student model and the output of the corresponding (2h-1)-th layer of the teacher model, and α, β, and γ represent the importance of the different layers. The specific algorithm structure is shown in Figure 3.
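
Putting the three objectives together, the total distillation loss can be sketched as the following PyTorch module. This is an illustrative reimplementation under the shapes given above (l = 128, d = 768, d′ = 200, a 3-layer student mapped to teacher layers 1, 3, and 5), not the authors' TensorFlow code; the weights alpha, beta, and gamma are left as unspecified hyperparameters, and the projections W_e and W_h are trained jointly with the student.

import torch.nn as nn
import torch.nn.functional as F

class DistillationLoss(nn.Module):
    def __init__(self, d_student=200, d_teacher=768, num_student_layers=3,
                 alpha=1.0, beta=1.0, gamma=1.0, temperature=1.0):
        super().__init__()
        # learned projections from the student space (d') into the teacher space (d)
        self.W_e = nn.Linear(d_student, d_teacher, bias=False)
        self.W_h = nn.ModuleList(
            [nn.Linear(d_student, d_teacher, bias=False) for _ in range(num_student_layers)])
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.tem = temperature

    def forward(self, s_emb, s_hidden, s_logits, t_emb, t_hidden, t_logits):
        # embedding layer: L_emb = MSE(S_e W_e, T_e)
        loss_emb = F.mse_loss(self.W_e(s_emb), t_emb)
        # intermediate layers: student layer h imitates teacher layer f(h) = 2h - 1
        # (0-indexed: student index h maps to teacher index 2h, i.e., layers 1, 3, 5 in 1-based numbering)
        loss_hid = sum(
            F.mse_loss(self.W_h[h](s_hidden[h]), t_hidden[2 * h])
            for h in range(len(s_hidden)))
        # prediction layer: soft cross-entropy between teacher and student distributions
        loss_pred = -(F.softmax(t_logits, dim=-1)
                      * F.log_softmax(s_logits / self.tem, dim=-1)).sum(dim=-1).mean()
        return self.alpha * loss_emb + self.beta * loss_hid + self.gamma * loss_pred
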

4. Experiments

In this section, we evaluate the performance of the proposed model in comparison with different experimental models.

The experiments were run on an AI server configured with 2× Intel Xeon 6148 CPUs, 512 GB of memory, 4× 1.9 TB SSDs, a RAID card, 2× 10-gigabit network cards, 8× Tesla V100 GPUs, 2× dual-port 100 Gbps HCA cards, and 3000 W 1+1 redundant power supplies. The framework selected for the experiments was TensorFlow 1.12.0. The Chinese chemical industry dataset in this experiment comes from recruitment information on recruitment websites such as Yingcai; the original corpus contains 1,976,522 entries.

4.1. Model Setup

Due to limited equipment, we retain only 6 transformer layers of the BERT base model, and, following previous studies, we also abandon the next_sentence task in our implementation. Previous research shows that a 6-layer BERT model still performs well.

We created a multilayer BiLSTM neural network with a hidden size of 200 as the student model. The truncated BERT model is used as the teacher model, with 6 layers, a hidden size of 768, and 12 attention heads. The layer mapping function between the teacher and the student is f(h) = 2h-1; that is, the student model learns from every second layer of the teacher model.
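
For clarity, with layers numbered from 1 as in the text, the mapping enumerates as follows.

# student layer h learns from teacher layer f(h) = 2h - 1
for h in (1, 2, 3):
    print(f"student BiLSTM layer {h} -> teacher transformer layer {2 * h - 1}")
# prints: 1 -> 1, 2 -> 3, 3 -> 5
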

4.2. Teacher Model

Here, we take the BERT model as our teacher model and do not make any special settings on it. Any large transformer-based pretrained model can be plugged into this framework.

BERT's model architecture is a multilayer bidirectional transformer encoder. The truncated BERT base used here consists of 6 layers, 12 self-attention heads, and 768-dimensional hidden state representations.

Similarly, for the settings of the comparison models, we adjusted the number of student model layers to match the 6-layer teacher model for fairness.

4.3. Student Model

We compare the following models.

4.3.1. Student Model without Distillation

We consider a BiLSTM encoder with word embeddings. The last hidden state of the BiLSTM is fed into a softmax layer for classification, and the network parameters are trained by optimizing the cross-entropy loss over labeled data. With this model, we use a basic tokenizer that lowercases all words and splits on whitespace.
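
A minimal PyTorch sketch of this non-distilled baseline is given below; the vocabulary size, embedding dimension, and number of classes are illustrative placeholders rather than the values used in the experiments.

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=200, hidden_dim=200, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.bilstm(x)                # h_n: (2, batch, hidden_dim)
        last = torch.cat([h_n[0], h_n[1]], dim=-1)  # concatenate forward/backward final states
        return self.classifier(last)                # logits; train with nn.CrossEntropyLoss
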

4.3.2. Student Model with Distillation

In this setting, we distill the aforementioned student with (soft/hard) targets and representations from the teacher. First, we fine-tune the teacher on labeled data and use it to generate logits and hidden state representations for unlabeled instances. We then train the student model end-to-end using the cross-entropy loss on labeled instances as well as the logit loss and representation loss on the unlabeled data. We test three learning strategies: a joint optimization scheme and two stagewise schemes with gradual unfreezing of the intermediate layers.
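
The stagewise strategies can be implemented with gradual unfreezing, as in the rough PyTorch sketch below; it assumes the student exposes its BiLSTM layers as student.bilstm_layers (topmost last) and that train_one_epoch is a hypothetical helper running one epoch of the distillation objective.

def gradual_unfreeze(student, optimizer, train_one_epoch, epochs_per_stage=1):
    """Freeze all BiLSTM layers, then unfreeze them one at a time from top to bottom."""
    layers = list(student.bilstm_layers)           # assumed attribute: topmost layer last
    for p in student.parameters():
        p.requires_grad = False
    for layer in reversed(layers):                 # stage k unfreezes the k-th layer from the top
        for p in layer.parameters():
            p.requires_grad = True
        for _ in range(epochs_per_stage):
            train_one_epoch(student, optimizer)    # only unfrozen parameters receive gradients
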

To verify the validity of the model proposed in this paper, the improved pretrained model is applied to a text classification task. Note that we do not compare this model with the recent MobileBERT [35], since MobileBERT employs transformer blocks with a different architecture.

The student and teacher settings are the same as described in Section 4.1: a multilayer BiLSTM with a hidden size of 200 as the student, the truncated 6-layer BERT (hidden size 768, 12 heads) as the teacher, and the layer mapping function f(h) = 2h-1, so that the student learns from every second layer of the teacher. The learning weights α, β, and γ for the embedding, intermediate, and prediction layers are set to values that perform well for the learning of the student model.

From Table 2, we can see that, compared with the latest methods, our method does not achieve the best accuracy or F1 score, but it ranks third, with only a small gap from the best method. Moreover, our student model uses a BiLSTM architecture with far fewer parameters, which makes training much faster at the expense of only a small portion of performance. To a certain extent, this also shows that a multilayer BiLSTM architecture can learn the behavior of a transformer-based model very well.

We investigate the effect of the distillation objectives on model learning. Several ablations are evaluated: our model without the hidden layer distillation (no hidden layer), without the embedding layer distillation (no embedding layer), and without the prediction layer distillation (no prediction layer). The results in Table 3 show that introducing all three losses into the distillation framework effectively improves the student model; performance drops most when the intermediate layer distillation is removed, followed by the prediction layer and the word embedding layer. To let the student model fully learn the behavior of the teacher, for the intermediate layers, which contain the richest knowledge, our method extracts teacher layers at intervals and distills them into the student, so that the student can better characterize the teacher's behavior as a whole. We also tried extracting feature information only from the shallow or only from the deep layers of the teacher model, but the coarse-grained features of a single shallow layer or the fine-grained features of the deep layers alone could not fully characterize the teacher's superior performance, so the student model showed no obvious improvement after distillation.

5. Conclusion and Future Work

We propose a method and device for training Chinese language models for the chemical industry based on knowledge distillation. Compared with traditional distillation approaches that use transformer-based student models, this paper constructs a multilayer BiLSTM student architecture, so that the superior performance of the teacher model can be fully learned by the multilayer structure while further reducing the number of student model parameters. Experiments on a text classification task show that the method suffers only an acceptable reduction in performance compared with the baseline models while the number of parameters is significantly reduced, which is important for the realistic application of the model in the chemical industry. In future work, we will further consider how to balance the number of student model parameters against learning ability, so that the model can be better applied in industry.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was sponsored by the National Key R&D Program of China (No. 2018YFB1004904), the National Natural Science Foundation of China (61976118), the Key Project of Jiangsu Provincial Department of Education (No. 18KJA520001), Six Talent Peaks Project in Jiangsu Province (XYDXXJS-011), and Jiangsu 333 Engineering Research Funding Project (BRA2016454).