Human Behavior Modelling in Engineering Management under Industry 4.0View this Special Issue
A Cooperative Lightweight Translation Algorithm Combined with Sparse-ReLU
In the field of natural language processing (NLP), machine translation algorithm based on Transformer is challenging to deploy on hardware due to a large number of parameters and low parametric sparsity of the network weights. Meanwhile, the accuracy of lightweight machine translation networks also needs to be improved. To solve this problem, we first design a new activation function, Sparse-ReLU, to improve the parametric sparsity of weights and feature maps, which facilitates hardware deployment. Secondly, we design a novel cooperative processing scheme with CNN and Transformer and use Sparse-ReLU to improve the accuracy of the translation algorithm. Experimental results show that our method, which combines Transformer and CNN with the Sparse-ReLU, achieves a 2.32% BLEU improvement in prediction accuracy and reduces the number of parameters of the model by 23%, and the sparsity of the inference model increases by more than 50%.
Machine translation, an essential branch of computational linguistics, is a process of translating source language into the target language by computer. Translation has extremely high requirements for translators, and at the same time, there is a lack of professional translators, so machine translation has made significant progress in international exchanges . In recent years, deep learning technology has developed rapidly. Researchers have introduced neural network into language model, which can better process the representation of common and rare words. For example, a recurrent neural network (RNN) can adapt to any sentence length and process the context recurrently to get the final result. Transformer applies the attention mechanism to machine translation and has better translation quality than traditional methods.
Compared with traditional statistical machine translation, which requires elaborate features, the flexibility of existing machine translation based on neural networks is greatly improved. Methods based on RNN and its derived models such as GRU and LSTM need to learn the long-distance dependencies of each input word vector. The principle is to use the embedding layer to map sentences to the embedding space and then use the hidden layer to compute the knowledge obtained in the previous step. As multiple hidden layers compute sequentially, calculations within a single hidden layer are executed sequentially and cannot be carried out in parallel. Different from the scheme of the RNN model that continuously accumulates input information, the Transformer network uses the Encoder-Decoder structure.
Because of its stacking self-attention layer and point-by-point full connection layer, the recursive structure in RNN is eliminated, and the network based on Transformer has the advantage of high parallelism. Transformer offers significant improvements to machine translation, but at the cost of a large number of parameters. The number of BERT large parameters is 334M, the number of BERT base parameters is 109M, and the number of IB-BERT large parameters is 293M. Due to its large number of parameters and low sparsity of parameters, it is generally applied to the server side, and there is no suitable edge side Transformer algorithm. The existing RNN model must wait for all previous input processing to be completed before processing the next input, which is a bottleneck when processing long sequences. The number of operations required by the RNN model to correlate information from two arbitrary input or output positions increases as the position distance increases. This makes extracting complicated dependencies between faraway positions more difficult. Therefore, RNN is difficult to parallel, which is not conducive to hardware acceleration, and the translation effect is not ideal. On the other hand, the hardware-implementation-friendly CNN can not effectively process location information and is not effective in machine translation tasks when applied alone.
It is an effective method for hardware deployment of the neural network to reduce parameter storage and transmission amount and reduce the dependence on hardware data transmission bandwidth by using the compressed sparse matrix method. And it has little influence on algorithm accuracy. The premise of this scheme is that the sparsity of algorithm parameters is high enough. The weight parameters of the Transformer have not been optimized for sparsity, which is difficult to be applied to hardware accelerated by a compressed sparse matrix [2, 3]. The traditional ReLU activation function adopted by Transformer models such as  does not improve the sparsity of the neural network algorithm to the maximum extent. An appropriate activation function is an important measure to improve network performance and reduce the number of network parameters.
The hardware deployment machine translation algorithm must meet the requirements of lightweight and maintain high-precision translation results. At the same time, in order to further reduce the difficulty of deployment on edge devices, the algorithm optimization must improve the sparsity of weight parameters, so as to use the sparse matrix compression method for hardware deployment. To solve the above problems, this research proposes a new activation function that can improve the algorithm accuracy and parameter sparsity at the same time and designs a cooperative machine translation algorithm combining CNN to extract local features and Transformer to process sequence information. Our method combined with Sparse-ReLU improved the BLEU score of the algorithm to 35.24, increased the sparsity by more than 150%, and controlled the total number of parameters within 38M. Our main contributions are as follows:(1)A new activation function, Sparse-ReLU, is proposed and applied to the machine translation model. The BLEU score of the IWSLT14 German-English translation task is enhanced from 34.29 to 35.16 by using the model whose parameter scale is 36.42 M. The number of parameters has been reduced, and more than 50% of sparsity has grown. Meanwhile, Sparse-ReLU can improve the translation effect.(2)A Transformer structure with low number of parameters is proposed, which only uses three attention heads and a 7-layer encoder and decoder. The number of parameters of this structure is only 36.42 M, which solves the problem that Transformer is too large to be deployed on the hardware.(3)A CNN structure for machine translation tasks is proposed and combined with a Transformer to optimize the network. The number of parameters of the overall algorithm is 37.99 M. The BLEU score is increased from 35.16 to 35.24.
1.1. Related Works
The machine translation algorithm based on the neural network generally adopts the Encoder-Decoder model to deal with the machine translation task . The encoder takes the source sentence as input and calculates a real expression value. The decoder inputs the real expression value and generates the target translation. CNN, RNN, and Transformer are the classical algorithms for constructing the Encoder-Decoder structure.
Machine translation jobs can be processed serially using an approach based on RNN and its derivatives LSTM [6, 7] and GRU . It has the advantage of high extraction ability in processing series information. For example, the RNN-based algorithm [9, 10] generates dynamic context representation through its Encoder-Decoder architecture based on attention mechanism. Research  creates target phrases with fixed source statement representation. To make the RNN and its derivative networks deeper and better,  employs a residual strategy and skip connections to further the RNN development.
The Transformer based algorithm  and its variants [5, 13–16] achieve the most advanced results on multiple language pairs only based on the attention mechanism. Research  improves the effect by increasing the scale of model parameters. Still, increasing the number of parameters means that more extensive data sets are needed, and the training is more complicated. It is not suitable for hardware, especially edge devices. CNN-based algorithms [17, 18] are concerned because of their high parallelism. Among them,  proposes a CNN-based machine translation algorithm with higher parallelism and a shorter long-term dependency than RNN.
Machine translation projects employ a variety of Transformer structures to optimize the size of the model and precision, as well as the bandwidth required for hardware deployment. Research  proposes a pruning algorithm to increase model sparsity and deploy the model on GPU, and research  proposes a sparse matrix calculation method. Both of them reduced the bandwidth requirements of matrix calculation on hardware. One is to improve the activation functions such as the ReLU and SoftMax. Research  introduces a novel activation function WReLU for lightweight neural network design. Research  introduces a random calculation method to replace the traditional SoftMax calculation, which reduces the calculation complexity and improves the speed. Research  is a collaborative processing scheme that combines the advantages of multiple networks, and it combines BiLSTM and recurrent attention for machine translation tasks.
These works have effectively promoted the development of machine translation. However, most of the existing Transformer schemes dealing with machine translation tasks only use the attention mechanism and lack the research results combined with the CNN model. Most optimized networks are still too large, and there are defects in input sequence order when using the RNN model to process sequence information. The effect of processing sequence information using the CNN model is not ideal, making them challenging to deploy in edge devices. Based on these, the new activation function Sparse-ReLU, CNN submodel structure, Transformer submodel structure, and cooperation scheme proposed in this research achieve a better effect under a particular parameter scale condition.
Unlike prior machine translation Transformer algorithms, this study proposes a Transformer model with few parameters, a CNN submodel, and a novel activation function Sparse-ReLU. CNN and Transformer submodels use Sparse-ReLU to optimize the effect. The three of them cooperate in dealing with machine translation tasks. CNN can process local information of word vectors and extract multiple features containing position-coding, and the attention mechanism can process local features extracted by CNN rather than input sentences. Figure 1 shows the process of our algorithm.
Figure 1 depicts the translation process. The input and output of the algorithm model are symbol sequences, and the word segmentation operation of the input symbol sequence uses the BPE word segmentation method. The position-coding operation embeds the position information into the symbol sequence obtained by word segmentation. The CNN submodel extracts the features of sentences containing location coding information. The Transformer submodel further extracts the output information of the CNN submodel. Both submodels use Sparse-ReLU.
2. Model Structure
2.1. Activation Function That Can Improve Sparsity and Accuracy
Most neural network algorithms currently require activation functions to introduce nonlinear operations. However, activation functions such as sigmoid have the disadvantages of complex hardware implementation and high resource consumption in algorithm deployment. When implementing the algorithm in hardware, sparse matrix acceleration is a viable option, and sparse matrix acceleration necessitates high weights sparsity. The ordinary activation function can not maximize sparsity matrix compression technology. Based on this, this research proposes a new activation function Sparse-ReLU for hardware optimization. The activation function has the advantages of low hardware implementation cost and low computing time like the traditional ReLU function. It solves the disadvantages of limited representation ability and insufficient flexibility of the conventional ReLU function.
For machine translation and other applications in natural language processing, real-time computing requirements are high. Because of its low power consumption and small area, the edge devices cannot effectively deploy translation algorithms with huge parameters. In particular, the Transformer algorithm, although the translation effect is good, can only be deployed on the server-side. It is of great significance to realize the activation function with high flexibility and simple hardware implementation.
The activation function designed in this research can improve the sparsity and accuracy of neural networks. The expression is shown in formulas (1)–(4), in which the range of parameter a is (0, 1), the content of b is (0,1), which satisfies a < b and c < d, and both a and b need to meet the relationships of 0.125x, to complete the multiplication calculation only through the hardware shift operation. As shown in formula (1)–(3), the three subfunctions of Sparse-ReLU are , , and .
The function of the subfunction shown in Formula (1) is to set the activation value of the neural network to 0 and improve the sparsity of its activation value. The traditional ReLU function sets the input of all negative numbers to 0 and the positive part to itself. The function in this research puts the output to 0 no matter what the input is. However, doing so will cause all information to be lost, so formulas (2) and (3) are required to retain the information. Taking the maximum value of the three subfunctions can obtain the formula of Sparse-ReLU.
When determining parameters a, b, c, and d, the functions with large c and d are preferred under the same translation effect, improving the sparsity of the weights and activations of neural network. When setting the parameters of Sparse-ReLU to a = 0.25, b = 1, c = 0.2, and d = 0.4 in Figure 2(a) , the translation effect of the activation function is the best. According to the formula, the traditional ReLU function is a subset of the activation function designed in this research. Figure 2(b) is the image of setting parameters a = 1, b = 1, c = 0.2, d = 0.2 to make Y1 = Y2 in Sparse-ReLU. In this case, Sparse-ReLU degenerates into an offset traditional ReLU function. Furthermore, ReLU is a subset of Sparse-ReLU, and the characterization power of Sparse-ReLU is higher.
In this research, Sparse-ReLU replaces the traditional function to provide nonlinear characteristics for the network. It is applied between two fully connected layers or between the CNN layer and the fully connected layer in the network. Parameters a, b, c, and d in Sparse-ReLU are found through training, to make the effect of the network model better than that of the traditional ReLU activation function. According to the iterative experiment, the German-English translation task of IWSLT14 dataset performs best when the parameters are a = 0.25, b = 1, c = 0.1, and d = 0.4. After determining parameters in Sparse-ReLU, combined with the pruning operation, the sparsity of the network model is improved. Unlike other sparsity improvement methods, this activation function can enhance the sparsity of weight parameters and the sparsity of activation, which is convenient for further hardware acceleration using a sparse matrix.
Because it can efficiently use shift and addition operations to realize the activation function in hardware and complete the prediction task of the model based on Sparse-ReLU, Sparse-ReLU can obtain multiple zero values, in which the experimental statistics are more than 50% higher than the value close to zero in the ordinary ReLU activation value. So, it can effectively improve the network sparsity. It has the characteristics of low resource consumption in hardware implementation and can enhance the sparsity of neural network parameters and the accuracy of the model prediction. Figure 3 shows the application of Sparse-ReLU in the model, as discussed in Section 1.2.
2.2. Transformer and CNN Submodels Combined with Sparse-ReLU
Figure 1 shows the overall design, which is made up of a CNN and a Transformer submodel. We will introduce the Transformer structure used in this research in detail.
The Transformer structure has a high ability to extract sequence information. Figure 4 shows the Transformer structure in this research, composed of the residual layer, multihead attention layer, Sparse-ReLU layer, and layer normalization. As shown in Table 1, the parameters of each Transformer structure are listed. It has the characteristics of low parameter quantity. The Encoder combines Sparse-ReLU, which can improve prediction accuracy.
The CNN submodel needs to process the input sentences and extract the features without changing the size of the feature map. Convolution can improve the extraction ability of local features of the network. Figure 5 shows the CNN submodel. After the sentence is entered into the model, it needs to go through embedding and position-coding operation and then carry out layer normalization operation. Finally, utilizing two-dimensional CNN to process the sentence coding results containing position information. Figure 5 shows the CNN structure, where Len is the sentence length.
The input channel of CNN is 1, the output channel is the length of the word vector, that is, 512 channels, the size of convolution kernel is (5,512), the stride step is 1, and the size of padding is two zeros for rows and no padding for columns. The dimension of the feature map before CNN processing is (Len, 512), Len is the sentence length, and the dimension of the feature map after CNN is (512, Len). After dimension transformation and Sparse-ReLU, the fully connected layer maps the result to another dimension and makes the residual connection with the original input. Table 2 lists the detailed measurements of each structure of CNN.
2.3. Collaborative Processing Scheme between CNN and Transformer Submodels
Considering the strong ability of CNN to extract local features, the combination of Transformer and CNN submodels can effectively improve the power of the algorithm to extract local features. There are a variety of cooperative processing strategies for multinetworks, including the concatenation operation method , result addition method , result point multiplication method , and matrix transformation after the concatenation operation. The model in research  adopted the method of submodules in series, effectively combined CNN and attention, and proposed an end-to-end ResNet structure model, which was used to extract local features, and summarized the local feature sequence through the attention mechanism. This research discusses the impact of CNN and Transformer on machine translation tasks. Figure 6 shows the details of various collaborative processing schemes discussed in this research. The CNN submodel of Figure 6 is the structure in Figure 5. Encoder and decoder are also the forms discussed in Figure 4.
To make the size of the matrix output of the CNN submodel be the same as that of the attention submodel, we set the number of output channels of the convolution kernel to be 512, which is the same as the length of the word vector in the matrix output of attention submodel. Figure 6(a) shows the scheme that the attention submodel learns the feature of input data, and the feed-forward module consisting of a CNN and a fully connected layer processes the result of attention computation, which is illustrated with the gray box. As shown in Figure 7, the output characteristic matrix of the feed-forward module is the result of the addition of a CNN and a fully connected layer. Figure 6(b) shows how to add a CNN submodel after the multihead attention layer of the encoder in the Transformer. Figure 6(c) uses the attention submodel to summarize the feature sequence from the original sentence, then uses a CNN submodel with a ResNet structure to extract local features from the features summarized by the attention submodel, and finally uses the decoder submodel to decode the target text. Figure 6(d) shows that the CNN submodel first processes the input sentence and learns the local features of the sentence. The Transformer submodel extracts the sequence features processed by the CNN submodel. After the encoder operation in the Transformer, it is handed over to the decoder submodel for decoding operation again.
2.4. Algorithm Model
Figure 8 shows the details of the model in this research. According to the introduction of the previous three sections, the model in this research mainly includes the convolutional neural network feature extraction layer, encoder layer, and decoder layer. Table 3 shows the structural parameters of the algorithm.
3. Experiment and Result Analysis
3.1. Machine Translation Dataset and Word Segmentation Algorithm
This research selects the German-English translation task for the experiment, and the dataset is IWSLT14. The training set contains 160250 sentences, and the testing set uses 6750 independent sentences .
The input and output of the translation algorithm are symbol sequences. These symbols are the basic units of sentences. Because extensive vocabulary cannot be naturally decomposed into words, using words as the basic units to form sentences will make it challenging to train the algorithm. An alternative is to use word segmentation algorithms such as [19, 25] to learn subwords from the dataset. This research uses the method of research  for word segmentation, which introduced the BPE algorithm variant for word segmentation, which can encode the available vocabulary with the vocabulary of variable length subword units. In this experiment, the size of the word table is 10000.
3.2. Evaluation Metric
Many automatic evaluation metrics have been proposed in the machine translation task to evaluate the quality of translation results. This research adopts the most popular BLEU  evaluation. It summarizes the overlapping words and phrases between machine translation and reference results. The translation results judged by the BLEU evaluation metric are highly consistent with those considered by human beings and have become a de facto translation evaluation standard after being proposed. This research does not use single testing set for BLEU score evaluation but combines multiple testing sets for score evaluation.
3.3. Experimental Environment and Model Training
This experiment uses 8 Titan XP graphics cards. PyTorch version is 1.4.0, and the CUDA version is 10.2. Table 4 lists the detailed configuration used in the experiment.
The algorithm uses a dropout operation to prevent training overfitting to ensure the training quality . We use the dropout operation before the layer normalization of the CNN submodel, multihead attention submodel, encoder and decoder submodel output, and final output layer. Table 3 lists the network parameter configuration. The step of warmup is 8000. In the prediction process, a beam search algorithm is used instead of a greedy algorithm to obtain better prediction results, where the beam size is 5.
3.4. Result Analysis
3.4.1. BLEU Score Comparison
This model is being used to test German-English translation tasks. Table 5 lists the translation results of this algorithm and expected results.
Table 6 lists the comparison between the translation results of various types of machine translation algorithms and the algorithms in this research. The parameter size of the algorithm proposed in this research is 37.99 M, and the BLUE score reaches 35.24, which is 52.554% higher than other schemes such as research , 17.860% higher than research , and 2.442% higher than research . The score of the algorithm model in this work is enhanced by 2.323% compared to the classic Transformer model , and the parameter size is decreased by 11.28 M, which is reduced by 23%. Compared with Dynamic CNN , the score of the proposed algorithm is 1.264% higher, and the parameter size is decreased by 247.01 M, which is reduced by 87%. Compared with the Transformer using the ReLU, the score of the Transformer using Sparse-ReLU is improved by 0.87 scores from 34.29 scores to 35.16 scores, an increase of 2.54%. The score of the optimization result using Sparse-ReLU and CNN is 35.24, an increase of 2.77%.
According to the four cooperation schemes between CNN and Transformer designed in Section 2.3, Table 7 lists the translation results obtained by the seven structures. The CNN structure of structure 2 is the one-dimensional CNN proposed in Figure 7, and the other CNN structures are the two-dimensional CNN submodel structure proposed in Figure 5. The input channel is 1, and the Transformer submodule is the structure proposed in Section 2.2. The input and output channels are word vector lengths, and the word vector dimension is the input channel of the convolution kernel. Using structure 7, when the convolution kernel size is 5 × 512, the BLEU score achieves the best result of 35.24 points.
Figure 9 shows the comparison results of the decline curves of loss during different structure training. The solid line in the figure is the loss decline curve of the benchmark model, and the dotted line is the loss decline curve of the Transformer model with CNN and Sparse-ReLU. Compared with the benchmark model, the loss reduction speed of the model with CNN and Sparse-ReLU is much higher than that of the benchmark model. Under the same 300 epochs, the loss value is reduced by 10.6%.
3.4.2. Sparsity Comparison
Table 8 lists the effects of Sparse-ReLU and ReLU on the sparsity of the algorithm. Compared with the conventional ReLU function, the Sparse-ReLU proposed in this research increases the sparsity of the relevant layer by 150.95% and reduces the loss value by 42.2%. Both indexes have an excellent optimization effect.
When the parameters of the new activation function are set to a = 0.25, b = 1, c = 0.2 and d = 0.4, the algorithm is pruned to improve the sparsity of the parameters. Table 9 lists the effects of traditional and new activation functions on the sparsity of the model when the model adopts different pruning algorithms. Table 9 shows that, compared with L1 norm pruning, the random unstructured pruning algorithm and Sparse-ReLU jointly improve the sparsity of the weight to 78.26% and the sparsity of the activation value of the tested layer to more than 120%.
The bar graph of key information in the above two tables is shown in Figure 10. Figure 10(a) shows that the Sparse-ReLU used in this paper significantly decreases the loss value and improves the sparsity when training the model compared with ReLU. According to Figure 10(b), when the model with Sparse-ReLU uses different pruning algorithms, it can further improve the sparsity with little accuracy loss.
This paper proposed a cooperative machine translation algorithm based on CNN and Transformer submodels combined with Sparse-ReLU, where the CNN is used to extract local features of sentences containing location information, the Transformer is used to further extract sequence features, and the Sparse-ReLU can optimize the algorithm. Compared with the traditional counterpart, the count of parameters decreased by 23% with accuracy increased by 2.77%, and the sparsity increased by 50%. Consequently, Transformer and CNN parameters are only 36.42 M and 1.57 M, respectively. Test results show that the proposed scheme can effectively improve the accuracy of model translation and the sparsity of activation and weight value.
In future works, the author of this research will continue to study the collaborative scheme between CNN and Transformer and the parameter training method of Sparse-ReLU, hoping to achieve better results.
The experimental data used to support the findings of this study are included within the article. The dataset data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Xintao Xu and Yi Liu contributed equally to this article.
This research was funded by National Natural Science Foundation of China (Grant Number: U19A2080, U1936106), the CAS Strategic Leading Science and Technology Project (Grant Number: XDA27040303, XDA18040400, XDB44000000), and High Technique Projects (Grant Number: 31513070501, 1916312ZD00902201).
M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 359–371, Columbus, OH, USA, October 2019.View at: Publisher Site | Google Scholar
X. Xie, Z. Liang, P. Gu, and A. Basak, “Spacea: sparse matrix vector multiplication on processing-in-memory accelerator,” in Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 570–583, IEEE, Seoul, Korea, March 2021.View at: Publisher Site | Google Scholar
A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008, Long Beach, CA, USA, December 2017.View at: Google Scholar
A. Fan, S. Bhosale, H. Schwenk, and Z. Ma, “Beyond English-centric multilingual machine translation,” Journal of Machine Learning Research, vol. 22, no. 107, pp. 1–48, 2021.View at: Google Scholar
D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and Translate,” Computer Science, vol. 7, 2014.View at: Google Scholar
I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the Advances in Neural Information Processing Systems, pp. 3104–3112, Montreal, Canada, December 2014.View at: Google Scholar
J. Gehring, M. Auli, D. Grangier, and N. D. Yann, “Convolutional sequence to sequence learning,” in Peds of the International Conference on Machine Learning PMLR, pp. 1243–1252, Sydney, Australia, August 2017.View at: Google Scholar
Y. Liu, X. Guo, and K. Tan, “Novel activation function with pixelwise modeling capacity for lightweight neural network design,” Concurrency and Computation: Practice and Experience, Article ID e6350, 2021.View at: Google Scholar
S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018, Nashville, TN, USA, June 2021.View at: Google Scholar
M. Cettolo, J. Niehues, and S. Stüker, “Report on the 11th iwslt evaluation campaign,” in Proceedings of the 2014 International Workshop on Spoken Language Translation, vol. 57, Hanoi, Vietnam, December 2014.View at: Google Scholar
T. Kudo and J. Richardson, “SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, December 2018.View at: Publisher Site | Google Scholar
K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the Annual Meeting Of The Association For Computational Linguistics, Philadelphia, PA, USA, July 2002.View at: Google Scholar
N. Srivastava, G. Hinton, A. Krizhevsky, and S. Ruslan, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.View at: Google Scholar
A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: fast autoregressive transformers with linear attention,” in Proceedings of the International Conference on Machine Learning PMLR, pp. 5156–5165, New York, NY, USA, July 2020.View at: Google Scholar