Abstract
Previous translation models such as statistical machine translation (SMT), rule-based machine translation (RBMT), hybrid machine translation (HMT), and neural machine translation (NMT) have reached their performance bottleneck. The new Transformer-based machine translation model has become the favorite choice for English language translation. For instance, Google’s BERT model organizes the Transformer blocks into bidirectional encoder representations. It captures the users’ search intentions as well as the material that the search engine has indexed, and, unlike RankBrain, it does not need to evaluate previous searches to comprehend what people mean. BERT comprehends words, sentences, and whole passages much as humans do, achieves remarkable translation quality improvements over other state-of-the-art benchmarks, and demonstrates the great potential of the Transformer model. However, Transformer-based translation models mainly improve performance at the cost of growing model sizes and complexity, usually requiring millions of parameters, and it is hard for traditional computing systems to cope with the resulting memory and computation requirements. Although the latest high-performance computers can run these models, the biggest challenge in applying the Transformer model is to deploy it efficiently onto real-time or embedded devices. In this work, we propose a quantization scheme to reduce the parameter and computation complexity, which is of great importance for promoting the usage of the Transformer model. Our experimental results show that the original Transformer model in 32 bit floating-point format can be quantized to only 8 to 12 bits with negligible translation quality loss, a loss small enough to be acceptable to users in practice. Meanwhile, our algorithm achieves a compression ratio of up to 5.33×, which helps save the required complexity and energy during the inference phase.
1. Introduction
Machine translation has always been at the core of natural language processing tasks. The machine translation task is a subfield of computational linguistics that requires an algorithm to translate text or speech from one language into a different language. Machine translation faces several challenges. First, most translation algorithms obtain only low translation quality on very long sentences, whereas the Transformer-based model studied in this work can translate long and complex sentences effectively. Second, it is hard to design an end-to-end, simple, and effective translation pipeline; conventional translation algorithms usually require multiple complex steps, limiting the extensibility of the translation system.
The emerging deep learning (DL) models have continuously promoted the development of various tasks, including computer vision (CV), natural language processing, and others. Breakthroughs in neural architecture design and training algorithms for DL models have brought significant performance improvements over traditional hand-crafted algorithms. For this reason, the emerging neural machine translation (NMT) models have advanced the previous state-of-the-art models by utilizing advanced neural models, such as the recurrent neural network (RNN), long short-term memory (LSTM) [1], and the attention mechanism [2]. These neural models aim at learning the hidden mappings between given sequences via neural networks and attention mechanisms.
There has been a wide variety of NMT models proposed to improve the learning of the mapping between different natural languages. Kalchbrenner and Blunsom [3] modeled the translation task as a target recurrent language model. Based on this formulation, they adopt a vanilla RNN model to realize the translation and improve the perplexity of alignment-based translation models by over 43%. However, due to the intrinsic drawbacks of vanilla RNN models (such as gradient explosion), training and fine-tuning this model are difficult. The Stanford neural machine translation system for spoken languages is presented by the authors in [4]. They use a customized recurrent neural model, as in Figure 1, that realizes the attention mechanism; other recurrent modules, such as LSTM, can be easily incorporated into this model. The experimental results on English and German datasets demonstrate that the developed model is significantly sensitive to word order as well as syntax despite lacking alignments. Another strategy to improve translation quality is to combine the recurrent sequence model with other deep neural networks. Bensalah et al. propose the CRAN model [6], a hybrid of a convolutional neural network (CNN) and an RNN attention-based neural network. During the model tuning phase, regularization and dropout are adopted to overcome overfitting, and the translation quality for Arabic is also improved.

As the RNN model has reached its performance bottleneck, the new self-attention mechanism [2] has become the favorite choice for translation system design. Machine translation models based on multilayer self-attention (dubbed Transformer) have demonstrated improved translation quality on various large-scale challenges in recent years [7–10]. For instance, the BERT translation engine [11] has a variable number of encoder layers and self-attention heads, constructed on top of the introduced Transformer architecture [2]. BERT organizes the Transformer blocks into bidirectional encoder representations, which realize forward and backward attention. As a result, BERT achieves 4.6% to 7.7% absolute improvement over state-of-the-art translation benchmarks. Other models, such as BERTAC [12] and GPT [13], share similarities with BERT. All of these models have shown substantial accuracy improvements over the previous RNN- or LSTM-based models [3, 4, 6, 14]. In particular, adopting the Transformer as the basic building block benefits greatly from its strong generalization capability: one can widen or deepen the network to attain more hidden and attention states. Moreover, the Transformer-based attention model is easier to train and optimize, and other optimization techniques, such as metric learning [15], can be readily incorporated into the Transformer model and work well together with it.
The Transformer-based machine translation models yield promising results, yet the performance improvement comes at the high cost of ever-growing model sizes and complexity. Hence, the biggest challenge in applying the Transformer model is to deploy these models efficiently. The base BERT model [11] needs 84 million weights, while the big BERT model has over 257 million. Such large model sizes incur tremendous memory consumption and computational cost, making the models difficult to deploy onto real-time or embedded devices. Therefore, reducing the parameter and computation complexity is of great importance for promoting the usage of the Transformer model, and it also helps improve the achievable performance gain.
The main contributions of this work are summarized as follows:
(i) We focus on resolving the high-complexity and memory-intensive challenges mentioned above. For machine translation jobs, we increase the speed and efficiency of Transformer inference.
(ii) Instead of using the original Transformer model in 32 bit floating-point format, we present a quantization scheme to quantize the full-precision weights into fixed-point numbers. This helps reduce the bit width of the stored weights, thus saving arithmetic complexity for the inference phase.
(iii) We perform experiments to evaluate the benefits of the proposed optimization schemes and study their impact on the final translation quality.
The rest of the research paper is organized as follows. Section 2 describes the proposed optimization approaches for Transformer to realize the English machine translation. Section 3 shows the experiment and analysis results. Section 4 concludes the paper.
2. Proposed Approach
Due to its promising performance, the Transformer model [2] has been used for various NLP tasks. In this section, we discuss how to optimize the Transformer model to improve its energy and space efficiency so that it fits into devices with limited computing power.
2.1. Transformer Model
As shown in Figure 2, the Transformer model consists of two main parts, namely, the encoder and decoder modules. The positional encoding and embedding modules are among the various peripheral modules. The following sections introduce the details of these modules.

The Transformer model receives the output from the input or output embedding modules. Then, the embedding outputs are encoded by the positional encoding module. The encoded inputs are then sent to the encoder or decoder module. As we can see from Figure 2, each encoder module consists of one multiheaded self-attention (MSA) module, two layer-normalization (LN) modules, and one feedforward module.
2.1.1. Positional Encoding
The sequence data are first computed by the embedding module and then become the embedded tokens before being sent to the Transformer encoder. Since the Transformer has a feedforward architecture without recurrent structures, the model itself has no sense of the order of each word when the words in a sentence flow through the Transformer’s encoder and decoder blocks. To resolve this issue, the positional encoding is used to add to each word the information about its position in the sentence. The positional encoding adds the position information to the data as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)),

where d_model denotes the embedding dimension for the input data, pos is the position of the token in the sequence, and i indexes the embedding dimension.
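As an illustration, the sketch below implements the sinusoidal positional encoding described above in plain NumPy; the function name and the default d_model = 768 are our own choices, mirroring the hidden size used later in the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model=768):
    """Sinusoidal positional encoding; returns an array of shape (seq_len, d_model)
    that is added to the embedded tokens so the model can recover word order."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even indices use sine
    pe[:, 1::2] = np.cos(angles)                           # odd indices use cosine
    return pe

# Example: add position information to ten embedded tokens.
tokens = np.random.randn(10, 768)
encoded = tokens + positional_encoding(10)
```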
2.1.2. Self-Attention
The key to the recent Transformer [2] model is the self-attention mechanism. It is a type of algorithm that handles the long-term data dependence for sequence data like text or words. The input tokens are first calculated by a linear layer to generate three matrices. These are the query matrix Q, value matrix V, and key matrix K. Specifically, the FC layer computes three matrix multiplications to generate matrices Q, K, and V, such that the input tokens are multiplied by three weight matrices:

Q = X W_Q, K = X W_K, V = X W_V,

where X represents the input tokens and W_Q, W_K, and W_V are the weight matrices.
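The following minimal sketch shows these linear projections; the hidden size d = 768, the sequence length, and the random initialization are assumptions for illustration rather than the exact configuration used in the paper.

```python
import numpy as np

d = 768                                   # hidden dimension (assumed)
L = 10                                    # sequence length (assumed)
rng = np.random.default_rng(0)

X = rng.standard_normal((L, d))           # input tokens after embedding + positional encoding
W_q = rng.standard_normal((d, d)) * 0.02  # learned projection weights (randomly initialized here)
W_k = rng.standard_normal((d, d)) * 0.02
W_v = rng.standard_normal((d, d)) * 0.02

Q = X @ W_q                               # query matrix, shape (L, d)
K = X @ W_k                               # key matrix,   shape (L, d)
V = X @ W_v                               # value matrix, shape (L, d)
```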
The key to the Transformer encoder is the MSA module that realizes long-range attention. After the projections above, the obtained Q, K, and V matrices are fed into the MSA module. The data attention between the query and key matrices is obtained by calculating the dot-product such that

A = Q K^T / sqrt(d),

where A denotes the attention scores computed by the dot-product operation. Here, Q, K, and V have the shape of L × d, where L is the sequence length while d is the hidden dimension size. The output matrix A with shape L × L is converted to a probability field by a Softmax function. The attention output is generated by multiplying the output of the Softmax by the value matrix V.
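Continuing the sketch above, the scaled dot-product attention can be written as follows. This is a single-head version for clarity, whereas the MSA module splits the hidden dimension across multiple heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # (L, L) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # (L, d) attention output

# Usage with the Q, K, V matrices from the previous sketch:
# out = scaled_dot_product_attention(Q, K, V)
```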
The output generated from the MSA module passes through the LN module with addition and normalization operations. Then, the intermediate results are sequentially processed by one FFN module and one LN module. These operations can be summarized as the following equations:

Y_l = LN(MSA(X_l) + X_l),
Z_l = LN(FFN(Y_l) + Y_l),

where X_l denotes the input of the encoder block in the l-th layer while MSA(X_l) is the output of the MSA module, and Z_l is the output of the l-th encoder block.
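Putting the pieces together, a minimal sketch of one encoder block (MSA, residual additions, layer normalization, and a two-layer FFN) could look as follows; the ReLU activation and the FFN weight shapes are common conventions and are assumptions here, not details taken from the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(X, attention, W1, b1, W2, b2):
    """One encoder layer: Y = LN(MSA(X) + X); Z = LN(FFN(Y) + Y).

    `attention` is a callable computing the MSA output for X;
    W1/b1 and W2/b2 are the two FFN layers (shapes (d, d_ff) and (d_ff, d))."""
    Y = layer_norm(attention(X) + X)            # MSA + residual + LN
    hidden = np.maximum(0.0, Y @ W1 + b1)       # FFN first layer with ReLU
    Z = layer_norm(hidden @ W2 + b2 + Y)        # FFN second layer + residual + LN
    return Z
```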
2.2. Fully Quantized Transformer
Searching for efficient neural network quantization is an active field. Low bit width after quantization is beneficial for reduced complexity and memory consumption. A lot of methods have been proposed to quantize neural networks without causing performance degradation. Our proposed quantization scheme is based on uniform quantization with a constant step size between two quantization values. The practical reason for choosing uniform quantization is that it can easily be implemented on most existing computing systems without additional modification. It also enables the exploitation of hardware resources more efficiently.
The first step for the quantization scheme is to choose a scaling factor s as follows:

s = (β − α) / (2^b − 1),

where b represents the quantization bit width. α and β denote the learnable hyperparameters that determine the quantization interval. Instead of directly using the maximum or minimum values of the input x, we add the learnable hyperparameters to adaptively adjust the interval.
Using the acquired scaling factor s, the input data x is scaled and quantized as follows:

x_q = round(clip(x, α, β) / s),

where round(·) denotes the rounding operation to the nearest integer and clip(·) restricts the input to the interval [α, β].
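A minimal sketch of this uniform quantization step in NumPy is given below. The function name is ours, and the usage example simply takes the tensor's min/max as α and β for illustration; in the scheme described above these bounds are learnable instead.

```python
import numpy as np

def uniform_quantize(x, alpha, beta, bits):
    """Uniformly quantize x onto a grid of step s covering [alpha, beta]."""
    s = (beta - alpha) / (2 ** bits - 1)     # scaling factor (quantization step)
    x_clipped = np.clip(x, alpha, beta)      # restrict values to the quantization interval
    codes = np.round(x_clipped / s)          # integer codes (fixed-point representation)
    return codes * s                         # dequantized values used for computation

# Example: quantize a weight matrix to 8 bits.
rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)
w_q = uniform_quantize(w, alpha=w.min(), beta=w.max(), bits=8)
```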
To maximize the computational speed and memory reduction gain, we choose to quantize all operations of the Transformer model. Specifically, the inputs and weights of all matrix multiplications are quantized to b-bit width, and we use a unified bit width b across the whole model.
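To illustrate how the unified bit width b is applied to every matrix multiplication, the sketch below quantizes both operands before taking the product. It reuses the uniform_quantize helper from the previous sketch, and the use of per-tensor min/max bounds is again a simplification of ours.

```python
import numpy as np

B = 8  # unified quantization bit width b (assumed value)

def quantized_matmul(x, w, bits=B):
    """Quantize both the input activations and the weights, then multiply."""
    x_q = uniform_quantize(x, x.min(), x.max(), bits)  # quantized inputs
    w_q = uniform_quantize(w, w.min(), w.max(), bits)  # quantized weights
    return x_q @ w_q                                   # matmul on quantized operands
```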
3. Evaluation and Analysis
In this section, we evaluate the proposed quantization algorithm through conducting experiments on a popular language translation dataset.
3.1. Dataset
We use the sentence pairs extracted from two subsets of the WMT 14 multilingual language dataset [4] to evaluate the presented model. The preprocessed version of the WMT 14 dataset used here was released by Stanford in 2015. As shown in Table 1, the two subsets are the English-German (EN-DE) and English-French (EN-FR) translation datasets. The EN-DE dataset contains 4.5 M sentence pairs in text file format, while the EN-FR dataset contains 45 M sentence pairs.
3.2. Experimental Setup
In this subsection, we describe the baseline model and the experimental setup used in the evaluation.
3.2.1. Baseline Model
We adopt the same model structure as BERT-base in the original work [11]. The base Transformer model uses a total of 12 Transformer blocks with a hidden size of 768 and 12 self-attention heads, arranged as 6 encoders and 6 decoders. There are around 110 million trainable parameters. It is worth noting that the embedding matrix is not included.
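For concreteness, the baseline configuration can be summarized as a simple dictionary. This is our own summary of the numbers stated above, not a configuration file from the original BERT release.

```python
# Hypothetical summary of the baseline (BERT-base-like) Transformer configuration.
baseline_config = {
    "num_layers": 12,           # total Transformer blocks (6 encoders + 6 decoders)
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "approx_trainable_params": 110_000_000,  # embedding matrix not included
}
```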
3.3. Implementation
We implemented our algorithm using TensorFlow, a popular DL framework. To speed up the training process, the models are trained on a Linux server with two Nvidia RTX 2080 GPUs and 64 GB of memory. The effective batch size of each GPU is set to 1,024 in order to stabilize the training procedure. The server is equipped with an Intel E5 processor with turbo mode enabled, and this configuration runs the model without noticeable bottlenecks. We also use the BLEU (BiLingual Evaluation Understudy) metric to automatically evaluate the quality of the machine-translated text.
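As a hedged example of how the BLEU score can be computed automatically, the snippet below uses the sacrebleu package; this is our choice of tool, since the paper does not state which BLEU implementation was used.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]        # machine-translated output, one string per sentence
references = [["the cat sits on the mat"]]     # one reference stream aligned to the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```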
3.4. Results
The experimental results are presented in this section. First, we apply our quantization scheme to the BERT-base model. We vary the quantization bit width from 6 to 12 in order to study its impact on the translation quality. The weights of the baseline model are in 32 bit floating-point format (32 b FP). The performance comparison for different quantization bit widths is summarized in Table 2. It can be seen that 12 b or 10 b quantization yields a BLEU score comparable to the baseline, with 12 b quantization obtaining a slightly higher score. As the bit width decreases, the obtained BLEU scores decline. The 6 b quantization incurs degradations of 5.23 and 6.59 BLEU on the EN-FR and EN-DE datasets, respectively.
Table 3 compares the model sizes under the different quantization configurations of the previous experiment. It can be observed that the adoption of a uniform quantization scheme effectively reduces the original model size from 3,356.9 GB to less than 1,300 GB. The compression ratio is at most 5.33× (32 b / 6 b ≈ 5.33). However, this maximum compression ratio sacrifices translation quality, which is not desirable in real application scenarios. Hence, the best tradeoff between model size and translation quality is obtained at 8 b quantization, where the quality degradation on the EN-FR and EN-DE datasets is only 0.64 and 0.38 BLEU, respectively.
4. Conclusion
The conventional translation methods, such as SMT, RBMT, HMT, and NMT, share a common disadvantage: their translation quality is low, and they struggle to translate lengthy sentences. The Transformer-based machine translation model, with its positional encoding and self-attention schemes, has resolved this issue. Another major challenge is how to compress and optimize the Transformer model so that inference can be realized efficiently. In this work, we focus on adopting uniform quantization methods to reduce the Transformer model size at the cost of negligible performance degradation. Compared to previous work, the proposed approach can be easily incorporated into existing hardware architectures. It quantizes 32 bit floating-point weights into 8 to 12 bits with only a slight quality loss, so that the resulting models can also run efficiently on commodity hardware. We evaluate the performance gain of the proposed algorithm on the WMT 14 dataset. Future work will be directed toward applying the proposed methods to existing self-learning or contrastive learning approaches [15] to further improve the performance.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by teaching and research project of Hubei University of Education in 2021 (no. X2021020).