Abstract

The Seq2Seq model and its variants (ConvSeq2Seq and Transformer) have emerged as promising solutions to the machine translation problem. However, these models focus only on exploiting knowledge from bilingual sentences without paying much attention to external linguistic knowledge sources such as semantic representations. Semantic representations not only help preserve meaning but also mitigate the data sparsity problem. To date, however, semantic information remains rarely integrated into machine translation models. In this study, we examine the effect of abstract meaning representation (AMR) semantic graphs in different machine translation models. Experimental results on the IWSLT15 English-Vietnamese dataset demonstrate the efficiency of the proposed models, showing that external linguistic knowledge sources can significantly improve the performance of machine translation models, especially for low-resource language pairs.

1. Introduction

Neural machine translation (NMT) [1–4] has proven its effectiveness and has therefore gained increasing attention from researchers in recent years. In practical applications, the typical inputs to NMT systems are sentences in which words are represented as individual vectors in a word embedding space. This word embedding space does not capture connections among words within a sentence, such as dependency or semantic role relationships. Recent studies [5–8] found that semantic information is essential for generating concise and appropriate translations. Although these models have made significant progress, their designs are limited to statistical machine translation systems. Consequently, the tasks of surveying, analyzing, and applying additional semantic information to NMT systems have not received comprehensive attention.

In this study, we present a method for integrating abstract meaning representation (AMR) graphs (https://amr.isi.edu) as additional semantic information into popular NMT systems such as Seq2Seq, ConvSeq2Seq, and Transformer. AMR graphs are rooted, labeled, directed, acyclic graphs representing the entire content of a sentence. They are abstracted away from the related syntactic representations in the sense that sentences with similar meanings share the same AMR graph, even if the words used in those sentences differ. Figure 1 illustrates an AMR graph in which the nodes symbolize concepts, while the edges represent the relationships between the concepts they connect. Compared to semantic role graphs, AMR graphs contain more relationships. Besides, AMR graphs directly hold entity relations while excluding inflectional variation (i.e., using lemmas) and function words. Therefore, AMR graphs can be combined with the input text to generate better contextual representations. Moreover, the structured information from AMR graphs can help mitigate the data sparsity problem in resource-poor settings. First, the AMR graph representations are combined with the word embeddings to create a better context representation for a sentence. Then, multihead attention can attend over all positions of the contextual features together with the outputs of the AMR graph representations.

Integrating AMR graphs into NMT yields several benefits. First, this addresses the problems of data sparsity and semantic ambiguity. Second, structured semantic information constructed from AMR graphs could help complement the input text by providing high-level abstract information, thereby improving the encoding of the input word embedding. Last, multihead attention can also take advantage of semantic information to improve the dependency among words within a sentence.

Recent studies have applied semantic representations to NMT models. For instance, Marcheggiani et al. [9] exploited semantic role labeling (SRL) information for NMT, indicating that the predicate-argument structure from SRL can help increase the quality of an attention-based sequence-to-sequence model. Meanwhile, Song et al. [10] showed that structured semantic information from AMR graphs can complement the input text by incorporating high-level abstract information. In this approach, a graph recurrent network (GRN) was utilized to encode AMR graphs without breaking the original graph structure, and a sequential long short-term memory (LSTM) was used to encode the source input. The decoder was a doubly attentive LSTM, taking the encoding results of both the graph encoder and the sequential encoder as attention memories. Song et al. also argued that AMR integration brings significantly larger gains than SRL integration alone because AMR graphs include both SRL information and the relationships between the nodes (i.e., words). However, Song's approach has some drawbacks: it fails to address the correlation between nodes in AMR graphs, and it investigates only a machine translation system based on recurrent neural networks (RNNs).

The contributions of our work are as follows:
(i) First, instead of adding a node to represent each edge in the graph and assigning the edge's properties to that node, we extend the node embedding algorithm [11] to use edge information directly.
(ii) Second, instead of using the graph recurrent network in [10], we propose an architecture that incorporates an inductive graph encoder.
(iii) Finally, we examine and analyze the results on the English-Vietnamese bilingual dataset, which is considered a low-resource language pair. Through experiments, we demonstrate the effectiveness of integrating AMR into neural machine translation and draw insightful conclusions for future studies.

The remainder of this article is organized as follows. Section 2 introduces current popular machine translation architectures such as Seq2Seq, ConvSeq2Seq, and Transformer. Section 3 presents the method of representing AMR graphs in vector form and proposes a method to integrate AMR graphs into different NMT models. Sections 4 and 5 describe the corpus used in the experiments and the experimental configuration of the models, respectively. Section 6 presents the experimental results of the AMR-integrated machine translation models and analyzes the effect of AMR on the models, along with some translation errors generated by the models. Section 7 summarizes our work.

2. Neural Machine Translation

In this section, we provide a brief introduction about the Seq2Seq model and its variants such as ConvSeq2Seq and Transformer.

2.1. Seq2Seq

We take the attention-based sequence-to-sequence model of [1] as the baseline model, but we use LSTM [12] in both encoder and decoder.

2.1.1. Encoder

Given a source sentence $x = (x_1, x_2, \ldots, x_n)$:

(i) Uni-LSTM. As usual, the RNN reads the input sequence in order, from the first token $x_1$ to the last token $x_n$, and computes a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n)$ to generate the input representation from left to right.

(ii) Bi-LSTM. It consists of a forward and a backward LSTM. The forward LSTM works as in Uni-LSTM, while the backward LSTM reads the sequence in reverse order, from the last token $x_n$ to $x_1$, resulting in a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n)$. We obtain the representation of each word by concatenating the forward and backward hidden states, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
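To make the encoder concrete, the following is a minimal PyTorch sketch of a bidirectional LSTM encoder; the class name, dimensions, and toy inputs are illustrative assumptions, not the exact Fairseq implementation used in our experiments.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Minimal bidirectional LSTM encoder sketch (hypothetical names and dimensions)."""
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True: each output position is already [forward ; backward]
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        emb = self.embed(src_tokens)
        outputs, _ = self.lstm(emb)
        # outputs: (batch, src_len, 2 * hidden_dim), i.e., h_i = [->h_i ; <-h_i]
        return outputs

enc = BiLSTMEncoder(vocab_size=5208)
states = enc(torch.randint(0, 5208, (2, 7)))  # two toy sentences of length 7
print(states.shape)  # torch.Size([2, 7, 1024])
```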

2.1.2. Decoder

The decoder predicts the next word $y_t$, given the context vector $c_t$ and all previously predicted words $y_{<t} = (y_1, \ldots, y_{t-1})$. We use an attention-based LSTM decoder [1], with the attention memory being the concatenation of the attention vectors over all source tokens.

For each decoding step $t$, the decoder feeds the concatenation of the embedding of the current input $y_{t-1}$ and the previous context vector $c_{t-1}$ into the LSTM to update the hidden state:
$$s_t = \mathrm{LSTM}\left(s_{t-1}, [e(y_{t-1}); c_{t-1}]\right),$$
where $e(\cdot)$ denotes the word embedding lookup.

Then, the new context vector $c_t$ is computed as
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i, \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}, \qquad e_{t,i} = a(s_t, h_i),$$
where $a$ is the alignment model, a feed-forward network that scores how well the input around position $i$ and the output at step $t$ match.

The output probability over the target vocabulary is calculated as
$$P(y_t \mid y_{<t}, x) = \mathrm{softmax}\left(W_o [s_t; c_t] + b_o\right),$$
where $W_o$ and $b_o$ are model parameters.
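For illustration, a minimal PyTorch sketch of one decoding step with additive attention is given below; all module and variable names are hypothetical, and the exact Fairseq implementation may differ (e.g., in how the attention scores are parameterized).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One step of an attention-based LSTM decoder (illustrative sketch)."""
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512, enc_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim + enc_dim, hidden_dim)
        # additive alignment model a(s_t, h_i): a small feed-forward network
        self.W_s = nn.Linear(hidden_dim, hidden_dim)
        self.W_h = nn.Linear(enc_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, y_prev, c_prev, state, enc_states):
        # y_prev: (batch,) previous target token; c_prev: (batch, enc_dim) previous context
        s, cell = self.cell(torch.cat([self.embed(y_prev), c_prev], dim=-1), state)
        # alignment scores e_{t,i} = v^T tanh(W_s s_t + W_h h_i)
        scores = self.v(torch.tanh(self.W_s(s).unsqueeze(1) + self.W_h(enc_states))).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                            # attention weights
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)   # new context vector
        logits = self.out(torch.cat([s, c_t], dim=-1))               # scores for P(y_t | y_<t, x)
        return logits, c_t, (s, cell)
```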

2.2. ConvSeq2Seq
2.2.1. ConvSeq2Seq

This architecture was proposed by Gehring et al. [2] to completely replace the RNN with the CNN. Its main components are described below.

The ConvS2S model follows the encoder-decoder architecture. Both the encoder and decoder blocks share an identical structure that computes hidden states based on a fixed number of input elements. To enlarge the context size, several blocks are stacked on top of each other. Each block comprises a one-dimensional convolution and a nonlinearity. Each convolution kernel is parameterized as $W \in \mathbb{R}^{2d \times kd}$ and $b_w \in \mathbb{R}^{2d}$. The input $X \in \mathbb{R}^{k \times d}$ is a concatenation of $k$ input elements of dimension $d$, and the kernel maps it to a single output element $Y \in \mathbb{R}^{2d}$ whose dimension is twice that of the input. The output elements are then fed into subsequent layers. We leverage the gated linear unit (GLU) as the nonlinearity, applied to the output of the convolution $Y = [A\ B]$:
$$\nu([A\ B]) = A \otimes \sigma(B),$$
where $A, B \in \mathbb{R}^{d}$ are the inputs to the nonlinearity, $\otimes$ denotes element-wise multiplication, the output $\nu([A\ B]) \in \mathbb{R}^{d}$ has half the size of $Y$, and $\sigma(B)$ is the gate that controls which inputs of the current context are relevant.

In order to enable deep convolutional blocks, we adopt residual connections, which connect the input of each convolutional layer to its output:
$$h_i^{l} = \nu\left(W^{l}\left[h_{i-k/2}^{l-1}; \ldots; h_{i+k/2}^{l-1}\right] + b_w^{l}\right) + h_i^{l-1},$$
where $h_i^{l}$ is the hidden state at position $i$ of layer $l$.
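A minimal PyTorch sketch of one such block (1-D convolution, GLU, and residual connection) is shown below, assuming padding that preserves the sequence length; weight normalization, dropout, and the decoder-side causal padding of the actual Fairseq implementation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUConvBlock(nn.Module):
    """One ConvS2S-style block: 1-D convolution + GLU + residual (illustrative sketch)."""
    def __init__(self, d=512, kernel_size=3):
        super().__init__()
        # the convolution maps k input elements of dimension d to a 2d-dimensional output
        self.conv = nn.Conv1d(d, 2 * d, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, seq_len, d)
        y = self.conv(x.transpose(1, 2))   # (batch, 2d, seq_len)
        y = F.glu(y, dim=1)                # GLU: A * sigmoid(B), back to d channels
        return y.transpose(1, 2) + x       # residual connection

block = GLUConvBlock()
h = torch.randn(2, 10, 512)
print(block(h).shape)  # torch.Size([2, 10, 512])
```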

2.2.2. LightConvSeq2Seq

LightConvSeq2Seq is built on a CNN variant called lightweight convolution [13], which allows computation with linear complexity $O(n)$, with $n$ being the length of the input sequence.

The structure of LightConvSeq2Seq consists of elements similar to those of ConvSeq2Seq but uses the lightweight convolution operation rather than the standard convolution operation.

Depthwise Convolution (DConv). It performs a convolution independently over every channel; thereby, the number of parameters is reduced significantly from $d^{2}k$ to $dk$, where $d$ is the number of channels and $k$ is the kernel width. In general, at position $i$ and output channel $c$, the output is calculated as follows:
$$\mathrm{DConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot X_{\left(i + j - \lceil (k+1)/2 \rceil\right),\, c}.$$
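As a small illustration, depthwise convolution can be obtained in PyTorch by setting the number of groups equal to the number of channels, which reduces the parameter count from $d^{2}k$ to $dk$ as described above; the dimensions below are arbitrary examples. Lightweight convolution additionally ties weights across groups of channels and normalizes them with a softmax, which this sketch does not include.

```python
import torch
import torch.nn as nn

# Depthwise 1-D convolution: groups == channels, so every channel is convolved
# independently (d * k parameters instead of d^2 * k).
d, k = 512, 7
dconv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=k,
                  padding=k // 2, groups=d, bias=False)

x = torch.randn(2, d, 20)                          # (batch, channels, seq_len)
print(dconv(x).shape)                              # torch.Size([2, 512, 20])
print(sum(p.numel() for p in dconv.parameters()))  # 512 * 7 = 3584 parameters
```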

2.3. Transformer

Transformer [4] also includes an encoder and a decoder. The encoder generates a vector representation of the input sentence. Given an input sentence $x = (x_1, \ldots, x_n)$ and its representation $z = (z_1, \ldots, z_n)$, the decoder sequentially produces a translation $y = (y_1, \ldots, y_m)$ of $x$ based on $z$ and the previous outputs.

2.3.1. The Encoder

There are $N$ stacked identical blocks. Each of these blocks consists of 2 subblocks: a self-attention mechanism and a feed forward network. A residual connection surrounds each subblock, followed by layer normalization. The general representation formula for the $l$-th encoder block is as follows:
$$\tilde{h}^{l} = \mathrm{LayerNorm}\left(h^{l-1} + \mathrm{SelfAttention}(h^{l-1})\right), \qquad h^{l} = \mathrm{LayerNorm}\left(\tilde{h}^{l} + \mathrm{FFN}(\tilde{h}^{l})\right).$$

2.3.2. The Decoder

There are also $N$ blocks. However, each block consists of 3 subblocks: a self-attention block, a feed forward block, and an encoder-decoder attention block inserted between them. The residual connection and layer normalization are used as in the encoder. The decoder generates outputs step by step. The self-attention block only attends to the positions generated in the previous steps by using a mask. The mask prevents the decoder from attending to positions that have not yet been generated, so outputs can only be predicted based on the result of the encoder and the previous outputs.

2.3.3. Self-Attention

There are 3 components as follows: query ($Q$), key ($K$), and value ($V$), defined as
$$Q = HW^{Q}, \qquad K = HW^{K}, \qquad V = HW^{V}, \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where $H$ is the matrix of input representations and $W^{Q}$, $W^{K}$, and $W^{V}$ are the parameters with $d_k$, $d_k$, and $d_v$ dimensions, respectively.
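The following is a minimal single-head sketch of this computation; the projection matrices are random placeholders, and the multihead variant simply runs several such projections in parallel and concatenates the results.

```python
import torch
import torch.nn.functional as F

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v              # project hidden states to Q, K, V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (batch, seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ V             # weighted sum of the values

H = torch.randn(2, 10, 512)
W_q, W_k, W_v = (torch.randn(512, 64) for _ in range(3))
print(self_attention(H, W_q, W_k, W_v).shape)        # torch.Size([2, 10, 64])
```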

3. The Proposed Method

In this section, we present the graph embedding algorithm and propose our method to integrate the AMR graph embedding representation to various well-known NMT systems such as Seq2Seq, ConvSeq2Seq, and Transformer.

3.1. Graph-Level Information Representation

Figure 2 depicts the graph encoder architecture based on the model of Xu et al. [11], with some enhancements to integrate more information about the edges of the graph.

We consider a directed graph in which the label on each edge represents the relationship between the two nodes that the edge connects. The process of learning the representation of a node $v$ is as follows:
(1) We first transform the text attribute of node $v$ into a feature vector $a_v$ by looking up the embedding matrix.
(2) Next, we categorize the neighbors of $v$ into two subsets: forward neighbors $N_{\vdash}(v)$ and backward neighbors $N_{\dashv}(v)$. Particularly, $N_{\vdash}(v)$ returns the nodes that $v$ directs to and vice versa. The representation of each adjacent node is combined with the information about the edge connecting it to $v$.
(3) We aggregate the forward representations of $v$'s forward neighbors into a single vector $h^{k}_{N_{\vdash}(v)}$, where $k$ is the iteration index, by using one of the three aggregators described below.
(4) Then, we concatenate $v$'s current forward representation $h^{k-1}_{v\vdash}$ with the new neighborhood vector $h^{k}_{N_{\vdash}(v)}$. The result is passed through a feed forward layer followed by a nonlinear activation function $\sigma$, which updates the forward representation of $v$, $h^{k}_{v\vdash}$, to be used in the next iteration.
(5) We update the backward representation of $v$, $h^{k}_{v\dashv}$, using the same procedure as in steps (3) and (4), but this time we use the backward representations rather than the forward representations and aggregate information from $N_{\dashv}(v)$.
(6) We repeat steps (3)-(5) $K$ times, and the concatenation of the final forward and backward representations is used as the final bidirectional representation of $v$.

As mentioned in steps (3) and (5), the aggregation of neighbor representations for node $v$ is performed with one of the following aggregation functions:
(i) Mean aggregator: it computes the element-wise average of the neighbor representations $\{h^{k-1}_{u\vdash}, \forall u \in N_{\vdash}(v)\}$.
(ii) GCN aggregator: it is quite similar to the mean aggregator, except that the result is fed into a fully connected layer followed by a nonlinear activation function [14]:
$$h^{k}_{N_{\vdash}(v)} = \sigma\left(W \cdot \mathrm{MEAN}\left(\{h^{k-1}_{v\vdash}\} \cup \{h^{k-1}_{u\vdash}, \forall u \in N_{\vdash}(v)\}\right)\right),$$
with MEAN as the function returning the element-wise average value and $\sigma$ as the nonlinear activation function.
(iii) Pooling aggregator: each neighbor embedding vector is passed through a feed forward layer followed by a pooling operation (which can be max, min, or average):
$$h^{k}_{N_{\vdash}(v)} = \max\left(\left\{\sigma\left(W_{\mathrm{pool}} h^{k-1}_{u\vdash} + b\right), \forall u \in N_{\vdash}(v)\right\}\right),$$
with max as the element-wise maximum operation and $\sigma$ as the nonlinear activation function.
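To make steps (2)-(5) concrete, the following is a minimal PyTorch sketch of one bidirectional aggregation iteration with a mean aggregator over dense adjacency masks. The way edge features are combined here (simple addition to the neighbor representation) and all names and dimensions are illustrative assumptions rather than the exact model.

```python
import torch
import torch.nn as nn

class BiGraphEncoderLayer(nn.Module):
    """One iteration of bidirectional neighbor aggregation with a mean aggregator
    (illustrative sketch; names and dimensions are placeholders)."""
    def __init__(self, dim=128):
        super().__init__()
        self.fc_fwd = nn.Linear(2 * dim, dim)   # [current repr ; aggregated neighbors]
        self.fc_bwd = nn.Linear(2 * dim, dim)

    def forward(self, h_fwd, h_bwd, edge_feat, fwd_adj, bwd_adj):
        # h_fwd, h_bwd: (num_nodes, dim); edge_feat: (num_nodes, num_nodes, dim)
        # fwd_adj / bwd_adj: (num_nodes, num_nodes) 0/1 adjacency masks
        def aggregate(h, adj):
            # combine each neighbor's representation with the connecting edge feature,
            # then average over the neighbors (mean aggregator)
            msg = h.unsqueeze(0) + edge_feat                  # (n, n, dim)
            deg = adj.sum(-1, keepdim=True).clamp(min=1)
            return (adj.unsqueeze(-1) * msg).sum(1) / deg     # (n, dim)

        h_fwd = torch.relu(self.fc_fwd(torch.cat([h_fwd, aggregate(h_fwd, fwd_adj)], -1)))
        h_bwd = torch.relu(self.fc_bwd(torch.cat([h_bwd, aggregate(h_bwd, bwd_adj)], -1)))
        # concatenating h_fwd and h_bwd after the final iteration gives the node embedding
        return h_fwd, h_bwd
```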

3.1.1. Graph Embedding

Graph embedding $Z$ contains all the information of the graph and is calculated by one of the following two methods:
(i) Pooling based: the node embeddings are passed through a linear transformation, and a pooling operation is then applied over all nodes.
(ii) Adding a super node: a super node $s$ is added and pointed to by all nodes in the graph. Using the algorithm in Section 3.1, the representation of $s$ aggregates the information of all nodes and can therefore be considered the representation of the graph, i.e., the graph embedding.
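For the pooling-based variant, a minimal sketch is as follows; the dimensions and the toy graph size are placeholders.

```python
import torch
import torch.nn as nn

# Pooling-based graph embedding (sketch): project node embeddings, then max-pool.
proj = nn.Linear(256, 256)             # 256 = concatenated forward/backward node repr.
node_repr = torch.randn(17, 256)       # 17 nodes of a toy AMR graph
Z = proj(node_repr).max(dim=0).values  # graph embedding Z, shape (256,)
```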

3.2. Dual Attention Mechanism

The architecture of an AMR-integrated machine translation model is illustrated in Figure 3 with an English input sentence and its corresponding AMR graph. The proposed architecture consists of an encoder for the input sentence and a decoder that takes the encoder output as input. The main difference from the traditional encoder-decoder model is the additional graph encoder, which processes the information on the graph and represents it in vector form. This vector is then combined with the hidden states of the encoder and fed into the decoder to find the corresponding representation in Vietnamese.

We propose a specific integration method for the Seq2Seq model with sequential processing in Section 3.2.1 and focus on models with parallel processing such as ConvSeq2Seq, LightConvSeq2Seq, and Transformer in Section 3.2.2.

3.2.1. Seq2Seq Model with the Sequential Processing Mechanism

The model (Figure 4(a)) consists of two attention mechanisms operating independently: the original attention (left) learns the alignment between the output and the hidden states of the encoder, yielding a context vector $c_t$, and the graph attention learns the alignment between the output and the nodes in the AMR graph, yielding a graph context vector $c^{g}_t$. In particular, $c^{g}_t$ is computed as
$$c^{g}_{t} = \sum_{j} \beta_{t,j} h^{g}_{j}, \qquad \beta_{t,j} = \frac{\exp\left(a\left(s_{t}, h^{g}_{j}\right)\right)}{\sum_{j'} \exp\left(a\left(s_{t}, h^{g}_{j'}\right)\right)},$$
where $h^{g}_{j}$ is the representation of the $j$-th AMR node produced by the graph encoder and $a$ is a feed forward network evaluating the match between the node at position $j$ and the decoder state at step $t$.

These two context vectors are then combined with the decoder's state $s_t$ and the embedding vector of $y_{t-1}$ to calculate a probability distribution that determines $y_t$.
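A minimal sketch of this output layer is shown below; the feature dimensions are placeholders, and the actual model may combine the vectors differently (e.g., through an intermediate projection).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionOutput(nn.Module):
    """Combine the sequence context, the graph context, the decoder state, and the
    previous target embedding into an output distribution (illustrative sketch)."""
    def __init__(self, vocab_size, hidden=512, enc_dim=1024, graph_dim=256, emb_dim=512):
        super().__init__()
        self.out = nn.Linear(hidden + enc_dim + graph_dim + emb_dim, vocab_size)

    def forward(self, s_t, c_t, c_graph_t, y_prev_emb):
        # concatenate all available features for the current decoding step
        features = torch.cat([s_t, c_t, c_graph_t, y_prev_emb], dim=-1)
        return F.log_softmax(self.out(features), dim=-1)  # distribution over y_t
```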

3.2.2. Models with a Parallel Processing Mechanism

On the contrary, with parallel processing, the model has no information about the state of the decoder during encoding. In other words, apart from the node representations, no information about the graph is included in the calculation of attention. Besides, using only the encoder states together with the parallel computation leaves the model with no information about the association between the output and the AMR graph at decoding step $t$. Consequently, the model cannot effectively learn the connection between the input sentence, the output sentence, and the AMR graph, yielding only a small BLEU increase of about 0.2 (in experiments with LightConvSeq2Seq and Transformer). Therefore, the graph embedding $Z$ is used to help the model obtain more information about the graph before the attention calculation. This has been confirmed by the experimental results, which show an increase of the BLEU score by 0.6.

Figures 4(b)-4(d) describe the proposed models that integrate AMR with a dual attention mechanism. Regarding the LightConvSeq2Seq-AMR and Transformer-AMR models, the self-attention mechanism for the graph is similar to the self-attention mechanism described in Section 2.3, with the input being the representations of the graph nodes instead of the encoder states. Regarding the ConvSeq2Seq-AMR model, experimental results show that utilizing Luong's attention mechanism to learn the alignment between the graph and the output produces better results than the multistep attention.
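As an illustration of the parallel case, the sketch below extends a standard Transformer decoder layer with an additional graph-attention sub-block that attends over the AMR node representations; it uses PyTorch's nn.MultiheadAttention for brevity and is a simplified stand-in, not the exact Fairseq implementation.

```python
import torch
import torch.nn as nn

class DecoderLayerWithGraphAttention(nn.Module):
    """Transformer decoder layer extended with a graph-attention sub-block
    (illustrative sketch; names and layer ordering are assumptions)."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.src_attn = nn.MultiheadAttention(d_model, nhead)
        self.graph_attn = nn.MultiheadAttention(d_model, nhead)   # attends to AMR nodes
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, y, enc_out, graph_nodes, tgt_mask=None):
        # y: (tgt_len, batch, d); enc_out: (src_len, batch, d); graph_nodes: (n, batch, d)
        y = self.norms[0](y + self.self_attn(y, y, y, attn_mask=tgt_mask)[0])
        y = self.norms[1](y + self.src_attn(y, enc_out, enc_out)[0])
        y = self.norms[2](y + self.graph_attn(y, graph_nodes, graph_nodes)[0])
        return self.norms[3](y + self.ffn(y))
```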

4. The Corpus

The corpus used to evaluate the models is IWSLT15 [15], whose training set consists of English-Vietnamese bilingual sentence pairs taken from TED Talks presentations. For fine-tuning, we use the set called tst2012, which includes 1553 parallel sentence pairs. The test sets are tst2013 and tst2015, which include 1268 and 1080 English-Vietnamese bilingual pairs, respectively. The statistical information is given in Table 1.

For the preprocessing phase, byte-pair encoding (BPE) (https://github.com/rsennrich/subword-nmt) [16] with 8000 operations is utilized to deal with rare words and compound words for both English and Vietnamese, thereby significantly reducing the vocabulary size in English from 54111 to 5208 and in Vietnamese from 25335 to 3336.

For AMR parsing, we use the NeuralAmr toolkit (https://github.com/sinantie/NeuralAmr) [17], which implements sequence-to-sequence models for the tasks of AMR parsing and AMR generation. Their model achieves a competitive result of 62.1 SMATCH [18], the best score (at the time of doing this work, January 2020) reported without significant use of external semantic resources. This tool produces AMR graphs represented in the PENMAN notation (https://www.isi.edu/natural-language/penman/penman.html) and in a linear form, as demonstrated in the AMR preprocessing example.

5. Experimental Configuration

The models are implemented in Python 3 using the Fairseq library (https://fairseq.readthedocs.io/en/latest/#) [19].

The configuration of the base models is as follows:
(i) Seq2Seq: we investigate the MT model with two types of LSTM, namely, uni-LSTM (unidirectional) and bi-LSTM (bidirectional). The word embedding dimension is 512, and 512 LSTM hidden units are used in both the encoder and the decoder.
(ii) ConvSeq2Seq: it comprises 4 convolutional blocks and 512 hidden units for both the encoder and the decoder. The kernel size is 3.
(iii) LightConvSeq2Seq: it consists of 4 convolutional blocks with kernel sizes of 3, 7, 15, and 31, respectively, for both the encoder and the decoder. Multihead self-attention is adopted.
(iv) Transformer: it stacks identical blocks for both the encoder and the decoder. The word embedding dimension is set to 512 and the feed forward network dimension to 2048. Self-attention is used with 8 heads.

The proposed models have the same configuration as the base models. Besides, the graph encoder uses 128-dimensional embeddings for the representation of both edges and nodes. We stack 2 layers of the graph encoder and aggregate information from neighboring nodes with the mean aggregator for LSTM and with max pooling for the rest of the models.

During training, Adam optimizer [20] is used with a fixed learning rate of 0.001 for LSTM and ConvSeq2Seq, 0.0002 for LightConvSeq2Seq, and 0.0005 for Transformer.

Besides the base models presented above, the results of the proposed models are also compared with the method of Song et al. [10]. To make a fair comparison, we retrained Song's model on the same preprocessed dataset and tuned its hyperparameters.

After the models are trained, the BLEU score [21] is used to evaluate the translation quality. We also apply the bootstrap resampling method [22] to measure the statistical significance of BLEU score differences between the translation outputs of the proposed models and the baselines.
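A minimal sketch of paired bootstrap resampling is given below; bleu_fn stands for any corpus-level BLEU function and is a hypothetical placeholder, not a specific library call.

```python
import random

def paired_bootstrap(sys_a, sys_b, refs, bleu_fn, n_samples=1000, seed=0):
    """Paired bootstrap resampling for BLEU significance (illustrative sketch).

    sys_a, sys_b: lists of hypothesis sentences from two systems;
    refs: list of reference sentences; bleu_fn(hyps, refs) -> corpus BLEU (hypothetical).
    """
    rng = random.Random(seed)
    ids = list(range(len(refs)))
    wins = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]          # resample test set with replacement
        a = bleu_fn([sys_a[i] for i in sample], [refs[i] for i in sample])
        b = bleu_fn([sys_b[i] for i in sample], [refs[i] for i in sample])
        if a > b:
            wins += 1
    return wins / n_samples   # fraction of samples in which system A beats system B
```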

6. Results and Discussion

In this section, we present our experimental results and our analyses of the results.

6.1. Results

Once the models have been trained, a beam search with the size of 5 is utilized to find a translation that maximizes the conditional probabilities.

On both test sets, tst2013 and tst2015, the proposed models prove superior to the corresponding base models. In particular, as given in Table 2, with uni-LSTM-AMR-F and bi-LSTM-AMR, the BLEU scores are 27.21 and 29.29, respectively, which are 1.09 and 3.17 points higher than Song's method [10]. Similarly, on tst2015, bi-LSTM-AMR improves BLEU by 2.83 compared to Song's method. This shows that, despite using the dual attention mechanism, bi-LSTM-AMR and uni-LSTM-AMR can integrate the information from AMR more effectively, thereby producing better translations.

Meanwhile, when LightConvSeq2Seq is run on tst2013 and tst2015, the BLEU scores are only 27.47 and 25.09, respectively. However, when AMR is integrated into the system, the BLEU scores increase significantly by 1.0 and 0.58 on tst2013 and tst2015, respectively. Besides, LightConvSeq2Seq-AMR-F and LightConvSeq2Seq-AMR-B, which integrate graph information from only one direction, also outperform LightConvSeq2Seq, as given in Table 3.

As given in Table 4, ConvSeq2Seq also shows an improvement in translation quality, with an increase in the BLEU score of about 0.3 for ConvSeq2Seq-AMR on tst2013, although there is a BLEU decrease of 0.08 on tst2015. The ConvSeq2Seq-AMR-F model, which integrates information from the forward neighbors only, achieves the best results: an increase of 0.1 BLEU is observed on tst2013 and 0.5 on tst2015. Similarly, for Transformer, integrating information from both the forward and backward neighbors in Transformer-AMR is not effective, with only an increase of 0.09 over the base model on tst2013. Combining information from the forward neighbors only, Transformer-AMR-F achieves noticeable BLEU scores of 28.88 and 26.28 on tst2013 and tst2015, respectively, an increase of 0.28 and 0.52 compared to Transformer.

6.2. The Effect of AMR on the NMT Model

According to the results presented in Section 6.1, the bi-LSTM-AMR and LightConvSeq2Seq-AMR models improve BLEU more than the other two models, ConvSeq2Seq-AMR and Transformer-AMR. Therefore, to analyze the impact of AMR on the machine translation system, the bi-LSTM-AMR and LightConvSeq2Seq-AMR models are selected for further experiments examining graph-related factors such as the information integration directions, the number of graph encoding layers, and the aggregators.

6.2.1. Bi-LSTM-AMR

(i). Direction and Depth. Figure 5 depicts the change in performance when adjusting the number of graph encoding layers; the mean aggregator is used to combine information from neighbors. In general, bi-LSTM-AMR and uni-LSTM-AMR-B show the highest translation quality across the 30 examined layer settings. However, increasing the number of layers does not always help the model achieve a higher BLEU; a decrease in BLEU scores is also observed. The more layers are stacked, the more information the model aggregates, which ultimately leads to overfitting due to saturated information. All models obtain their best results with only 2 or 3 graph encoding layers, and the BLEU scores decrease as the number of layers increases further. Nevertheless, the results are more consistent and fluctuate less with bi-LSTM than with uni-LSTM.

Three aggregators are used for aggregating information from neighboring nodes: the mean aggregator (MA), the max-pooling (MP) aggregator, and the GCN aggregator (GCN-A). The strategy of using information from only one direction (forward or backward) is also considered in order to make more accurate statements about the effect of the aggregator on the effectiveness of the model. The results in Table 5 show that bi-LSTM-AMR-MA achieves the highest results on the two test sets, with BLEU scores of 29.29 and 26.41, respectively. Meanwhile, uni-LSTM-AMR-MA, which uses information from both directions, achieves lower BLEU scores than the variants uni-LSTM-AMR-F and uni-LSTM-AMR-B, which combine information only from the forward and the backward neighbors, respectively. Moreover, bi-LSTM-AMR-MA outperforms bi-LSTM-AMR-F and bi-LSTM-AMR-B owing to its ability to capture information from both directions during node embedding learning and to combine it with information from the bi-LSTM encoder. Therefore, the LSTM decoder can leverage information from the graph more efficiently to improve the machine translation quality. This shows that bidirectional aggregation is more useful when combined with a bidirectional LSTM encoder. Accordingly, uni-LSTM-AMR-F-MP and uni-LSTM-AMR-B-MP, which combine information from only one direction, achieve good results when used with a unidirectional LSTM encoder.

6.2.2. LightConvSeq2Seq-AMR

Similar to bi-LSTM-AMR, the LightConvSeq2Seq-AMR model is also affected by the choice of aggregator. In particular, as given in Table 6, the mean aggregator (MA) yields better results on average than the others. The results on tst2015 show that all three variants with MA achieve much higher results than the remaining models. On the contrary, the GCN-A results are the lowest, as with Seq2Seq. This suggests that the information combination of GCN-A is not as efficient as that of MA and MP.

Figure 6 shows the change in BLEU when stacking convolutional blocks in the encoder and the decoder and the effect of the number of heads in self-attention. In both panels, the BLEU scores increase when the number of heads increases. In particular, the left panel shows that the LightConvSeq2Seq-AMR configuration that stacks 4 convolutional blocks at the encoder and 4 convolutional blocks at the decoder yields the best results: the BLEU scores are approximately 28 and 27.6 with just 1 head and then increase to 28.46 and 28.2 as the number of heads grows. However, with an additional graph encoding layer, this configuration is inferior to another configuration, which yields the highest results with a BLEU of approximately 28.4 and shows a slight decrease as the number of heads approaches 8. Meanwhile, the other configurations tend to decline sharply as the number of heads increases from 1 to 2 and continue to decline slightly thereafter; they show the opposite trend when a graph encoding layer is added, as shown in the right panel of Figure 6. The remaining configuration yields the lowest results for both graph encoding layer options. The results also fluctuate more with 3 graph encoding layers, as opposed to being nearly constant with 2 layers.

7. Conclusions

We proposed a method to integrate the AMR graphs into popular machine translation architectures such as Seq2Seq, ConvSeq2Seq, and Transformer. Structured semantic information from AMR graphs can supplement the context information in the translation model for a better representation of abstract information. Experimental results show that AMR graphs yield better results than other representations such as dependency trees or semantic roles.

For future studies, we plan to examine other methods to integrate more complex semantic graphs, such as Prague Semantic Dependencies, Elementary Dependency Structures, and Universal Conceptual Cognitive Annotation, and investigate different encoding methods suitable for a range of semantic graphs.

Appendix

A. Error Analysis

This section presents some translation errors of the proposed model.

In the first example in Table 7, with bi-LSTM-AMR, the model incorrectly predicts the phrase “and in V Magazine” to be “và V là Magazine.” Although the translation is incorrect, the model still recognizes “V Magazine” as a proper noun and that V is a magazine (“V là Magazine”). Meanwhile, both ConvSeq2Seq-AMR and Transformer-AMR cannot recognize this pattern and omit the word “Magazine” when translating. LightConvSeq2Seq-AMR is the only model that provides a relatively complete translation.

Example 2 in Table 8 illustrates the case in which the model understands the meaning but selects the wrong representation. The English word "internal" is meant to complement the phrase "combustion engine," which together entail the meaning of "động cơ đốt trong." In this case, ConvSeq2Seq-AMR and bi-LSTM-AMR have taken "internal" to mean "inside," as an adjective that modifies the location of the engine, and ignore the word "combustion" when translating into Vietnamese. Meanwhile, LightConvSeq2Seq-AMR and Transformer-AMR show better performance in capturing this information, as they produce accurate translations.

Table 9 describes the case in which the model retains the meaning correctly, but the reference data are incorrect. The word “it” is translated to “những thông tin đó” in the data. This is an inaccurate translation because the word “it” refers to a singular entity, while the translation is in the plural form. Besides, there is only one sentence and no information about the surrounding context, so the results obtained from the proposed models are similar to one another. The Vietnamese word “nó” can be used to refer to previously mentioned things or events. It is thus highly ambiguous, causing difficulty in interpreting even for humans.

B. More Illustrative Results

Table 10 illustrates some sample translations of the models: Song’s method, bi-LSTM (base model), and bi-LSTM-AMR (proposed model).

Data Availability

The datasets used to support the findings of this study are from https://wit3.fbk.eu/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Long H. B. Nguyen and Viet H. Pham contributed equally to this work.

Acknowledgments

This research is funded by the University of Science, VNU-HCM, under grant number CNTT 2020-06.