Abstract

In view of the syntactic differences between Chinese and the Naxi language, this paper presents a tree-to-tree Chinese-Naxi machine translation method based on subtree alignment. In this method, we define a subtree alignment model, give its inference probability, and solve the missing-alignment problem of Chinese-Naxi alignment by updating nodes (insertion or deletion). We then train the subtree alignment model with the EM algorithm and merge the subtree alignments into the translation model. Finally, we extract Chinese-Naxi translation templates with a matrix-based extraction algorithm and implement Chinese-Naxi machine translation. The contrast experiments show that, compared with the Chinese-Naxi machine translation method based on the plain tree-to-tree model, translation accuracy increases after subtree alignment is introduced.

1. Introduction

Naxi characters form the only pictographic writing system still in active use in the world. Work on Chinese-Naxi machine translation can contribute effectively to preserving the Naxi script, as well as to bilingual Chinese-Naxi language learning. However, there are great differences between the Naxi language and Chinese [14]. Naxi has unique features such as verbs appearing at the end of sentences, postpositional attributes, and the absence of auxiliary words, which cause missing alignments in Chinese-Naxi bilingual alignment. For example, the Chinese sentence “我喜欢吃黑雪山的松果” (I like to eat pine cones from Black Snow Mountain) is translated into the Naxi sentence “ (I) (pine cones) (black snow mountain) (like) (to eat).” Because of the nature of the pictographic script, Naxi has no auxiliary words corresponding to the Chinese “过” and “的”, which leads to missing alignments during the alignment procedure and obscures the structure of the original sentence.

Several solutions have been proposed for this problem. Xiao et al. presented a word realignment method for statistical machine translation [5], which improves translation accuracy by exploiting the inconsistency of bidirectional word alignments. Xiao also presented an unsupervised subtree alignment method [6, 7], which extracts more rules than word alignment and works well on the missing-alignment problem. For Chinese-Naxi word alignment, Li, Yu, and others studied dependency-tree-to-string translation models [8, 9] and improved translation accuracy within that framework. Li's Chinese-Naxi machine translation method based on the dependency-tree-to-string model made full use of Chinese syntactic information but paid no attention to Naxi syntactic features. In fact, the great syntactic differences between Chinese and Naxi make syntactic analysis play an important role in bilingual machine translation. Therefore, on the basis of Li's dependency-tree-to-string translation method, Gao et al. proposed a Chinese-Naxi machine translation method based on a Naxi dependency language model, which uses the dependency-parsing information of the Naxi side to build a Naxi dependency language model and fuses it into the decoding process; to a certain degree this improved the accuracy of Chinese-Naxi machine translation [10]. However, that method did not address the missing alignments caused by another characteristic of the Naxi language, namely the absence of auxiliary words.

In view of the aforementioned problem, if we adopt a statistical machine translation method based on the dependency-tree-to-string model, a large number of alignments will be missing because Naxi syntactic features are not considered, and the information carried by the misaligned words will be lost in the translation result. In this paper, we propose a Chinese-Naxi tree-to-tree translation template based on subtree alignment; during template extraction we combine the syntactic features of both languages, extend the translation templates with the misaligned words, and finally realize the translation of the misaligned words in bilingual alignment.

2. A Chinese-Naxi Tree-to-Tree Translation Template Extraction Method Based on Subtree Alignment

2.1. Definition of the Subtree Alignment Model

Given a syntax tree pair (T_c, T_n), where T_c is the Chinese syntax tree and T_n is the Naxi syntax tree, we define the best inference \hat{d} that realizes the subtree alignment; its formal definition is as follows:

    \hat{d} = \arg\max_{d \in D} P(d \mid T_c, T_n)

Here, D is the inference space, that is, the set of candidate inferences over (T_c, T_n).
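For illustration, the selection of the best inference can be sketched in Python as a simple argmax over an already enumerated candidate set (the enumeration itself and the probability function are assumed to exist elsewhere; the names below are hypothetical):

    def best_inference(candidates, inference_prob):
        """Return the inference d with the highest probability P(d | Tc, Tn).

        candidates     : iterable of candidate inferences drawn from the inference space D
        inference_prob : callable that scores one inference, e.g. the model of Section 2.2
        """
        return max(candidates, key=inference_prob)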

2.2. Inference Probability of the Subtree Alignment Model

The inference probability of the subtree alignment model is as follows:

    P(d \mid T_c, T_n) = \prod_{r \in d} P_{nt}(r) \cdot P_{gen}(r) \cdot P_{t}(r) \cdot P_{ord}(r)

Here, there are four factor probabilities: P_{nt} is the mapping probability of nonterminal symbols between Chinese and Naxi; P_{gen} is the generating probability of the subtree extended from a node on the Naxi side; P_{t} is the mapping probability of terminal symbols between Chinese and Naxi; P_{ord} is the rule probability of reordering the frontier nonterminals. The four conditional probabilities are illustrated by the bilingual aligned subtree instance shown in Figure 1.

According to the alignment rules of the Chinese-Naxi subtree-aligned syntax trees in Figure 1, the four conditional probabilities can be expressed as shown in Table 1.
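For illustration, the multiplicative combination of the four factors can be sketched in Python, working in the log domain for numerical stability (the prob_* arguments stand for the trained factor tables and are hypothetical names):

    import math

    FLOOR = 1e-12  # numerical floor to avoid log(0)

    def inference_log_prob(rules, prob_nt, prob_gen, prob_t, prob_ord):
        """Sum of log factor probabilities over all rules used by one inference d."""
        logp = 0.0
        for r in rules:
            logp += math.log(max(prob_nt(r), FLOOR))   # nonterminal mapping (Chinese <-> Naxi)
            logp += math.log(max(prob_gen(r), FLOOR))  # subtree generation on the Naxi side
            logp += math.log(max(prob_t(r), FLOOR))    # terminal (word) mapping
            logp += math.log(max(prob_ord(r), FLOOR))  # reordering of frontier nonterminals
        return logp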

2.3. Node Deletion and Insertion

In the process of Chinese-Naxi subtree alignment, if the subtree on the Naxi language side is empty, the inference probability is changed into

    P(d \mid T_c, \varepsilon)

Here, \varepsilon is a special symbol that represents the empty subtree.

If the Chinese subtree is empty, the inference probability is changed into

    P(d \mid \varepsilon, T_n)

Here, \varepsilon again represents the empty subtree.
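For illustration, one way to fold the empty-subtree symbol into the model is to store \varepsilon as an ordinary entry of the nonterminal mapping table; the following Python sketch rests on that assumption (the table layout and names are hypothetical):

    EPS = "<eps>"  # special symbol standing for the empty subtree

    def score_node_update(symbol, prob_nt, naxi_side_empty):
        """Score a node deletion (Naxi side empty) or insertion (Chinese side empty)."""
        if naxi_side_empty:                         # deletion: Chinese symbol aligned to EPS
            return prob_nt.get((symbol, EPS), 0.0)
        return prob_nt.get((EPS, symbol), 0.0)      # insertion: EPS aligned to Naxi symbol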

2.4. Training the Subtree Alignment Model

The EM algorithm [11] is used to train the model parameters, that is, the four conditional probabilities; the training process of the model \theta is shown in Algorithm 1.

(1) Function TrainModelWithEM(S)  // S: training sequence of bilingual tree pairs
(2)  Set \theta = an initial model
(3)  For i = 1 to N do
(4)   Foreach non-terminal symbol pair (X_c, X_n) do  // for each pair of corresponding non-terminal symbols
(5)    c(X_c, X_n) = 0
E-step: // Compute Expected Count
(6)   Foreach tree pair (T_c, T_n) in sequence S do  // for each sub-tree pair
(7)    Foreach node pair (n_c, n_n) with symbol pair (X_c, X_n) in (T_c, T_n) do  // for each frontier node pair
(8)     Foreach rule r rooted at (n_c, n_n) do  // extract rule r from the sample
(9)      c(X_c, X_n) = c(X_c, X_n) + P(r | T_c, T_n; \theta)
M-step: // Compute Maximum Value
(10)  Foreach non-terminal symbol pair (X_c, X_n) do
(11)   \theta(X_c, X_n) = c(X_c, X_n) / \sum_{X'} c(X_c, X')
(12)  return \theta

Here, X_c and X_n indicate the nonterminal symbols of the source language side and the target language side, respectively; c(X_c, X_n) is the expected count of the pair (X_c, X_n); N is the number of iterations; n_c and n_n denote the tree nodes on the source side and the target side, respectively.
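For illustration, the EM loop of Algorithm 1 can be sketched in Python for the nonterminal mapping probability alone; enumerate_rules is a hypothetical helper that yields (X_c, X_n, posterior) triples for one tree pair under the current model:

    from collections import defaultdict

    def train_with_em(tree_pairs, enumerate_rules, iterations=10):
        """EM training sketch restricted to the nonterminal mapping probability."""
        theta = defaultdict(lambda: 1.0)               # uniform initial model
        for _ in range(iterations):
            counts = defaultdict(float)                # E-step: expected counts
            for tree_c, tree_n in tree_pairs:
                for x_c, x_n, posterior in enumerate_rules(tree_c, tree_n, theta):
                    counts[(x_c, x_n)] += posterior
            totals = defaultdict(float)                # M-step: renormalize per source symbol
            for (x_c, _x_n), c in counts.items():
                totals[x_c] += c
            theta = defaultdict(float, {(x_c, x_n): c / totals[x_c]
                                        for (x_c, x_n), c in counts.items()
                                        if totals[x_c] > 0})
        return theta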

3. A Chinese-Naxi Tree-to-Tree Translation Template Extracting Algorithm Based on Subtree Alignment

Before extracting translation templates, we must perform syntactic analysis on both the source language side (Chinese) and the target language side (Naxi) of the bilingual training corpus to obtain the subtree alignment relation. Then we can carry out the extraction with the matrix-based extraction algorithm [12]. According to the subtree alignment relation, we can obtain the phrase tree pair shown in Figure 1.

The extracting algorithm is shown in Algorithm 2.

(1) Function MatrixExtract(M, T_c, T_n)  // extract rules guided by the alignment matrix M
(2)  Foreach node n_c in T_c do
(3)   Foreach node n_n in T_n do
(4)    If not IsEXTRACTABLE(M, {(n_c, n_n)}) then  // skip node pairs the matrix rules out
(5)     next loop
(6)    Foreach tree-fragment pair (f_c, f_n) rooted at (n_c, n_n) do
(7)     Foreach frontier alignment a between (f_c, f_n) do
(8)      If IsEXTRACTABLE(M, a) then
(9)       r = CREATERULE(f_c, f_n, a)  // create rule r
(10)      rules.ADD(r)
(11)  return rules
(12) Function IsEXTRACTABLE(M, a)
(13)  Foreach alignment (n_c, n_n) in a do
(14)   If probability of (n_c, n_n) in M < \lambda then  // when the probability is less than the threshold \lambda
(15)    return false
(16)  return true

Here, M represents the subtree alignment matrix for the tree pair (T_c, T_n), and \lambda is an empirical threshold that controls how often rules are pruned; in this work it is set to an empirically chosen default value.
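For illustration, the matrix-guided pruning of Algorithm 2 can be sketched in Python; the alignment matrix is represented here as a dictionary mapping node pairs to probabilities, and fragments_rooted_at and create_rule are hypothetical helpers:

    def is_extractable(matrix, alignment, threshold):
        """Return True only if every aligned node pair clears the pruning threshold."""
        return all(matrix.get((nc, nn), 0.0) >= threshold for nc, nn in alignment)

    def matrix_extract(matrix, nodes_c, nodes_n, fragments_rooted_at, create_rule, threshold):
        """Collect translation rules whose frontier alignments survive matrix pruning."""
        rules = []
        for nc in nodes_c:
            for nn in nodes_n:
                if not is_extractable(matrix, [(nc, nn)], threshold):
                    continue                           # skip node pairs the matrix rules out
                for frag_c, frag_n, frontier in fragments_rooted_at(nc, nn):
                    if is_extractable(matrix, frontier, threshold):
                        rules.append(create_rule(frag_c, frag_n, frontier))
        return rules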

4. Experiment and Result Analysis

4.1. Experimental Design

Because research on machine translation for the Naxi language is not as well developed as that for English, no public corpus is available for support. For our study, we collected 35,000 Chinese-Naxi parallel sentences from bilingual elementary-education textbooks and dialogues. After word segmentation, dependency parsing, and bilingual word-alignment annotation, we built a Chinese-Naxi corpus for statistical machine translation, from which we selected 16,000 aligned parallel sentence pairs as the development set and 7,000 sentence pairs as the test set; the experimental corpus is shown in Table 2.

To validate the accuracy of the tree-to-tree Chinese-Naxi translation method based on subtree alignment, we designed contrast experiments over translation methods based, respectively, on the tree-to-string, string-to-tree, tree-to-tree, and tree-to-tree with subtree alignment models [13, 14]. In the experiments, we took NiuTrans (version 1.3.0) as the basic prototype system [7]. On the source side (Chinese), the Chinese syntax trees were generated by the Berkeley Parser, a syntactic analyzer trained on the Penn Treebank [15]; on the target side (Naxi), the Naxi syntax trees were generated by a Naxi syntactic analyzer developed by us. A 3-gram language model was trained with the SRILM toolkit. The subtree alignment model was trained with the EM algorithm. The matrix-based extraction algorithm with subtree alignment extracted the tree-to-tree translation templates, yielding 512 template rules. Minimum Error Rate Training (MERT) was used to train the log-linear model and tune the weight parameters. BLEU4 [16] was selected as the evaluation measure.
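For reference, one common way to compute the BLEU4 measure is NLTK's corpus_bleu with uniform 4-gram weights; the sketch below is only an illustration, not necessarily the evaluation tool used in our experiments, and the token sequences are placeholders rather than real Naxi data:

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    def bleu4(references, hypotheses):
        """references: one list of reference token lists per hypothesis; hypotheses: token lists."""
        smooth = SmoothingFunction().method1
        return corpus_bleu(references, hypotheses,
                           weights=(0.25, 0.25, 0.25, 0.25),
                           smoothing_function=smooth)

    # Toy usage mirroring the example sentence from Section 1.
    refs = [[["I", "pine-cones", "black-snow-mountain", "like", "eat"]]]
    hyps = [["I", "pine-cones", "black-snow-mountain", "like", "eat"]]
    print(bleu4(refs, hyps))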

4.2. Experimental Results and Analysis

The results of comparative experiments are shown in Table 3.

As can be seen from Table 3, the Chinese-Naxi syntax translation system based on the tree-to-tree model with subtree alignment achieves a BLEU4 score 1.5% higher than that of the Chinese-Naxi syntax translation system based on the tree-to-tree model with word alignment and 2.6% higher than that of the system based on the tree-to-string model. The improvement is attributed to the fact that the Chinese-Naxi tree-to-tree translation model with subtree alignment resolves missing alignments between Chinese and Naxi by deleting or inserting nodes. For example, when translating the Chinese sentence “我喜欢吃黑雪山的松果” (I like to eat pine cones from Black Snow Mountain), the proposed method can delete the unaligned word “的” and thus obtains the better result “ (I) (pine cones) (black snow mountain) (like) (to eat).” The experimental results show that the proposed tree-to-tree translation model with subtree alignment improves the accuracy of Chinese-Naxi machine translation. However, there are still errors in our translations. Detailed analysis shows that the main causes are characteristics of the Naxi language, such as verbs at the end of the sentence, postpositional attributes, polysemy (one word with multiple meanings), and synonymy (multiple words with one meaning), as well as the data sparseness brought about by the small corpus.

5. Conclusions

This paper puts forward a Chinese-Naxi syntactic statistical machine translation method based on a tree-to-tree model with subtree alignment. The experimental results show that, compared with the tree-to-tree translation model with word alignment, the tree-to-tree machine translation model with subtree alignment gains 1.5% in BLEU4. In the next step, we will expand the scale of the corpus. At the same time, aiming at the serious structural differences between Chinese and Naxi syntax, we will continue to work on the Chinese-Naxi tree-to-tree machine translation model and attempt to integrate semantic information into the translation model to improve the quality and performance of Chinese-Naxi syntactic statistical machine translation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is supported by National Natural Science Foundation of China (Grant nos. 61163022 and 61163004).