Abstract

With the rapid development of machine translation (MT), MT evaluation has become very important for telling in a timely manner whether an MT system makes any progress. Conventional MT evaluation methods calculate the similarity between hypothesis translations offered by automatic translation systems and reference translations offered by professional translators. There are several weaknesses in existing evaluation metrics. Firstly, the incomplete set of designed factors results in the language-bias problem, which means the metrics perform well on some language pairs but poorly on others. Secondly, they tend to use either no linguistic features or too many of them: using no linguistic features draws criticism from linguists, while using too many makes the model hard to reproduce. Thirdly, the reference translations they employ are expensive and sometimes not available in practice. In this paper, the authors propose an unsupervised MT evaluation metric using a universal part-of-speech tagset without relying on reference translations. The authors also explore the performance of the designed metric on traditional supervised evaluation tasks. Both the supervised and unsupervised experiments show that the designed methods yield higher correlation scores with human judgments.

1. Introduction

Research on machine translation (MT) can be traced back fifty years [1], and with the rapid development of computer technology people have benefited much from it in information exchange. Many MT methods and automatic MT systems have been proposed over the past years [24]. Traditionally, people use human evaluation approaches for the quality estimation of MT systems, such as the adequacy and fluency criteria. However, human evaluation is expensive and time consuming. This leads to the appearance of automatic evaluation metrics, which give quick and cheap evaluation of MT systems. Furthermore, automatic evaluation metrics can be used to tune MT systems for better output quality. The commonly used automatic evaluation metrics include BLEU [5], METEOR [6], TER [7], AMBER [8], and so forth. However, most automatic MT evaluation metrics are reference aware, which means they employ different approaches to calculate the closeness between the hypothesis translations offered by MT systems and the reference translations provided by professional translators. There are some weaknesses in the conventional reference-aware methods: (1) how many reference translations are enough to avoid evaluation bias, given that reference translations usually cannot cover all the reasonable expressions? (2) Reference translations are also expensive and sometimes not available in practice. This paper proposes an automatic evaluation approach for English-to-German translation that calculates the similarity between source and hypothesis translations without using reference translations. Furthermore, the potential usage of the proposed evaluation algorithms in traditional reference-aware MT evaluation tasks is also explored.

2. Traditional MT Evaluations

2.1. BLEU Metric

The commonly used BLEU (bilingual evaluation understudy) metric [5] is designed as an automated substitute for skilled human judges when quick or frequent MT evaluations are needed:
$$\text{BLEU} = BP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),$$
where $p_n$ means the modified $n$-gram precision on a multisentence test set, computed over the entire test corpus. It first computes the $n$-gram matches sentence by sentence and then adds the clipped $n$-gram counts over all the candidate sentences and divides by the number of candidate $n$-grams in the test corpus. The brevity penalty
$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$
penalizes short sentences, where $r$ is the effective reference sentence length (the reference sentence length closest to the candidate sentence) and $c$ is the length of the candidate sentence. Generally, $N$ is selected as 4, and the uniform weight $w_n$ is assigned as $1/N$. Thus we obtain the following deduction:
$$\text{BLEU} = BP \cdot \Big(\prod_{n=1}^{N} p_n\Big)^{1/N}.$$

This shows that BLEU reflects the geometric mean of the $n$-gram precision values multiplied by the brevity penalty. As a contrast and simplified version, Zhang et al. [9] propose a modified BLEU metric using the arithmetic mean of the $n$-gram precisions:
$$\text{BLEU}_{\text{modified}} = BP \cdot \frac{1}{N}\sum_{n=1}^{N} p_n.$$
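As a rough illustration of the formulas above, the following Python sketch computes the clipped n-gram precisions, the brevity penalty, and the geometric-mean BLEU score. It assumes a single reference per sentence and uses our own function names, so it is a simplified sketch rather than the official BLEU implementation.

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights 1/N and one reference per sentence
    (a simplified sketch of the metric described above)."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        matched, total = 0, 0
        for cand, ref in zip(candidates, references):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # clipped n-gram counts, summed over the whole test corpus
            matched += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total += max(len(cand) - n + 1, 0)
        log_precisions.append(log(matched / total) if matched else float("-inf"))
    c = sum(len(cand) for cand in candidates)   # total candidate length
    r = sum(len(ref) for ref in references)     # effective reference length
    bp = 1.0 if c > r else exp(1 - r / c)       # brevity penalty
    return bp * exp(sum(w * p for w, p in zip(weights, log_precisions)))

# toy usage with bigrams only
hyp = [["the", "cat", "sat", "on", "the", "mat"]]
ref = [["the", "cat", "is", "on", "the", "mat"]]
print(round(bleu(hyp, ref, max_n=2), 4))   # about 0.707 for this toy pair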

The weaknesses of the BLEU series are that they rely on incomprehensive factors (precision scores only) and that they use no linguistic features, utilizing only the surface words.

2.2. TER Metric

TER [7] means translation edit rate, which is designed at the sentence level to calculate the amount of work needed to correct the hypothesis translation according to the closest reference translation (assuming there are several reference translations):
$$\text{TER} = \frac{\#\,\text{of edits}}{\text{average}\ \#\,\text{of reference words}}.$$

The edit categories include insertion, deletion, and substitution of single words, and shifts of word chunks. TER uses strict matching of words and word order; for example, miscapitalization is also counted as an edit. The weakness of TER is that it overestimates the actual translation error rate, since it requires exact matching between the reference and the hypothesis sentence. To address this problem, the authors proposed the human-targeted TER (HTER) to take semantic equivalence into account, which is achieved by employing human annotators to generate a new targeted reference. However, HTER is very expensive because it requires around 3 to 7 minutes per sentence for a human to annotate, which means that it is more like a human judgment metric than an automatic one.
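A minimal sketch of the TER idea follows. It implements only word-level insertions, deletions, and substitutions via edit distance and omits the block-shift search of the real metric, so it only approximates TER; the function names are ours.

def word_edit_distance(hyp, ref):
    """Levenshtein distance over words (insertions, deletions, substitutions).
    Real TER additionally allows block shifts found by a greedy search;
    they are omitted here to keep the sketch short."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1   # exact, case-sensitive match
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(hyp)][len(ref)]

def ter_like(hyp, refs):
    """Edits against the closest reference, divided by the average reference length."""
    edits = min(word_edit_distance(hyp, r) for r in refs)
    avg_ref_len = sum(len(r) for r in refs) / len(refs)
    return edits / avg_ref_len

hyp = "the cat sat on mat".split()
refs = ["the cat sat on the mat".split()]
print(ter_like(hyp, refs))   # one insertion needed -> 1/6, about 0.167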

2.3. METEOR Metric

The METEOR [6] metric conducts a more sophisticated matching, considering stems, synonyms, and paraphrases:
$$\text{Penalty} = 0.5 \times \Big(\frac{\#\text{chunks}}{\#\text{unigrams\_matched}}\Big)^{3}, \qquad F_{\text{mean}} = \frac{10PR}{R + 9P}, \qquad \text{METEOR} = F_{\text{mean}} \times (1 - \text{Penalty}),$$
where #chunks means the number of matched chunks between the reference and hypothesis sentence and #unigrams_matched is the number of matched words. It puts more weight on recall $R$ than on precision $P$. The matching process involves computationally expensive word alignment due to the external tools for stemming or synonym matching. An advanced version of METEOR is introduced in [10].

2.4. AMBER Metric

AMBER [8] is declared to be a modified version of BLEU. It attaches more kinds of penalty coefficients, combining the $n$-gram precision and recall with the arithmetic average of the F-measure (the harmonic mean of precision and recall with equal weights). It provides eight kinds of preparations of the corpus, including whether or not the words are tokenized, extracting the stems, prefixes, and suffixes of the words, and splitting the words into several parts with different ratios. An advanced version of AMBER was introduced in [11]. Other related works about traditional reference-aware MT evaluation metrics can be found in the papers [12, 13], our previous works [14, 15], and so forth.

As mentioned previously, the traditional evaluation metrics tend to estimate the quality of automatic MT output by measuring its closeness to the reference translations. To address this problem, some researchers have designed unsupervised MT evaluation approaches that do not use reference translations, which is also called quality estimation of MT. Quality estimation treats the evaluation of translations as a prediction task, with features usually extracted from the source sentences and the target (translated) sentences. For example, Han et al. design an unsupervised MT evaluation for French and English translation using their developed universal phrase tagset [16] and explore the performance of machine learning algorithms, for example, conditional random fields, support vector machines, and naïve Bayes, in the word-level quality estimation task of English-to-Spanish translation without using golden references [17]. Gamon et al. [18] conduct research on reference-free MT evaluation approaches, also at the sentence level, utilizing linear and nonlinear combinations of a language model and an SVM classifier to find badly translated sentences. Using regression learning and a set of indicators of fluency and adequacy as pseudoreferences, Albrecht and Hwa [19] present an unsupervised MT evaluation work at the sentence level. Employing confidence estimation features and a learning mechanism trained on human annotations, Specia and Giménez [20] develop quality estimation models that are biased by the difficulty level of the input segment. The issues between the traditional supervised MT evaluations and the latest unsupervised MT evaluations are discussed in [21]. Using IBM model one and information about morphemes, lexicon probabilities, part-of-speech, and so forth, Popović et al. [22] also introduce an unsupervised evaluation method and show that the most promising setting comes from the IBM-1 scores calculated on morphemes and POS 4-grams. Mehdad et al. [23] use cross-lingual textual entailment to push semantics into MT evaluation without using reference translations, focusing mainly on adequacy estimation. Avramidis [24] performs an automatic sentence-level ranking of multiple machine translations using the features of verbs, nouns, sentences, subordinate clauses, and punctuation occurrences to derive adequacy information. Other related works that introduce unsupervised MT evaluations include [25, 26].

4. Designed Approach

To reduce the reliance on expensive reference translations provided by human labor and on external resources such as synonym lists, this work employs the universal part-of-speech (POS) tagset containing 12 universal tags proposed by [27]. A part of speech is a word class, a lexical class, or a lexical category, that is, a linguistic category of words (or lexical items). It is generally defined by the syntactic or morphological behavior of the lexical item in question.

As a simple example, “there is a big bag” and “there is a large bag” could be the same expression, with “big” and “large” having the same POS, adjective. To exploit this potential, we conduct the evaluation on the POS of the words from the source language and the target language, using the source language as a pseudoreference. We will also test this method by calculating the correlation coefficient of this approach with human judgments in the experiments. Petrov et al. [27] describe that the English Penn Treebank [28] has 45 tags and the German Negra Treebank [29] has 54 tags; however, in the mapping table they offer 49 tags for the German Negra Treebank. This paper runs the Berkeley parser [30] for German (trained on the German Negra Treebank) on a German corpus from the Workshop on Machine Translation (WMT) 2012 [25] and finds that there are indeed other POS tags that are not included in the mapped 49 tags. So, firstly, this paper conducts a complementary mapping for the German Negra POS tagset and extends the mapped POS tags to 57 tags.

4.1. Complementary POS Mapping

The parsing test on the WMT 2012 German corpus shows that the German POS tags missing from the mapping table include “PWAV,” “PROAV,” “PIDAT,” “PWAT,” “PWS,” “PRF,” “$*LRB*,” and “*T*N,” of which “*T*N” is a schematic form (N is replaced in actual parsing output by an integer such as 1, 2, etc.).

This paper classifies the omitted German POS tags according to the English POS tagset classification, since the 45 English Penn Treebank POS tags are completely mapped by the universal POS tagset, as shown in Table 1.
(i) The German POS “PWAV” has a similar function to the English POS “WRB,” which means wh-adverb, labeling German words such as während (while), wobei (where), wann (when), and so forth. The German POS “PROAV” has a similar function to the English POS “RB,” which means adverb, labeling German words such as dadurch (thereby), dabei (there), and so forth. So this paper classifies “PWAV” and “PROAV” into the ADV (adverb) category of the 12 universal POS tags.
(ii) The German POS “PIDAT” has a similar function to the English POS “PDT,” which means predeterminer, labeling German words such as jedem (each), beide (both), meisten (most), and so forth. The German POS “PWAT” has a similar function to the English POS “WDT,” which means wh-determiner, labeling German words such as welche (which), welcher (which), and so forth. So “PIDAT” and “PWAT” are classified into the DET (determiner) category of the universal POS tags.
(iii) The German POS “PWS” has a similar function to the English POS “WP,” which means wh-pronoun, labeling German words such as was (what), wer (who), and so forth. So “PWS” is classified into the PRON (pronoun) category of the universal POS tags.
(iv) The German POS “PRF” has a similar function to the English POS “RP” and “TO,” which mean particle and to, respectively, labeling the German word sich (itself). So this paper classifies “PRF” into the PRT (a less clear case covering particles, possessives, and to) category.
(v) The German POS “*T*N” and “$*LRB*” are classified into the punctuation category, since they are only used to label German punctuation such as dashes, brackets, and so forth.
After all of the complementary mapping, the universal POS tagset alignment for the German Negra Treebank is shown in Table 1, with the added POS tags in boldface.
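For illustration, the complementary mapping above can be stored as a small lookup table. The following Python sketch covers only the added tags (the complete 57-tag mapping is given in Table 1); the dictionary and function names are our own.

# Complementary mapping of the German Negra POS tags missing from the original
# mapping table, following the classification described above.
# "*T*N" stands for the family of tags whose N is an integer in parser output.
COMPLEMENTARY_NEGRA_TO_UNIVERSAL = {
    "PWAV": "ADV",    # wh-adverb, e.g. während (while), wobei (where), wann (when)
    "PROAV": "ADV",   # pronominal adverb, e.g. dadurch (thereby), dabei (there)
    "PIDAT": "DET",   # predeterminer-like, e.g. jedem (each), beide (both), meisten (most)
    "PWAT": "DET",    # wh-determiner, e.g. welche / welcher (which)
    "PWS": "PRON",    # wh-pronoun, e.g. was (what), wer (who)
    "PRF": "PRT",     # reflexive particle, e.g. sich (itself)
    "$*LRB*": ".",    # bracket punctuation
    "*T*N": ".",      # schematic tag family, mapped to the punctuation category
}

def to_universal(negra_tag, base_mapping, extra=COMPLEMENTARY_NEGRA_TO_UNIVERSAL):
    """Look up a Negra tag in the base mapping of [27], falling back to the
    complementary table above; unknown tags map to the universal tag X."""
    return base_mapping.get(negra_tag, extra.get(negra_tag, "X"))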

4.2. Calculation Algorithms

The calculation algorithms designed in this paper are the LEPOR series. First, we introduce the nLEPOR model, an $n$-gram based quality estimation metric for machine translation with the augmented factors of enhanced length penalty, precision, position difference penalty, and recall:
$$\text{nLEPOR} = ELP \times NPosPenal \times HPR,$$
where $ELP$ is the enhanced length penalty, $NPosPenal$ is the $n$-gram position difference penalty, and $HPR$ is the weighted harmonic mean of $n$-gram precision and recall. We introduce these subfactors step by step below.

In the formula, $ELP$ means the enhanced sentence length penalty, which is designed to penalize translated sentences (hypotheses) that are either shorter or longer than the source sentence. This approach differs from the BLEU metric, which only penalizes sentences shorter than the human reference translation. It is defined as
$$ELP = \begin{cases} \exp\big(1 - \tfrac{s}{c}\big) & \text{if } c < s \\ 1 & \text{if } c = s \\ \exp\big(1 - \tfrac{c}{s}\big) & \text{if } c > s, \end{cases}$$
where the parameters $c$ and $s$ specify the length of the candidate sentence (hypothesis) and the source sentence, respectively.
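A small Python sketch of this length penalty, as we read it from the definition above, is given below; the function name is ours.

from math import exp

def enhanced_length_penalty(c, s):
    """Enhanced length penalty ELP for a hypothesis of length c against a
    (pseudo-)reference of length s: 1 when the lengths match, and an
    exponential penalty when the hypothesis is shorter or longer."""
    if c == s:
        return 1.0
    if c < s:
        return exp(1 - s / c)   # shorter hypothesis
    return exp(1 - c / s)       # longer hypothesis

print(enhanced_length_penalty(6, 6))             # 1.0
print(round(enhanced_length_penalty(4, 6), 3))   # exp(1 - 6/4), about 0.607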

The variable $NPosPenal$ means the $n$-gram position difference penalty, which is designed to account for the different order of the successfully matched POS tags in the source and hypothesis sentences. The position difference factor has been shown to be helpful for MT evaluation in the research work of [13]. The alignment direction is from the hypothesis to the source sentence, with the algorithm shown in Algorithm 1. This paper employs the $n$-gram method in the matching process, which means that a potential POS candidate is assigned higher priority if it has a neighbor match. The nearest match is accepted as a backup choice if both candidates have neighbor matches or if there is no other matched POS around the potential pairs. In Algorithm 1, $p_i$ represents the current universal POS in the hypothesis sentence, and $p_{i \pm k}$ means the POS $k$ positions before or after it; the same notation is used for the source sentence. The penalty is computed as
$$NPosPenal = \exp(-NPD), \qquad NPD = \frac{1}{Length_{hyp}} \sum_{i=1}^{Length_{hyp}} |PD_i|.$$

Hypothesis Sentence (POS): h_1 h_2 ... h_m
Source Sentence (POS): s_1 s_2 ... s_n
∀ h_i, the alignment of POS h_i in the hypothesis:
if ¬∃ s_j = h_i : // ∀ means for each, ∃ means there is/are, ¬ means not
   h_i → Ø ; // → shows the alignment; Ø means h_i stays unaligned
elseif ∃! s_j = h_i : // ∃! means there exists exactly one
   h_i → s_j ;
elseif ∃ s_j = h_i ∧ ∃ s_k = h_i, j ≠ k : // ∧ is logical conjunction, and
   foreach candidate s_j
    foreach neighbor offset 1, ..., n−1
     if s_j has a neighbor match (s_{j±k} = h_{i±k}) :
      if the other candidate also has a neighbor match
           h_i → the nearer candidate ;
      else
           h_i → s_j ;
     elseif the other candidate has a neighbor match :
       h_i → the other candidate ;
     else  // i.e. neither candidate has a neighbor match:
       if s_j is the nearer candidate
         h_i → s_j ;
       else
         h_i → the other candidate ;
else // when there are more than two candidates, the selection steps are similar to the above

The parameter $Length_{hyp}$ means the length of the hypothesis sentence, and $MatchPosN_{hyp}$ and $MatchPosN_{src}$ are the positions of a matched POS in the hypothesis and source sentence, respectively, giving
$$PD_i = \left|\frac{MatchPosN_{hyp}}{Length_{hyp}} - \frac{MatchPosN_{src}}{Length_{src}}\right|.$$
See Figure 1 for an example.
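The following Python sketch illustrates, under our reading of the definitions above, how the position difference penalty can be computed. It uses a simplified nearest-match alignment instead of the full n-gram neighbor logic of Algorithm 1, so it only approximates the described procedure; all names are ours.

from math import exp

def align_positions(hyp_pos, src_pos):
    """Simplified hypothesis-to-source POS alignment: each hypothesis tag is
    aligned to the unused source candidate nearest in relative position.
    The n-gram neighbor preference of Algorithm 1 is omitted here."""
    alignment = {}   # hypothesis index -> source index (1-based)
    used = set()
    for i, tag in enumerate(hyp_pos, start=1):
        candidates = [j for j, t in enumerate(src_pos, start=1)
                      if t == tag and j not in used]
        if not candidates:
            continue
        j = min(candidates,
                key=lambda j: abs(i / len(hyp_pos) - j / len(src_pos)))
        used.add(j)
        alignment[i] = j
    return alignment

def npos_penal(hyp_pos, src_pos):
    """n-gram position difference penalty NPosPenal = exp(-NPD)."""
    alignment = align_positions(hyp_pos, src_pos)
    npd = sum(abs(i / len(hyp_pos) - j / len(src_pos))
              for i, j in alignment.items()) / len(hyp_pos)
    return exp(-npd)

# POS sequences of the worked example discussed with Figures 1 and 2
hyp = ["PRON", "NOUN", "NOUN", "VERB", "NUM"]
src = ["PRON", "NOUN", "VERB", "VERB", "NUM", "NOUN"]
print(round(npos_penal(hyp, src), 3))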

The parameter $w_n$ is designed to adjust the weights of the different $n$-gram performances, such as unigram, bigram, trigram, four-gram, and so forth, which differs from the weight assignment in BLEU, where each weight equals $1/N$. In our model, a higher weight is assigned to the higher-level $n$-grams.

The factor $HPR$ is the weighted harmonic mean of the $n$-gram precision $P_n$ and the $n$-gram recall $R_n$:
$$HPR = \frac{\alpha + \beta}{\frac{\alpha}{R_n} + \frac{\beta}{P_n}}, \qquad P_n = \frac{\#matched\_ngram}{\#ngram_{hyp}}, \qquad R_n = \frac{\#matched\_ngram}{\#ngram_{src}},$$
where $\#matched\_ngram$ represents the number of matched $n$-gram chunks and $\#ngram_{hyp}$ and $\#ngram_{src}$ are the numbers of $n$-gram chunks in the hypothesis and source sentence, respectively. The $n$-gram precision (and recall) is calculated at the sentence level, not at the corpus level used in BLEU. Let us look at the example in Figure 1 again for the explanation of the bigram precision $P_2$ and bigram recall $R_2$. The number of bigram chunks in the hypothesis is 4 (PRON NOUN, NOUN NOUN, NOUN VERB, and VERB NUM), the number of bigram chunks in the source is 5 (PRON NOUN, NOUN VERB, VERB VERB, VERB NUM, and NUM NOUN), and the number of matched bigram chunks is 3 (PRON NOUN, NOUN VERB, and VERB NUM), as shown in Figure 2. So the values of $P_2$ and $R_2$ equal 3/4 and 3/5, respectively.
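The bigram example above can be reproduced with the short sketch below; the weighted harmonic mean follows the α/β weighting described in the experiments, and the function names are ours.

from collections import Counter

def ngram_chunks(tags, n):
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def ngram_precision_recall(hyp_pos, src_pos, n):
    """Sentence-level n-gram precision and recall over POS chunks."""
    hyp_counts = Counter(ngram_chunks(hyp_pos, n))
    src_counts = Counter(ngram_chunks(src_pos, n))
    matched = sum(min(c, src_counts[g]) for g, c in hyp_counts.items())
    precision = matched / max(sum(hyp_counts.values()), 1)
    recall = matched / max(sum(src_counts.values()), 1)
    return precision, recall

def weighted_harmonic(precision, recall, alpha=1.0, beta=9.0):
    """Weighted harmonic mean of recall (weight alpha) and precision (weight beta)."""
    if precision == 0 or recall == 0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)

hyp = ["PRON", "NOUN", "NOUN", "VERB", "NUM"]
src = ["PRON", "NOUN", "VERB", "VERB", "NUM", "NOUN"]
p2, r2 = ngram_precision_recall(hyp, src, 2)
print(p2, r2)                               # 0.75 and 0.6, as in the worked example
print(round(weighted_harmonic(p2, r2), 3))  # weighted harmonic mean of the two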

4.3. System-Level Metric

We design two approaches for the document-level calculation of the proposed algorithms, described below.

In the first approach, the document-level score is calculated as the arithmetic mean of the sentence-level scores in the document. In the second approach, the document-level score is calculated as the product of the three document-level factors, where each document-level factor value is the arithmetic mean of the corresponding sentence-level factor values.
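A minimal sketch of the two document-level strategies, with hypothetical per-sentence factor values, could look as follows; the function names are ours.

def doc_score_mean_of_sentences(sentence_scores):
    """First approach: arithmetic mean of the sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)

def doc_score_product_of_factor_means(elp, npos_penal, hpr):
    """Second approach: product of the three document-level factors, where each
    factor is the arithmetic mean of its sentence-level values."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(elp) * mean(npos_penal) * mean(hpr)

# hypothetical per-sentence factor values for a 3-sentence document
elp = [1.0, 0.85, 0.95]
npos = [0.90, 0.80, 0.88]
hpr = [0.70, 0.60, 0.75]
sent_scores = [e * n * h for e, n, h in zip(elp, npos, hpr)]
print(round(doc_score_mean_of_sentences(sent_scores), 3))
print(round(doc_score_product_of_factor_means(elp, npos, hpr), 3))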

5. Usage in Traditional Evaluation

We introduce the potential usage of the proposed evaluation algorithms in traditional supervised (reference-aware) MT evaluation methods. In the unsupervised design, the metric is measured on the source and target POS sequences instead of the surface words. In exploring its usage in traditional supervised MT evaluation, we measure the score on the target translations (system outputs) and the reference translations, that is, on the surface words. The pseudoreference (source language) is replaced with a real reference here. The performance of the simplified variant will be tested using the reference translations.

6. Evaluating the Evaluation Method

The conventional method to evaluate the quality of different automatic MT evaluation metrics is to calculate their correlation scores with human judgments. The Spearman rank correlation score and the Pearson correlation score are commonly used by the annual Workshop on Statistical Machine Translation (WMT) of the Association for Computational Linguistics (ACL) [25, 31, 32]. Assuming that $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$ are two rank sequences and $n$ is the number of variables, when there are no ties, the Spearman rank correlation coefficient is calculated as
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},$$
where $d_i$ is the difference between the ranks of the two coordinate variables $x_i$ and $y_i$.

Secondly, the Pearson correlation coefficient is introduced as below. Given a sample of paired data $(x_i, y_i)$, $i = 1$ to $n$, the Pearson correlation coefficient is
$$r = \frac{\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \mu_y)^2}},$$
where $\mu_x$ and $\mu_y$ specify the arithmetic means of the discrete random variables $X$ and $Y$, respectively.
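As an illustration, the following sketch computes the Spearman coefficient directly from the formula above (for the no-ties case) and checks it against scipy; the toy score vectors are hypothetical.

from scipy.stats import spearmanr

def spearman_no_ties(x, y):
    """Spearman rho from the formula above: 1 - 6*sum(d_i^2) / (n*(n^2 - 1)),
    valid when there are no tied values."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical metric scores and human scores for five MT systems
metric = [0.31, 0.27, 0.35, 0.22, 0.30]
human = [0.62, 0.58, 0.70, 0.41, 0.55]
print(spearman_no_ties(metric, human))        # 0.9
print(spearmanr(metric, human).correlation)   # same value from scipy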

7. Experiments

7.1. Unsupervised Performances

In the unsupervised MT evaluation, this paper uses the English-to-German machine translation corpora (produced by around twenty English-to-German MT systems) from ACL-SIGMT (http://www.sigmt.org/), which makes the annual workshop corpora publicly available for further research purposes. Each document contains 3,003 sentences of source English or translated German. To avoid the overfitting problem, the WMT2011 (http://www.statmt.org/wmt11/) corpora are used as training data and the WMT2012 (http://www.statmt.org/wmt12/) corpora are used for testing. This paper conducts the experiments on the simplified version (unigram precision and recall) of the metric.

Table 2 shows the system-level Spearman rank correlation scores of nLEPOR with human judgments trained on the WMT2011 data, as compared to several state-of-the-art reference-aware automatic evaluation metrics including BLEU, METEOR, and AMBER.

In the training period, the parameter values of $\alpha$ (weight on recall) and $\beta$ (weight on precision) are tuned to 1 and 9, respectively, which differs from the reference-aware metric METEOR (which puts more weight on recall). Bigram matching is selected for the $n$-gram universal POS alignment. The correlation scores in Table 2 show that the proposed evaluation approaches achieve apparently higher scores (0.63 and 0.60, resp.) in the training period than the other evaluation metrics (0.53, 0.44, and 0.30, resp.), even though the compared metrics are reference aware.

The testing results of the proposed evaluation approaches on the WMT2012 corpora are shown in Table 3, with the same parameter values obtained in training, again compared with the state-of-the-art evaluation metrics. The correlation scores in Table 3 show the same ranking as Table 2: METEOR achieves the lowest score (0.18), AMBER (0.25) achieves a score higher than BLEU (0.22), and the nLEPOR family again yields the highest correlation coefficients (0.34 and 0.33, resp.) with human judgments. The test results show the robustness of the proposed evaluation approaches. The experiments also show that the more recently proposed metrics (e.g., AMBER) achieve higher correlation scores than the earlier ones (e.g., BLEU).

7.2. Supervised Performances

As mentioned previously, to explore the performance of the designed algorithm in the traditional reference-aware MT evaluation track, we also use the WMT11 corpora as training data and the WMT12 corpora as testing data to avoid the overfitting phenomenon. The number of participating MT systems that offer the output translations is shown in Table 5 for each language pair. In the training period, the tuned values of $\alpha$ and $\beta$ are 9 and 1, respectively, for all language pairs except CS-EN (CS: Czech, DE: German, ES: Spanish, FR: French). The training results on the eight WMT11 corpora, including English-to-other and other-to-English, are shown in Table 4. The aim of the training stage is to achieve higher correlation with human judgments. The experiments on the WMT11 corpora show that the proposed metric yields the best correlation scores on the language pairs CS-EN, ES-EN, EN-CS, and EN-ES, which contributes to the highest average score 0.77 on all eight language pairs. The testing results on the WMT12 corpora show that the proposed metric yields the highest correlation scores with human judgments on the CS-EN (0.89) and EN-ES (0.45) language pairs and the highest average correlation scores on other-to-English (0.85), English-to-other (0.58), and all eight corpora (0.71) (Table 6).

7.3. Enhanced Model and the Performances

In the previous sections, we have introduced the $n$-gram based metric nLEPOR and its performance in both supervised and unsupervised settings. In this section we introduce an enhanced version of the proposed metric, called hLEPOR (harmonic mean of enhanced length penalty, precision, $n$-gram position difference penalty, and recall). There are two contributions of this enhanced model. Firstly, it assigns different weights to the three subfactors, and these weights are tunable for different language pairs. This property helps to address the language-bias problem existing in many current automatic evaluation metrics. Secondly, it is designed to combine the performance on words and on POS together, and the final score is the combination of the two:
$$hLEPOR_{final} = \frac{w_{word} \cdot hLEPOR_{word} + w_{POS} \cdot hLEPOR_{POS}}{w_{word} + w_{POS}},$$
where each component score is the weighted harmonic mean of the three subfactors,
$$hLEPOR = \frac{w_{ELP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{ELP}}{ELP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}},$$
and $HPR$ is the harmonic mean of precision and recall as mentioned above.

Firstly, we calculate the hLEPOR score on the surface words, $hLEPOR_{word}$, that is, the closeness of the hypothesis translation and the reference translation. Then we calculate the hLEPOR score on the extracted POS sequences, $hLEPOR_{POS}$, that is, the closeness of the corresponding POS tags between the hypothesis sentence and the reference sentence. The final score is the combination of the two subscores $hLEPOR_{word}$ and $hLEPOR_{POS}$.
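A sketch of how this combination can be computed is given below; the weighted harmonic mean of the three subfactors and the weighted word/POS combination follow our reading of the description above, and all weight names and values are placeholders rather than the tuned values of Table 7.

def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean of the subfactors (ELP, NPosPenal, HPR)."""
    return sum(weights) / sum(w / v for w, v in zip(weights, values))

def hlepor(elp, npos_penal, hpr, weights=(1.0, 1.0, 1.0)):
    """hLEPOR score of one sentence from its three subfactor values.
    The factor weights are tunable per language pair; equal weights are
    used here only as a placeholder."""
    return weighted_harmonic_mean((elp, npos_penal, hpr), weights)

def hlepor_final(score_word, score_pos, w_word=1.0, w_pos=1.0):
    """Final score: weighted combination of the word-level and POS-level scores."""
    return (w_word * score_word + w_pos * score_pos) / (w_word + w_pos)

# hypothetical subfactor values on surface words and on POS sequences
word_score = hlepor(elp=0.95, npos_penal=0.88, hpr=0.72)
pos_score = hlepor(elp=0.95, npos_penal=0.91, hpr=0.80)
print(round(hlepor_final(word_score, pos_score), 3))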

We introduce the performance of nLEPOR and hLEPOR in the WMT13 (http://www.statmt.org/wmt13/) shared evaluation tasks below. Both metrics are trained on the WMT11 corpora. The tuned parameters of hLEPOR are shown in Table 7, using default values for EN-RU and RU-EN. The number of MT systems for each language pair in WMT13 is shown in Table 8. There is a new language, Russian, in WMT13, which increases the total number of corpora to ten. Because there was no Russian in the past WMT shared tasks, we assign the default parameter values in nLEPOR for the EN-RU and RU-EN corpora.

In the WMT13 shared tasks, both Pearson correlation coefficient and the Spearman rank correlation coefficient are used as the evaluation criteria. So we list the official results in Tables 9 and 10, respectively, by using Spearman rank correlation and Pearson correlation criteria.

Using the Spearman rank correlation coefficient, the experiment results on WMT13 in Table 9 show that one of our submitted metrics yields the highest correlation scores on EN-DE (0.90) and EN-RU (0.85) and the highest average correlation score (0.85) on the five English-to-other corpora; the other yields the highest correlation scores on EN-DE (0.90), EN-ES (0.85), and EN-FR (0.92) and the second highest average correlation score (0.84) on the five English-to-other corpora.

Using the Pearson correlation coefficient, the experiment results on WMT13 in Table 10 show that one of our submitted metrics yields the highest correlation scores on EN-DE (0.94), EN-ES (0.91), and EN-RU (0.77) and the highest average correlation score (0.86) on the five English-to-other corpora; the other yields the highest correlation scores on EN-ES (0.82) and EN-FR (0.92) and the second highest average correlation score (0.85) on the five English-to-other corpora.

On the other hand, METEOR yields the best performance in the other-to-English translation direction. This is because METEOR employs many external resources, including stemming, a synonym vocabulary, paraphrasing resources, and so forth. However, to keep the evaluation model concise, our method nLEPOR only uses the surface words, and hLEPOR only uses the combination of surface words and POS sequences. This shows one advantage of our designed methods, that is, the use of concise linguistic features.

As mentioned previously, the language-bias problem is apparent in the evaluation results of the BLEU metric. BLEU yields the highest Spearman rank correlation score, 0.99, on FR-EN; however, it never achieves the highest average correlation score in any translation direction, due to its very low correlation scores on the RU-EN, EN-DE, EN-ES, and EN-RU corpora. Our LEPOR series metrics address the language-bias problem by using augmented factors and tunable parameters.

The evaluation results in Tables 9 and 10 show that the language pairs with English as the source language (English-to-other) are the main challenge for system-level performance. Fortunately, the evaluation methods designed in this paper have made some contributions in this translation evaluation direction.

8. Discussion

The Spearman rank correlation coefficient is commonly used in the WMT shared tasks as the special case of the Pearson correlation coefficient applied to ranks. However, some information is lost by using the Spearman rank correlation coefficient instead of the Pearson correlation coefficient. For example, let us assume there are three automatic MT systems $M_1$, $M_2$, and $M_3$ and two automatic MT evaluation systems $E_1$ and $E_2$. The evaluation scores of the two evaluation systems on the three MT systems are $(0.90, 0.35, 0.40)$ and $(0.46, 0.35, 0.42)$, respectively. Using the Spearman rank correlation coefficient, the two vectors are first converted into the rank vectors $(1, 3, 2)$ and $(1, 3, 2)$, respectively, and then their correlation with the human rank results (human judgments) is measured. Thus, the two evaluation metrics will yield the same Spearman rank correlation score with human judgments no matter what the manual evaluation results are. However, the evaluation systems $E_1$ and $E_2$ actually tell different stories about the quality of the three MT systems. The evaluation system $E_1$ gives a very high score, 0.90, to system $M_1$ and similar low scores, 0.35 and 0.40, respectively, to the $M_2$ and $M_3$ systems. On the other hand, the evaluation system $E_2$ yields similar scores, 0.46, 0.35, and 0.42, respectively, on the three MT systems, which means that all three automatic MT systems have low translation quality. Using the Spearman rank correlation coefficient, this important information is lost.

On the other hand, the Pearson correlation coefficient uses the absolute scores yielded by the automatic MT evaluation systems, as shown in the formula above, without first converting them into rank values.
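The following short check reproduces the three-system example from the discussion; the human judgment scores are hypothetical, since the text leaves them unspecified.

from scipy.stats import spearmanr, pearsonr

# scores of the two evaluation systems on the three MT systems (from the example above)
e1 = [0.90, 0.35, 0.40]
e2 = [0.46, 0.35, 0.42]
# hypothetical human judgment scores for the same three systems
human = [0.80, 0.45, 0.50]

# both metrics reduce to the same rank vector, so Spearman cannot tell them apart
print(spearmanr(e1, human).correlation, spearmanr(e2, human).correlation)
# Pearson uses the absolute scores and therefore distinguishes the two metrics
print(round(pearsonr(e1, human)[0], 3), round(pearsonr(e2, human)[0], 3))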

9. Conclusions and Future Works

To avoid the use of expensive reference translations, this paper designs a novel unsupervised MT evaluation model for English-to-German translation by employing augmented factors and the universal POS tagset. The proposed unsupervised model yields higher correlation scores with human judgments than the reference-aware metrics METEOR, BLEU, and AMBER.

The application of the designed algorithms to traditional supervised evaluation tasks is also explored. To address the language-bias problem present in most existing metrics, tunable parameters are assigned to the different subfactors. The experiments on the WMT11 and WMT12 corpora show that our designed algorithms yield the highest average correlation score on eight language pairs compared with the state-of-the-art reference-aware metrics METEOR, BLEU, and TER.

On the other hand, to address the linguistic-extreme problem (no linguistic information or too many linguistic features), our method utilizes an optimized linguistic feature, the POS sequence, in addition to the surface words, to keep the model concise and easy to reproduce.

Last, this paper also makes a contribution to the complementary POS tagset mapping between German and English in the light of 12 universal tags.

The developed algorithms in this paper are freely available for research purposes (https://github.com/aaronlifenghan/aaron-project-lepor). In future work, to test the robustness of the designed algorithms and models, we will seek more language pairs, such as the Asian languages Chinese, Korean, and Japanese, to conduct the experiments, in addition to the official European languages offered by the SIGMT association. Secondly, experiments using multiple references will be considered. Thirdly, how to handle MT evaluation from the aspect of semantic similarity will be further explored.

Appendix

This appendix lists the POS tags and their occurrence counts in one of the WMT2012 German documents, containing 3,003 sentences and parsed by the Berkeley parser for German, which is trained on the German Negra Treebank.

The number of different POS tags in the document is 55. The detailed POS labels (boldface POS tags are not covered in the original mapping) and their frequencies are given in Table 11.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors are grateful to the Science and Technology Development Fund of Macau and the Research Committee of the University of Macau for the funding support for their research, under Reference nos. MYRG076(Y1-L2)-FST13-WF and MYRG070(Y1-L2)-FST12-CS.