Advances in Artificial Intelligence

Volume 2015, Article ID 927063, 10 pages

http://dx.doi.org/10.1155/2015/927063

## A Dirichlet Process Mixture Based Name Origin Clustering and Alignment Model for Transliteration

School of Computer Science and Technology, Harbin Institute of Technology, No. 92, West Dazhi St., Harbin 150001, China

Received 25 November 2014; Revised 28 April 2015; Accepted 16 June 2015

Academic Editor: Panayiotis G. Georgiou

Copyright © 2015 Chunyue Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In machine transliteration, the transliterated names in the target language commonly come from multiple language origins. A conventional single model based on maximum likelihood cannot handle this issue well and often suffers from overfitting. In this paper, we exploit a coupled Dirichlet process mixture model (cDPMM) to address the overfitting and multiorigin name clustering issues simultaneously in the transliteration sequence alignment step over the name pairs. After the alignment step, the cDPMM automatically clusters name pairs into groups according to their origin information. In the decoding step, to make full use of the learned origin information, we use a cluster combination method (CCM) to build cluster-specific transliteration models by combining small clusters into large ones based on the perplexities of the name language models and transliteration models, which ensures that each origin cluster has enough data for training a transliteration model. On three different Western-Chinese multiorigin name corpora, the cDPMM outperforms two state-of-the-art baseline models in terms of both top-1 accuracy and mean *F*-score, and the CCM further improves the cDPMM significantly.

#### 1. Introduction

Machine transliteration translates a word or phrase from one writing system to another based on its pronunciation, and it is a major way of importing foreign words such as proper nouns and technical terms into the target language. It is also a major method for translating out-of-vocabulary words, which commonly consist of person names, place names, company names, and product names. Machine transliteration is an essential task for many NLP applications such as automated question answering (QA), machine translation (MT), cross-language information retrieval (CLIR), and information extraction (IE).

Machine transliteration emerged many years ago as a part of machine translation. Early transliteration methods were often based on the phonetics of the source and target languages, but spelling-based statistical approaches now dominate the field. Because spelling-based methods align characters in the training corpus directly from corpus statistics and do not require language-specific phonetic knowledge, they are language independent and capable of achieving state-of-the-art performance [1]. However, there are still several common challenges [2] in spelling-based machine transliteration, such as script specifications, language of origin, missing sounds, and overfitting. In this paper, we concentrate on the language of origin and overfitting problems.

The language of origin problem, also known as multiorigin identification, can affect both the alignment step and the decoding step of statistical transliteration. In Western-Chinese machine transliteration, the bilingual training corpus usually contains name pairs that originate from more than one language. Names coming from different languages follow their own transliteration pronunciation rules. For example, “kim Jong-il/金正日” (Korean), “kana Gaski/金崎” (Japanese), “haw king/霍金” (English), “jin yong/金庸” (Chinese), “mundinger/蒙丁格” (English), “ding guo/丁果” (Chinese).

The same Chinese character “金” should be aligned to different romanized character sequences: “金: kim,” “金: kana,” “金: king,” and “金: jin.” In this case, a transliteration model that does not take name origin into consideration cannot easily learn the right alignment for “金,” and the same difficulty arises for the character “丁.”

Overfitting is another problem, one that affects many previous alignment models used in transliteration, such as GIZA++ [3], M2M-aligner [4], and HMM [5]. These alignment models are optimized with expectation maximization and try to fit the training data by maximum likelihood (ML) estimation. In practice the training data usually has deficiencies: it contains noise and is not large enough. As a result, the training data often does not reflect the real data distribution. When an ML model is trained on this kind of dataset, it often suffers from overfitting.

In this paper, we exploit a simple and fully unsupervised model that addresses the two problems above by clustering and aligning simultaneously. The coupled Dirichlet process mixture model (cDPMM) [6] integrates a Dirichlet process mixture model for clustering name pairs with a set of Bayesian bilingual alignment models (BBAM) [7], one for each cluster. The clustering and alignment models complement one another: the clustering model groups the training data into the right origin clusters so that self-consistent alignment models can be built on data of the same type, and at the same time the alignment probabilities from the alignment models direct the clustering process. Furthermore, based on the clustering and alignment results, we propose a cluster combination method (CCM) to build a cluster-specific transliteration model and language model for each cluster. In the decoding step, given a source name, we classify it into the most similar cluster based on the language model and transliterate it with the cluster-specific transliteration model. In this way, we use the origin information to direct both the alignment and decoding steps and obtain significant improvements.
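The decoding-time routing described above (assign a source name to the most similar cluster, then use that cluster's transliteration model) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the character bigram language model with add-one smoothing, the class name `CharBigramLM`, the helper `pick_cluster`, and the toy cluster contents are hypothetical and are not taken from the paper's implementation.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, names):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = {"</s>"}
        for name in names:
            chars = ["<s>"] + list(name) + ["</s>"]
            self.vocab.update(chars[1:])
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def perplexity(self, name):
        chars = ["<s>"] + list(name) + ["</s>"]
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        log_prob = 0.0
        for prev, cur in zip(chars, chars[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + v)
            log_prob += math.log(p)
        # Per-symbol perplexity over the characters plus the end marker.
        return math.exp(-log_prob / (len(chars) - 1))


def pick_cluster(name, cluster_lms):
    """Return the ID of the cluster whose language model gives the lowest perplexity."""
    return min(cluster_lms, key=lambda c: cluster_lms[c].perplexity(name))


# Toy clusters; real clusters would come from the clustering step.
cluster_lms = {
    "zh": CharBigramLM(["jinyong", "dingguo", "zhangwei"]),
    "en": CharBigramLM(["hawking", "mundinger", "smith"]),
}
print(pick_cluster("hawking", cluster_lms))
```

Once a cluster is chosen, the source name would be passed to the transliteration model trained on that cluster's data; that model is outside the scope of this sketch.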

Our major contributions in this paper are summarized as follows:

(i) We exploit a Dirichlet process mixture model for clustering name pairs, which is fully unsupervised and does not require the number of clusters to be set beforehand. It discovers an appropriate number of clusters from the data automatically.

(ii) We apply the Bayesian segmentation alignment model in the alignment step over each cluster, which allows many-to-many monotonic alignment and can overcome overfitting.

(iii) The clustering and alignment models are coupled in a unified model, and they work synergistically to support each other.

(iv) By combining small clusters into large ones based on the perplexities of name language models and transliteration models, we build combined cluster-specific transliteration models and language models and can transliterate each source name based on its origin.

(v) We conduct experiments on three Western-to-Chinese multiorigin datasets, and the results show that our cDPMM and CCM methods are competitive in terms of accuracy and mean *F*-score compared to other methods.

The rest of the paper is organized as follows. We introduce related work in Section 2 and describe our coupled Dirichlet process mixture model in detail in Section 3. In Section 4 we introduce the cluster combination method, which obtains enough training data for building cluster-specific transliteration models and language models. Section 5 provides the experimental setup, results, and analysis. Finally, we conclude the paper and discuss future work.

#### 2. Related Work

Machine transliteration methods can be categorized into phonetic-based models [8], spelling-based models [9], and hybrid models that utilize both phonetic and spelling information [10, 11]. Among them, spelling-based models, which directly align characters in the training corpus through statistical learning, have become popular in transliteration because they are independent of both language and phonetic knowledge, and their performance is often significantly higher than that of phonetic-based models.

In some cases, transliteration can be viewed as a special instance of machine translation, so the alignment and decoding methods of statistical machine translation (SMT) [12] can be used in transliteration. Many previous works [13, 14] have built transliteration models based on GIZA++ [3] and the Moses decoder [12]. In these works, a single character or syllable in a name pair is treated as a word in a sentence pair, and a monotonic many-to-many alignment is carried out. Because the alignment is monotonic and the decoding is simpler in machine transliteration, many transliteration-specific methods have been proposed, such as the Alpha-Beta Model [15], the Joint Source Channel Model [16], and M2M-aligner [4]. These models are optimized with expectation maximization and are prone to overfitting.

Recently, nonparametric Bayesian models have been widely used in many natural language processing (NLP) tasks and have achieved competitive results. Several studies have applied this kind of model to transliteration. A Bayesian bilingual alignment model (BBAM) [7] has been proposed to segment and align the training name pairs; it uses a Dirichlet process to model the alignment sequence, treats basic transliteration units as samples, and applies a blocked version of a Gibbs sampler [17] to obtain each transliteration unit. The BBAM has been integrated into many transliteration methods to improve the transliteration task [18] and the transliteration mining task [19]. In [20], Huang et al. propose a nonparametric Bayesian method to train synchronous adaptor grammars for transliteration. In this paper, we also integrate the BBAM [7] to obtain transliteration alignments and ease the overfitting issue.

The multiorigin problem in transliteration was first raised by Huang et al. [20]. They choose an unsupervised bottom-up hierarchical clustering algorithm, use the language and translation models to calculate the similarity of every two clusters, and build several cluster-specific translation and classification models based on the clustering results. Li et al. [21] propose a supervised classification model that classifies name pairs based on their language origins and genders and chooses the most similar cluster-specific model for a source name. Hagiwara and Sekine propose a latent semantic transliteration model (DM-LST) [22] to integrate the clustering and alignment models; it extends their latent class transliteration model [23] and applies a Dirichlet mixture as a prior over the distribution of alignment units to address overfitting.

#### 3. Coupled Dirichlet Process Mixture Model

In [6], Li et al. propose a transliteration model called the coupled Dirichlet process mixture model (cDPMM), which simultaneously clusters and bilingually aligns the training data. It is a fully unsupervised method that can discover an appropriate number of clusters for the training data. In this section we briefly describe this model. We first give the terminology used in the following sections. Then we describe the cDPMM in detail: Section 3.2 presents the Bayesian bilingual alignment model (BBAM), and Section 3.3 describes the cDPMM for clustering and the strategy for coupling it with the BBAM.

##### 3.1. Terminology

In this work, we concentrate on solving the overfitting and multiorigin problems in transliteration. Firstly, we denote the training set as a set of sequence pairs $D = \{(x_i, y_i)\}_{i=1}^{I}$, where $I$ is the number of name pairs in the training data. In the transliteration alignment stage, the cDPMM clusters every bilingual name pair and segments it into bilingual character sequence pairs. As in [6], we call every such character sequence pair a *Transliteration Unit (TU)*. We denote the source side and target side of a TU as $s_1^m = \langle s_1, \ldots, s_m \rangle$ and $t_1^n = \langle t_1, \ldots, t_n \rangle$, respectively, where $s_i$ ($t_j$) is a single character in the source (target) name. So a TU can be denoted as $tu = (s_1^m, t_1^n)$. We use the same notation to denote a transliteration pair $(x, y) = (tu_1, \ldots, tu_K)$, where $K$ is the number of TUs used to segment $(x, y)$. Our aim is to obtain a bilingual alignment $\langle tu_1, \ldots, tu_K \rangle$ for each name pair $(x, y)$, where each $tu_k$ is a segment of the whole pair (that is, a TU). We use $\gamma$ to indicate one derivation for $(x, y)$, and for each $(x, y)$ there is a derivation set $\Gamma$. For each cluster obtained by the cDPMM, we use $c$ to indicate the cluster ID, $D_c$ to indicate the set of name pairs in this cluster, $\theta_c$ to represent the alignment model parameters of cluster $c$, and $\{D_c\}_{c=1}^{C}$ to represent the clustered name pair dataset.
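To make the notions of TU and derivation concrete, the following sketch enumerates every monotonic many-to-many segmentation of a name pair into TUs. It is only an illustration: the function name `derivations` and the length limits `max_src` and `max_tgt` are hypothetical choices, and the actual model samples a derivation rather than listing them all.

```python
from typing import Iterator, List, Tuple

# A TU is a (source segment, target segment) pair of non-empty strings.
TU = Tuple[str, str]


def derivations(src: str, tgt: str,
                max_src: int = 4, max_tgt: int = 2) -> Iterator[List[TU]]:
    """Yield every monotonic segmentation of (src, tgt) into non-empty TUs."""
    if not src and not tgt:
        yield []  # both sides consumed: one complete (empty) tail
        return
    # Try every admissible first TU, then segment the remainder recursively.
    for i in range(1, min(max_src, len(src)) + 1):
        for j in range(1, min(max_tgt, len(tgt)) + 1):
            head = (src[:i], tgt[:j])
            for rest in derivations(src[i:], tgt[j:], max_src, max_tgt):
                yield [head] + rest


for d in derivations("ab", "xy", max_src=2, max_tgt=2):
    print(d)
# [('a', 'x'), ('b', 'y')]
# [('ab', 'xy')]
```

The number of derivations grows quickly with name length, which is why a sampler based on dynamic programming is used in practice instead of exhaustive enumeration.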

##### 3.2. Alignment Model

In this work, we use a Bayesian model [7] in the alignment step, which implements bilingual segmentation and alignment based on a forward filtering and backward sampling algorithm to obtain TUs, and we use a multinomial Dirichlet process to model the alignment process.

###### 3.2.1. Multinomial Dirichlet Process

The alignment component of our cDPMM is a multinomial Dirichlet process. In our work, the Dirichlet process is a stochastic process defined over the set of all possible TUs in the training dataset, and its sample path is a probability distribution on this set. Our model has two basic components: one for generating a TU that has been generated at least once and one for assigning a probability to a TU that has not yet been seen. The probability of generating a new TU should be considerably lower than that of generating a previously seen TU, and the more TUs are generated, the more reliable and complete the distribution becomes. So the model prefers TUs that have been seen over unseen ones. Theoretically, our model can generate any bilingual sequence pair in the training dataset. The Dirichlet process segmentation and alignment model can be written in the following form:

$$
\begin{aligned}
G \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0), \\
tu_k \mid G &\sim G.
\end{aligned}
\tag{1}
$$

$G$ is a sampled discrete probability distribution over the set of TUs, where $G_0$ is the base measure and $\alpha$ is the concentration parameter for the distribution $G$. The larger $\alpha$ is, the more similar $G$ will be to $G_0$.

###### 3.2.2. The Bayesian Bilingual Alignment Model

In our work, the generation process of a TU sequence can be described by the Chinese Restaurant Process (CRP) [24]. Every TU corresponds to the dish served at its table, and the number of customers seated at each table represents the cumulative count of that TU. A new customer who comes to the restaurant either takes a seat at an occupied table with a probability proportional to the number of customers at that table or sits at a new table with a probability proportional to $\alpha$, in which case the dish for the new table is drawn from the base measure $G_0$.
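The seating scheme above can be simulated in a few lines. The sketch below assumes the standard CRP, in which a new table is opened with weight equal to a concentration parameter (called `alpha` here); the function name `crp_seating` is hypothetical, and dishes (TUs) are omitted so that only the table sizes are tracked.

```python
import random


def crp_seating(num_customers, alpha, seed=0):
    """Simulate table sizes under a Chinese Restaurant Process."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for _ in range(num_customers):
        # Occupied table k has weight equal to its size; a new table has weight alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)  # open a new table
        else:
            tables[k] += 1    # join an occupied table
    return tables


sizes = crp_seating(200, alpha=1.0, seed=1)
print(len(sizes), sorted(sizes, reverse=True)[:5])
```

Runs of this simulation typically show a few large tables and many small ones, which is the "rich get richer" behavior that makes the model prefer frequently seen TUs.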

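In [7], Finch and Sumita propose the Bayesian bilingual alignment model (BBAM) and use a joint spelling model to assign probability to new sequence pairs according to

$$
G_0(tu) = \frac{\lambda_s^{|s|}}{|s|!} e^{-\lambda_s} v_s^{-|s|} \times \frac{\lambda_t^{|t|}}{|t|!} e^{-\lambda_t} v_t^{-|t|},
\tag{2}
$$

where $|s|$ ($|t|$) is the length of the source (target) side of the TU, $v_s$ ($v_t$) is the vocabulary (alphabet) size of the source (target) language, and $\lambda_s$ ($\lambda_t$) is the expected length of the source (target) side.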

According to the definition of $G_0$, source and target sequences are generated independently. For a new TU, we first choose a length from a Poisson distribution for the source (target) sequence and then generate the sequence based on the vocabulary size of the source (target) language. Theoretically, $G_0$ can generate a TU of any length, and it favors shorter sequences.
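The generative story above (Poisson-distributed length, then characters chosen according to vocabulary size, independently for each side) can be sketched as follows. The sketch assumes that characters are drawn uniformly from the alphabet; the function names and the default values for the expected lengths (`lam_s`, `lam_t`) and vocabulary sizes (`v_s`, `v_t`) are illustrative and are not values reported in the paper.

```python
import math


def poisson_pmf(k, lam):
    """Probability of length k under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def base_measure(src, tgt, lam_s=2.0, lam_t=1.0, v_s=26, v_t=400):
    """Base-measure probability of a TU (src, tgt).

    Each side contributes a Poisson length term times a uniform choice of
    every character from an alphabet of size v_s (or v_t).
    """
    p_src = poisson_pmf(len(src), lam_s) * v_s ** (-len(src))
    p_tgt = poisson_pmf(len(tgt), lam_t) * v_t ** (-len(tgt))
    return p_src * p_tgt


print(base_measure("k", "金"), base_measure("kim", "金"))
```

Because every additional character multiplies the probability by the reciprocal of the vocabulary size, longer segments receive much less prior mass than shorter ones.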

We use (3) to generate the $k$th TU. It assigns a probability to $tu_k$ based on the properties of the Dirichlet process and the other TUs seen in the history so far:

$$
p(tu_k \mid tu_{-k}) = \frac{N(tu_k) + \alpha G_0(tu_k)}{N + \alpha},
\tag{3}
$$

where $N$ is the total number of TUs generated so far and $N(tu_k)$ is the number of times $tu_k$ has been seen in the history. $tu_{-k}$ denotes all the TUs generated so far except $tu_k$. $G_0$ and $\alpha$ are the base measure and concentration parameter as before.
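The predictive rule above can be sketched as a small count-based cache: the probability of a TU combines its count in the history with a base-measure term weighted by the concentration parameter, normalized by the total count plus the concentration parameter. The class name `TUModel` and its methods are hypothetical, and the base measure is passed in as a callable so that the sketch stays self-contained.

```python
from collections import Counter


class TUModel:
    """Count-based Dirichlet process predictive distribution over TUs."""

    def __init__(self, alpha, base_measure):
        self.alpha = alpha                # concentration parameter
        self.base_measure = base_measure  # callable: TU -> prior probability
        self.counts = Counter()           # how often each TU appears in the history
        self.total = 0                    # total number of TUs in the history

    def add(self, tu):
        self.counts[tu] += 1
        self.total += 1

    def remove(self, tu):
        # Used before resampling, so a pair's own TUs are excluded from the history.
        self.counts[tu] -= 1
        self.total -= 1
        if self.counts[tu] == 0:
            del self.counts[tu]

    def prob(self, tu):
        prior = self.alpha * self.base_measure(tu)
        return (self.counts[tu] + prior) / (self.total + self.alpha)


# Constant toy base measure, for illustration only.
model = TUModel(alpha=1.0, base_measure=lambda tu: 0.01)
for _ in range(3):
    model.add(("jin", "金"))
print(model.prob(("jin", "金")), model.prob(("kim", "金")))
```

An unseen TU still receives nonzero probability through the base-measure term, which is what distinguishes this estimate from a pure maximum likelihood count ratio.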

###### 3.2.3. The Blocked Gibbs Sampling

In the BBAM the basic samples are TUs. However, the training data in our transliteration problem are name pairs, so a blocked Gibbs sampling algorithm [25] is used here to obtain a whole TU sequence for a name pair. Algorithm 1 is our blocked Gibbs sampling algorithm. We use the last updated alignment model, which does not include the current name pair to be sampled, to calculate the probability for each derivation of that name pair based on