Abstract

Globalization and multilingualism contribute to code-switching, the phenomenon in which speakers produce utterances containing words or expressions from a second language. Processing code-switched sentences is a significant challenge for multilingual intelligent systems. This study proposes a language modeling approach to the problem of code-switching language processing, dividing the problem into two subtasks: the detection of code-switched sentences and the identification of code-switched words in sentences. A code-switched sentence is detected on the basis of whether it contains words or phrases from another language. Once the code-switched sentences are identified, the positions of the code-switched words in the sentences are then identified. Experimental results show that the language modeling approach achieved an F-measure of 80.43% and an accuracy of 79.01% for detecting Mandarin-Taiwanese code-switched sentences. For the identification of code-switched words, the word-based and POS-based models, respectively, achieved F-measures of 41.09% and 53.08%.

1. Introduction

Growing globalization and multilingualism have significantly increased the demand for multilingual services in current intelligent systems [1]. For example, an intelligent traveling system that supports multiple language inputs and outputs can assist travelers in booking hotels, ordering in restaurants, and navigating attractions. Multinational corporations would benefit from developing automatic multilingual call centers to address customer problems worldwide. In such multilingual environments, an input sentence may contain constituents from two or more languages, a phenomenon known as code-switching or language mixing [2–6]. Table 1 lists several definitions of code-switching described in previous studies.

A code-switched sentence consists of a primary language and a secondary language, and the secondary language is usually manifested in the form of short expressions, such as words and phrases. This phenomenon is increasingly common, with multilingual speakers often freely moving from their native dialect to subsidiary dialects to entirely foreign languages, and patterns of code-switching vary dynamically with different audiences in different situations. When dealing with code-switched input, intelligent systems such as dialog systems must be capable of identifying the various languages and recognizing the speaker’s intention embedded in the input [7, 8]. However, dealing with multiple languages and unknown words from various languages remains a significant challenge for intelligent systems.

In Taiwan, while Mandarin is the official language, Taiwanese and Hakka are used as a primary language by more than 75% and 10% of the population, respectively [9]. Moreover, English is the most popular foreign language and compulsory English instruction begins in elementary school. The constant mix of these languages results in various kinds of code-switching, such as Mandarin sentences mixed with words and phrases from Taiwanese, Hakka, and English. Such code-switching is not limited to everyday conversation but can frequently be heard in television dramas and even current events commentary programs. This paper takes a linguistic view towards the problem of code-switching language processing, focusing on code-switching between Mandarin and Taiwanese. We propose a language modeling approach which divides the problem into two subtasks: the detection of code-switched sentences followed by identification of code-switched words within the sentences. The first step detects whether or not a given Mandarin sentence contains Taiwanese words. Once a code-switched sentence is identified, the positions of the code-switched words are then identified within the sentence. These code-switched words can be used for lexicon augmentation to improve understanding of code-switched sentences.

The rest of this work is organized as follows. Section 2 presents related work. Section 3 describes the language modeling approach to the identification of code-switched sentences and words in the sentences. Section 4 summarizes the experimental results. Conclusions are finally drawn in Section 5, along with recommendations for future research.

2. Related Work

Research on code-switching speech processing mainly focuses on speech recognition [9–14], language identification [15, 16], text-to-speech synthesis [17], and code-switching speech database creation [18]. Lyu et al. proposed a three-step data-driven phone clustering method to train an acoustic model for Mandarin, Taiwanese, and Hakka [9]. They also discussed the issue of training with unbalanced data. Wu et al. proposed an approach to segmenting and identifying mixed-language speech utterances [10]. They first segmented the input speech utterance into a sequence of language-dependent segments using acoustic features. The language-specific features were then integrated in the identification process. Chan et al. developed a Cantonese-English mixed-language speech recognition system, including acoustic modeling, language modeling, and language identification algorithms [11]. Hong et al. developed a Mandarin-English mixed-language speech recognition system for resource-constrained environments, which can be realized in embedded systems such as personal digital assistants (PDAs) [12]. Ahmed and Tan proposed a two-pass code-switching speech recognition framework consisting of automatic speech recognition followed by rescoring [13]. Vu et al. recently developed a speech recognition system for code-switching in conversational speech [14]. For language identification, Lyu et al. proposed a word-based lexical model integrating acoustic, phonetic, and lexical cues to build a language identification system [15]. Yeong and Tan proposed the use of morphological structures and syllable sequences for language identification in Malay-English code-switching sentences [16]. For speech synthesis, Qian et al. developed a text-to-speech system that can generate Mandarin-English mixed-language utterances [17].

Research on code-switching and multilingual language processing includes applications of text mining [19–22], information retrieval [23–25], ontology-based knowledge management [26], and unknown word extraction [27]. For text mining, Seki et al. extracted opinion holders for discriminating opinions that are viewed from different perspectives (author and authority) in both Japanese and English [19]. Yang et al. used self-organizing maps to cluster multilingual documents [20]. A multilingual Web directory was then constructed to facilitate multilingual Web navigation. Zhang et al. addressed the problem of multilingual sentence categorization and novelty mining on English, Malay, and Chinese sentences [21]. They proposed to first categorize similar sentences and then identify new information from them. De Pablo-Sánchez et al. devised a bootstrapping algorithm to acquire named entities and linguistic patterns from English and Spanish news corpora [22]. This lightly supervised method can acquire useful information from unannotated corpora using a small set of seeds provided by human experts. For information retrieval, Gey et al. pointed out several directions for cross-lingual information retrieval (CLIR) research [23]. Tsai et al. used the FRank ranking algorithm to build a merge model for multilingual information retrieval [24]. Jung discovered useful multilingual tags annotated in social texts [25]. He then used these tags for query expansion to allow users to query in one language but obtain additional information in another language. For other application domains, Segev and Gal proposed an ontology-based knowledge management model to enhance portability and reduce costs in multilingual information systems deployment [26]. Wu et al. proposed the use of mutual information and entropy to extract unknown words from code-switched sentences [27].

3. Language Modeling Approach

Language modeling approaches have been successfully used in many applications, such as grammar error correction [28], code-switching language processing [29], and lexical substitution [30–32]. For our task, a code-switched sentence generally has a higher probability under a code-switching language model than under a noncode-switching one. Thus, we built code-switching and noncode-switching language models and compared the probabilities they assign in order to identify code-switched sentences and code-switched words within the sentences. Figure 1 shows the system framework. First, a corpus of code-switched and noncode-switched sentences is collected to build the respective code-switching and noncode-switching language models. To identify code-switched sentences, we compare the probability of each test sentence output by the code-switching language model against the output of the noncode-switching one to determine whether or not the test sentence is code-switched. To identify code-switched words within the sentences, we select the $n$-gram with the highest probability output by the code-switching language model and then compare it against the output of the noncode-switching one to verify whether the $i$th word in the given sentence is a code-switched word.

3.1. Corpus Collection

A noncode-switching corpus refers to a set of sentences containing just one language. Because Mandarin is the primary language in this study, we used the Sinica corpus released by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) as the noncode-switching corpus. A code-switching corpus refers to a set of Mandarin sentences featuring Taiwanese words. However, it can be difficult to collect a large number of such sentences, and training a language model on insufficient data may incur the data sparseness problem. Therefore, we used more common Mandarin-English sentences as the code-switching corpus, based on the assumption that the code-switching phenomenon in Mandarin-English sentences has a certain degree of similarity to Mandarin-Taiwanese sentences, because in Taiwan, both English and Taiwanese are secondary languages with respect to Mandarin. The Mandarin-English sentences were collected from a large corpus of web-based news articles, which were then segmented using the CKIP word segmentation system developed by the Academia Sinica, Taiwan (http://ckipsvr.iis.sinica.edu.tw/) [33, 34]. The sentences containing words with the part-of-speech (POS) tag “FW” (i.e., foreign word) were selected as code-switched sentences.
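The selection step described above can be sketched as a simple filter over POS-tagged sentences. This is a minimal illustration, not the authors' pipeline: it assumes the segmenter output has already been parsed into (word, tag) pairs, and the function names and the two toy sentences (including their non-FW tags) are hypothetical.

```python
from typing import List, Tuple

# Assumption: each sentence is a list of (word, POS tag) pairs
# already produced by a word segmentation and tagging step.
TaggedSentence = List[Tuple[str, str]]

def has_foreign_word(sentence: TaggedSentence, foreign_tag: str = "FW") -> bool:
    """Return True if any token carries the foreign-word tag."""
    return any(tag == foreign_tag for _, tag in sentence)

def select_code_switched(corpus: List[TaggedSentence]) -> List[TaggedSentence]:
    """Keep only the sentences that contain at least one foreign word."""
    return [sentence for sentence in corpus if has_foreign_word(sentence)]

# Hypothetical toy corpus; only the "FW" tag is taken from the text above.
toy_corpus = [
    [("我", "Nh"), ("喜歡", "VK"), ("coffee", "FW")],
    [("今天", "Nd"), ("天氣", "Na"), ("很", "Dfa"), ("好", "VH")],
]
code_switched = select_code_switched(toy_corpus)
print(len(code_switched))
```

Only the first toy sentence survives the filter, since it is the only one containing a token tagged “FW”.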

3.2. Detection of Code-Switched Sentences

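Generally, an $n$-gram language model is used to predict the $i$th word based on the previous $n-1$ words using a probability function $P(w_i \mid w_{i-n+1}^{i-1})$. Given a sentence $s = w_1 w_2 \cdots w_k$, the noncode-switching $n$-gram language model is defined as

$$P_{\mathrm{NCS}}(s) = \prod_{i=1}^{k} P(w_i \mid w_{i-n+1}^{i-1}), \tag{1}$$

where $P(w_i \mid w_{i-n+1}^{i-1})$ is estimated by

$$P(w_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}, \tag{2}$$

where $C(\cdot)$ denotes the frequency counts of the $n$-grams retrieved from the noncode-switching corpus (i.e., Sinica corpus). Instead of estimating the surface form of the next word, the code-switching $n$-gram language model estimates the probability that the next word is a code-switched word, that is, $P(cs_i \mid w_{i-n+1}^{i-1})$, defined as

$$P_{\mathrm{CS}}(s) = \prod_{i=1}^{k} P(cs_i \mid w_{i-n+1}^{i-1}), \tag{3}$$

where $P(cs_i \mid w_{i-n+1}^{i-1})$ is estimated by

$$P(cs_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i-1}\, cs_i)}{C(w_{i-n+1}^{i-1})}. \tag{4}$$

To estimate $P(cs_i \mid w_{i-n+1}^{i-1})$, the code-switching corpus is processed by replacing the code-switched words (i.e., the words with the POS tag “FW”) in the Mandarin-English sentences with a special character $cs$. The frequency counts $C(w_{i-n+1}^{i-1}\, cs_i)$ can then be retrieved from the code-switching corpus. This processing may also reduce the effect of the data sparseness problem in language model training.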
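As a rough illustration of the count-based estimate for the code-switching model, the sketch below masks foreign words with a placeholder and computes a bigram relative frequency. It is a simplified stand-in under stated assumptions: the placeholder string `<cs>`, the start symbol `<s>`, the function names, and the toy sentences are all hypothetical, and no smoothing is applied, whereas a toolkit such as SRILM would normally smooth the counts.

```python
from collections import Counter
from typing import List, Tuple

CS = "<cs>"      # hypothetical placeholder for a code-switched word
START = "<s>"    # hypothetical sentence-start symbol

def mask_foreign(sentence: List[Tuple[str, str]], foreign_tag: str = "FW") -> List[str]:
    """Replace every word tagged as foreign with the placeholder symbol."""
    return [CS if tag == foreign_tag else word for word, tag in sentence]

def count_bigrams(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count histories (previous words) and bigrams over the masked corpus."""
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        padded = [START] + words
        for prev, cur in zip(padded, padded[1:]):
            histories[prev] += 1
            bigrams[(prev, cur)] += 1
    return histories, bigrams

def prob_next_is_cs(prev: str, histories: Counter, bigrams: Counter) -> float:
    """Unsmoothed relative frequency that the word after `prev` is code-switched."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, CS)] / histories[prev]

# Hypothetical toy data: one code-switched sentence, one Mandarin-only sentence.
tagged = [
    [("我", "Nh"), ("喜歡", "VK"), ("coffee", "FW")],
    [("我", "Nh"), ("喜歡", "VK"), ("茶", "Na")],
]
masked = [mask_foreign(sentence) for sentence in tagged]
histories, bigrams = count_bigrams(masked)
print(prob_next_is_cs("喜歡", histories, bigrams))  # 0.5
```

In the toy data the history 喜歡 occurs twice and is followed by the placeholder once, which gives the relative frequency 1/2. Masking all foreign words to one symbol pools their counts, which is why the replacement step can ease data sparseness.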

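Once the two language models are built, they can be compared to detect whether a given sentence contains code-switching. That is,

$$s \text{ is } \begin{cases} \text{code-switched}, & \text{if } P_{\mathrm{CS}}(s) > P_{\mathrm{NCS}}(s), \\ \text{noncode-switched}, & \text{otherwise}. \end{cases} \tag{5}$$

The sentence is predicted to be a code-switched sentence if the probability of the sentence output by the code-switching language model is greater than that output by the noncode-switching one (i.e., $P_{\mathrm{CS}}(s) > P_{\mathrm{NCS}}(s)$).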
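The decision rule above can be sketched with two bigram scoring functions compared in log space. This is an illustration under assumptions rather than the authors' implementation: the probability tables are made-up toy values, the function names are hypothetical, and a small probability floor stands in for the smoothing that a real language modeling toolkit would provide.

```python
import math
from typing import Callable, List

FLOOR = 1e-12  # assumption: crude floor used in place of proper smoothing

def sentence_log_prob(words: List[str], cond_prob: Callable[[str, str], float]) -> float:
    """Sum of log bigram probabilities over the sentence, with a start symbol."""
    padded = ["<s>"] + words
    return sum(math.log(max(cond_prob(prev, cur), FLOOR))
               for prev, cur in zip(padded, padded[1:]))

def is_code_switched(words: List[str],
                     ncs_prob: Callable[[str, str], float],
                     cs_prob: Callable[[str], float]) -> bool:
    """Predict code-switching when the code-switching model scores higher."""
    ncs_score = sentence_log_prob(words, ncs_prob)
    # The code-switching model ignores the surface form of the next word.
    cs_score = sentence_log_prob(words, lambda prev, _cur: cs_prob(prev))
    return cs_score > ncs_score

# Hypothetical toy probability tables (not estimated from any real corpus).
NCS_TABLE = {("<s>", "我"): 0.2, ("我", "喜歡"): 0.1, ("喜歡", "茶"): 0.05}
CS_TABLE = {"<s>": 0.01, "我": 0.02, "喜歡": 0.3}

def toy_ncs_prob(prev: str, cur: str) -> float:
    return NCS_TABLE.get((prev, cur), 0.0)

def toy_cs_prob(prev: str) -> float:
    return CS_TABLE.get(prev, 0.0)

mandarin_only = ["我", "喜歡", "茶"]
with_unseen_word = ["我", "喜歡", "呷"]  # last word is absent from the toy Mandarin table

print(is_code_switched(mandarin_only, toy_ncs_prob, toy_cs_prob))     # False
print(is_code_switched(with_unseen_word, toy_ncs_prob, toy_cs_prob))  # True
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and it leaves the comparison unchanged because the logarithm is monotonic.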

3.3. Identification of Code-Switched Words

This step identifies the positions of the code-switched words within the sentences. To this end, the code-switching $n$-gram language model (3) is applied to each test sentence and the probability of being a code-switched word is assigned to every next word (position) in the sentence. Among all the $n$-grams in the sentence, the one with the highest probability indicates the most likely position of a code-switched word. That is,

$$cs^{*} = \arg\max_{i} P(cs_i \mid w_{i-n+1}^{i-1}), \tag{6}$$

where $cs^{*}$ denotes the best hypothesis of the code-switched word in the sentence. However, not all $n$-grams with the highest probability suggest correct positions. Therefore, we further propose a verification mechanism to determine whether to accept the best hypothesis. That is,

$$cs^{*} \text{ is } \begin{cases} \text{accepted}, & \text{if } P_{\mathrm{CS}}(cs^{*}) > P_{\mathrm{NCS}}(cs^{*}), \\ \text{rejected}, & \text{otherwise}, \end{cases} \tag{7}$$

where $P_{\mathrm{CS}}(cs^{*})$ represents the probability of the best hypothesis in the code-switching corpus and $P_{\mathrm{NCS}}(cs^{*})$ represents its probability in the noncode-switching corpus. The best hypothesis is accepted if its probability in the code-switching corpus is greater than that in the noncode-switching corpus.
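The two stages described above, picking the most likely position and then verifying it, can be sketched as follows for the bigram case. This is only one possible reading: it assumes that the noncode-switching probability used for verification is the probability of the actual word at the chosen position given its history, which the text does not spell out. The function names and toy tables are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def best_hypothesis(words: List[str],
                    cs_prob: Callable[[str], float]) -> Tuple[int, float]:
    """Return the position whose history makes a code-switched word most likely."""
    histories = ["<s>"] + words[:-1]
    scores = [cs_prob(prev) for prev in histories]
    position = max(range(len(words)), key=lambda i: scores[i])
    return position, scores[position]

def identify_code_switched_word(words: List[str],
                                cs_prob: Callable[[str], float],
                                ncs_prob: Callable[[str, str], float]) -> Optional[int]:
    """Accept the best position only if the code-switching score is higher."""
    position, cs_score = best_hypothesis(words, cs_prob)
    history = "<s>" if position == 0 else words[position - 1]
    # Assumption: verify against the probability of the observed word itself.
    ncs_score = ncs_prob(history, words[position])
    return position if cs_score > ncs_score else None

# Hypothetical toy probability tables (not estimated from any real corpus).
NCS_TABLE = {("<s>", "我"): 0.2, ("我", "喜歡"): 0.1, ("喜歡", "茶"): 0.4}
CS_TABLE = {"<s>": 0.01, "我": 0.02, "喜歡": 0.3}

def toy_ncs_prob(prev: str, cur: str) -> float:
    return NCS_TABLE.get((prev, cur), 0.0)

def toy_cs_prob(prev: str) -> float:
    return CS_TABLE.get(prev, 0.0)

print(identify_code_switched_word(["我", "喜歡", "茶"], toy_cs_prob, toy_ncs_prob))  # None
print(identify_code_switched_word(["我", "喜歡", "呷"], toy_cs_prob, toy_ncs_prob))  # 2
```

In both toy sentences the third position has the highest code-switching score (0.3). The hypothesis is rejected for the first sentence because the Mandarin model assigns 0.4 to the observed word, and accepted for the second because the observed word is unseen in the toy Mandarin table.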

4. Experimental Results

This section first explains the experimental setup, including experiment data, implementation of language modeling, and evaluation metrics. We then present experimental results for the identification of both Mandarin-Taiwanese and Mandarin-English code-switched sentences and words within the sentences.

4.1. Experimental Setup

The test set included 393 sentences, of which 131 were Mandarin only (i.e., noncode-switched), another 131 were Mandarin sentences containing Taiwanese words, and the remaining 131 were Mandarin sentences containing English words. For the evaluation of Mandarin-Taiwanese sentences, $n$-gram models for both code-switching and noncode-switching were trained using the SRILM toolkit [35] with $n = 2$ and 3 (i.e., bigram and trigram). For the evaluation of Mandarin-English sentences, the CKIP word segmentation system [33, 34] was used because it can assign the POS tag “FW” to English words/characters within the sentences. The evaluation metrics included recall, precision, F-measure, and accuracy. The recall was defined as the number of code-switched sentences correctly identified by the method divided by the total number of code-switched sentences in the test set. The precision was defined as the number of code-switched sentences correctly identified by the method divided by the number of code-switched sentences identified by the method. The F-measure was defined as (2 × recall × precision)/(recall + precision). The accuracy was defined as the number of sentences correctly identified by the method divided by the total number of sentences in the test set.

4.2. Results
4.2.1. Evaluation on Mandarin-Taiwanese Code-Switched Sentences

To identify Mandarin-Taiwanese code-switched sentences, the code-switching and noncode-switching bigram/trigram language models were used to determine whether or not each test sentence features code-switching using (5), with results presented in Table 2. The bigram language model correctly identified 113 code-switched sentences and 94 noncode-switched sentences, thus yielding 86.26% (113/131) recall, 75.33% (113/150) precision, 80.43% F-measure, and 79.01% (207/262) accuracy. The trigram language model, however, did not outperform the bigram model, possibly due to the data sparseness problem caused by a lack of sufficient training data for building the trigram language model.
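The bigram figures reported above can be reproduced directly from the stated counts. The short sketch below recomputes them from the metric definitions in Section 4.1; the variable names are illustrative, and the only inputs are the numbers given in the text (113 code-switched and 94 noncode-switched sentences correctly identified, out of 131 each).

```python
# Counts taken from the reported bigram results.
total_code_switched = 131
total_noncode_switched = 131
true_positives = 113   # code-switched sentences correctly identified
true_negatives = 94    # noncode-switched sentences correctly identified

# Noncode-switched sentences wrongly flagged as code-switched.
false_positives = total_noncode_switched - true_negatives
predicted_code_switched = true_positives + false_positives

recall = true_positives / total_code_switched
precision = true_positives / predicted_code_switched
f_measure = 2 * recall * precision / (recall + precision)
accuracy = (true_positives + true_negatives) / (total_code_switched + total_noncode_switched)

print(f"predicted as code-switched: {predicted_code_switched}")  # 150
print(f"recall    = {recall:.2%}")     # 86.26%
print(f"precision = {precision:.2%}")  # 75.33%
print(f"F-measure = {f_measure:.2%}")  # 80.43%
print(f"accuracy  = {accuracy:.2%}")   # 79.01%
```

The denominator of 150 in the precision follows from adding the 37 noncode-switched sentences that were misclassified (131 minus 94) to the 113 correct detections.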

To identify code-switched words in Mandarin-Taiwanese code-switched sentences, all word bigrams and trigrams in each test sentence were first ranked according to their probabilities. The top $N$ word bigrams/trigrams were then selected as candidates for further verification using (7). For instance, top 1 means that the bigram/trigram with the highest probability in a given test sentence is considered a candidate. If the candidate $n$-gram is accepted by the verification method, then the position indicated by the $n$-gram will be considered a foreign word. Similarly, top 2 means that the method can propose two candidates for verification. To examine the effect of the data sparseness problem, we used the part-of-speech (POS) tags of words to build additional POS bigram/trigram models from the code-switching corpus. In addition to the word/POS $n$-gram models, we also implemented a baseline system to randomly guess the positions of code-switched words in the sentences, and the top $N$, herein, means that the system can randomly propose $N$ candidate positions. Table 3 shows the results for the identification of code-switched words.
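The candidate selection and the random baseline described above can be sketched as two small helpers. This is an illustrative sketch only: the function names and the example scores are hypothetical, and it assumes the baseline draws distinct positions uniformly at random, which the text does not specify.

```python
import random
from typing import List

def top_n_positions(scores: List[float], n: int) -> List[int]:
    """Rank positions by model score and keep the n highest as candidates."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

def random_positions(sentence_length: int, n: int, rng: random.Random) -> List[int]:
    """Baseline: propose up to n distinct positions chosen uniformly at random."""
    return rng.sample(range(sentence_length), min(n, sentence_length))

# Hypothetical per-position code-switching scores for a four-word sentence.
position_scores = [0.01, 0.02, 0.30, 0.10]

print(top_n_positions(position_scores, 1))  # [2]
print(top_n_positions(position_scores, 2))  # [2, 3]

rng = random.Random(0)  # fixed seed so the baseline is repeatable
baseline_guess = random_positions(len(position_scores), 2, rng)
print(baseline_guess)
```

Each candidate returned by `top_n_positions` would still pass through the verification step before being reported, so increasing the number of candidates can raise recall while admitting more false positives.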

The results show that the F-measure of the baseline system (Random) was only around 18~25%, indicating that identifying code-switched words is more difficult than identifying code-switched sentences. In addition, the proposed word/POS $n$-gram models significantly outperformed Random. For the word-based $n$-gram models, the word bigram model achieved an F-measure of around 41%, which was much better than that of both the word trigram model and Random. Once the POS tags were used to build the language models, both the POS bigram and trigram models outperformed their corresponding word-based models in terms of F-measure, as well as for recall and precision. This finding indicates that training with the POS tags can reduce the impact of the data sparseness problem. In addition, as shown in Figure 2, the accuracy improvement derived from the trigram model was significantly greater than that from the bigram model, because the trigram model tends to suffer from a more serious data sparseness problem than the bigram model when training data is insufficient. Overall, the best performance of the POS $n$-gram models was achieved at an F-measure of 53.08% (POS trigram, top 1).

Code-switched word identification can also be evaluated by allowing the methods to propose more than one candidate, that is, top 1 to top 3. Table 3 shows that, with more candidates included for verification, more code-switched words were correctly identified, thus dramatically increasing the recall of all methods, but at the cost of reduced precision. Overall, the F-measure of top 2 was increased for all methods except for the POS trigram, but for top 3, increasing the number of candidates only increased the F-measure of Random and word trigram.

4.2.2. Evaluation on Mandarin-English Code-Switched Sentences

To identify code-switched words in Mandarin-English code-switched sentences, the words assigned the POS tag “FW” (representing a foreign word) by the CKIP word segmentation system were proposed as the answers. The Random system was also implemented to guess the English words in the test sentences. Table 4 shows the comparative results. As expected, the CKIP word segmentation system can provide very precise information for identifying English words in sentences, thus yielding very good performance. The CKIP system has been under development for over ten years and is still updated periodically. For the Random system, the F-measure was around 19~27%, which was similar to that (18~25%, Table 3) for code-switched word identification in Mandarin-Taiwanese code-switched sentences.

5. Conclusions

This work presents a language modeling method for identifying sentences featuring code-switching and for identifying the code-switched words within those sentences. Experimental results show that the language modeling approach achieved an F-measure of 80.43% and an accuracy of 79.01% for the detection of Mandarin-Taiwanese code-switched sentences. For the identification of code-switched words in Mandarin-Taiwanese code-switched sentences, the POS $n$-gram models outperformed the word $n$-gram models, mainly because of the reduced impact of the data sparseness problem. The highest F-measures (top 1) for the word-based and POS-based models were 41.09% and 53.08%, respectively. For code-switched word identification in Mandarin-English code-switched sentences, the CKIP word segmentation system achieved very high performance (95.02% F-measure).

Future work will focus on improving system performance by incorporating other effective machine learning algorithms and features, such as sentence structure analysis. The proposed method could also be integrated into practical applications such as a multilingual dialog system to improve effectiveness in dealing with the code-switching problem.

Acknowledgments

This research was partially supported by the National Science Council (NSC), Taiwan, under Grant nos. NSC 99-2221-E-155-036-MY3, NSC 100-2511-S-003-053-MY2, and NSC 102-2221-E-155-029-MY3, and the “Aim for the Top University Project” of National Taiwan Normal University (NTNU), sponsored by the Ministry of Education, Taiwan. The authors are also grateful for the support of the International Research-Intensive Center of Excellence Program of NTNU and NSC, Taiwan, under Grant no. NSC 102-2911-I-003-301.