Table of Contents Author Guidelines Submit a Manuscript
The Scientific World Journal
Volume 2014 (2014), Article ID 959328, 7 pages
http://dx.doi.org/10.1155/2014/959328
Research Article

Chinese Unknown Word Recognition for PCFG-LA Parsing

NLP2CT Laboratory, Department of Computer and Information Science, University of Macau, Macau

Received 30 August 2013; Accepted 10 March 2014; Published 9 April 2014

Academic Editors: J. Shu and F. Yu

Copyright © 2014 Qiuping Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This paper investigates the recognition of unknown words in Chinese parsing. Two methods are proposed to handle this problem. One is the modification of a character-based model. We model the emission probability of an unknown word using the first and last characters in the word. It aims to reduce the POS tag ambiguities of unknown words to improve the parsing performance. In addition, a novel method, using graph-based semisupervised learning (SSL), is proposed to improve the syntax parsing of unknown words. Its goal is to discover additional lexical knowledge from a large amount of unlabeled data to help the syntax parsing. The method is mainly to propagate lexical emission probabilities to unknown words by building the similarity graphs over the words of labeled and unlabeled data. The derived distributions are incorporated into the parsing process. The proposed methods are effective in dealing with the unknown words to improve the parsing. Empirical results for Penn Chinese Treebank and TCT Treebank revealed its effectiveness.