Abstract

In order to use semantics more effectively in natural language processing, a word sense disambiguation (WSD) method for Chinese based on semantics calculation was proposed. WSD for a Chinese clause is achieved by solving a semantic model of the natural language; each step of the WSD process is discussed in detail, and the computational complexity of the process is analyzed. Finally, experiments were carried out to verify the effectiveness of the method.

1. Introduction

Currently, semantics is becoming more and more important in natural language processing. Scholars have made great progress in WSD research by analyzing semantic relations.

Based on the semantic relevancy calculated according to HowNet, a WSD method was discussed [1]. A WSD algorithm which disambiguates polysemous words by the semantic relatedness in WordNet was proposed [2]. A two-stage WSD method was researched according to the semantic information in Wikipedia [3]. Using the distance between words in a graph-based model, a graph-based WSD method was studied [4]. Chinese sentences were disambiguated based on HowNet in a question answering system [5]. A WSD algorithm based on the semantic relevancy in HowNet was researched [6]. A pruning algorithm for the semantic relevancy calculation model of natural language was studied [7]. WSD can be achieved by solving a model based on WordNet [8]. According to the semantic tree in WordNet, word senses can be disambiguated [9, 10].

Although this research has achieved considerable results, the WSD results are still not accurate enough in practice. To solve the problem more effectively and accurately, a WSD algorithm for Chinese sentences is proposed. With this method, a Chinese clause can be disambiguated by analyzing its semantic relevancy. Finally, we verify the effectiveness of the method through experiments.

2. The Basic Theory

2.1. The Semantic Relevancy Calculation Model

Suppose that each word $W_i$ (except for the predicate words) in a sentence $T$ semantically describes another word $W_j$; the semantic relevancy between $W_i$ and $W_j$ can be represented by the correlation function $f(W_i, W_j)$.

Suppose there are $m$ kinds of parsing processes for the sentence $T$; in the $k$th parsing process $P_k$, $V$ denotes the predicate words, $S$ the subject words, and $O$ the object words. The semantic relevancy of the sentence $T$ under $P_k$ can be expressed by formula (1), shown in Figure 1.

In formula (1), $n$ is the number of words in $T$ (not including $S$, $V$, and $O$) and $\lambda_i$ is a weight coefficient; generally, $\lambda_i$ should be proportional to the length of the sentence.
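Since formula (1) is available only as the image in Figure 1, the following LaTeX sketch shows one plausible form consistent with the description above; the symbols $R$, $\lambda_S$, $\lambda_O$, and $W_{d(i)}$ (the word that $W_i$ describes) are illustrative names, not necessarily the authors' notation:

% Hedged sketch of formula (1): a weighted sum of pairwise correlation values.
% W_{d(i)} denotes the word that W_i semantically describes; the authoritative
% formula is the one shown in Figure 1.
R(T, P_k) = \lambda_S\, f(S, V) + \lambda_O\, f(V, O) + \sum_{i=1}^{n} \lambda_i\, f\bigl(W_i, W_{d(i)}\bigr)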

The Basic Principle of Model Solution. The most reasonable parsing process is the one with the maximum semantic relevancy among all $m$ parsing processes.

In the calculation process, purely grammatical words should be neglected.
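As an illustration of the model-solution principle, the following Python sketch scores each candidate parsing process by the weighted sum of its pairwise relevancies and keeps the parse with the maximum score. The data layout and the pairwise function f are assumptions made for the sketch; in the paper, f would be a HowNet-based relevancy measure.

def sentence_relevancy(parse, f):
    """Score one candidate parsing process P_k.

    parse is a hypothetical structure:
      parse["svo"]     -> [("S_word", "V_word"), ("V_word", "O_word")]
      parse["pairs"]   -> [(W_i, W_j), ...]   # W_i semantically describes W_j
      parse["weights"] -> {pair: lambda_i}    # weight coefficients
    f(a, b) is the pairwise correlation function.
    """
    total = 0.0
    for pair in parse["svo"] + parse["pairs"]:
        total += parse["weights"].get(pair, 1.0) * f(*pair)
    return total

def best_parse(parses, f):
    # The most reasonable parse is the one with maximum semantic relevancy.
    return max(parses, key=lambda p: sentence_relevancy(p, f))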

2.2. The Basic Method to Solve the Model

According to their semantic structure, all sentences in Chinese can be divided into two kinds:
(i) simple sentences: sentences without subordinate clauses;
(ii) complex sentences: sentences with subordinate clauses.

In the process of solving the model, a simple clause is selected and resolved into a single word, and this resolution is repeated until the whole sentence becomes a simple sentence (a sketch is given below). During the resolution process, WSD can be carried out.
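A minimal sketch of this resolution loop in Python, assuming hypothetical helpers find_simple_clause (returns the span of an innermost simple clause, or None if the sentence is already simple), disambiguate_clause (Steps 1 to 7 of Section 3, returning the chosen sense per word), and head_word (the word a resolved clause is reduced to):

def resolve_sentence(words, find_simple_clause, disambiguate_clause, head_word):
    """Repeatedly resolve an innermost simple clause into a single word.

    WSD is performed on each clause as it is resolved; the loop stops when
    the remaining word sequence is itself a simple sentence.
    """
    senses = {}
    while True:
        span = find_simple_clause(words)            # (start, end) or None
        if span is None:                            # already a simple sentence
            senses.update(disambiguate_clause(words))
            return senses
        start, end = span
        clause = words[start:end]
        senses.update(disambiguate_clause(clause))
        words = words[:start] + [head_word(clause)] + words[end:]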

3. The Word Sense Disambiguation Process

Most words in a Chinese sentence are polysemous; the WSD process can be carried out by the following steps.

3.1. Get All the “V-Sequences” for a Sentence

If a word $W$ in a sentence is polysemous and one of its senses may be a verb or an adjective, the word $W$ is classified as a “V-Word.” A “V-Word” is a word that may serve as the predicate word ($V$) in the sentence.

Select all the “V-Words” and leave the other words unchanged; we can then arrange all the possible “V-sequences” for the sentence. When a “V-Word” is arranged, no matter how many senses it has, it is treated as one of two kinds: {V, a common word}. Combinatorially, a sentence with $n_V$ “V-Words” can therefore be arranged into $2^{n_V}$ kinds of “V-sequences.” As an example, Figure 2 shows all the “V-sequences” for a sentence with 3 “V-Words”; a sketch of the enumeration follows.
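A short Python sketch of this enumeration, assuming the sentence is given as a word list and a caller-supplied predicate is_v_word marks the “V-Words”; each “V-Word” is assigned either the role "V" or "common", which yields the $2^{n_V}$ “V-sequences”:

from itertools import product

def v_sequences(words, is_v_word):
    """Enumerate all V-sequences: every assignment of each V-Word to
    either 'V' (predicate) or 'common' (ordinary word)."""
    v_positions = [i for i, w in enumerate(words) if is_v_word(w)]
    for choice in product(("V", "common"), repeat=len(v_positions)):
        tags = ["common"] * len(words)
        for pos, role in zip(v_positions, choice):
            tags[pos] = role
        yield list(zip(words, tags))

For a sentence with 3 “V-Words,” this generator yields the 8 “V-sequences” of Figure 2.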

3.2. Get All the Simple Sentences for a “V-Sequence”

Generally, a simple clause contains only one “V-Word,” so it is easy to get all the candidate simple sentences by the exhaustive method. As an example, Figure 3 shows all the candidate simple sentences for a “V-Word” ($V_i$) in a “V-sequence.”

In Figure 3, the number of candidate simple sentences for $V_i$ is bounded, since a candidate clause can only start after the preceding “V-Word” and end before the following one; a sketch of the boundary enumeration is given below.
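The exhaustive boundary search for one “V-Word” might be sketched as follows; the representation (a “V-sequence” as a list of (word, role) pairs) follows the sketch in Section 3.1 and is an illustrative assumption:

def candidate_clauses(tags, v_index):
    """Enumerate candidate simple clauses containing the predicate at v_index.

    tags is a list of (word, role) pairs; a candidate clause is any span
    that contains v_index but no other word tagged 'V'.
    """
    v_positions = [i for i, (_, role) in enumerate(tags) if role == "V"]
    prev_v = max((p for p in v_positions if p < v_index), default=-1)
    next_v = min((p for p in v_positions if p > v_index), default=len(tags))
    for start in range(prev_v + 1, v_index + 1):
        for end in range(v_index + 1, next_v + 1):
            yield [w for w, _ in tags[start:end]]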

3.3. Get All the “SVO-Groups” for a Simple Sentence

Get all the words which might be the subject words ($S$) or the object words ($O$) by calculating the semantic relevancy: if the value of $f(W_i, V)$ is greater than a threshold, the word $W_i$ might be $S$ or $O$. It is then easy to get all the “SVO-groups” for a simple sentence, as shown in Figure 4; a sketch follows.
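A hedged sketch of the “SVO-group” search: every word whose relevancy to the predicate exceeds a threshold is kept as a candidate, and subject candidates (taken here, as an assumption, from the words before $V$) are paired with object candidates (words after $V$). The threshold value and the function f are placeholders:

def svo_groups(clause, v_index, f, threshold=0.5):
    """Return candidate (S, V, O) triples for a simple clause.

    f(w, v) is the pairwise semantic relevancy; words before the predicate
    are treated as subject candidates, words after it as object candidates.
    """
    v = clause[v_index]
    subjects = [w for w in clause[:v_index] if f(w, v) > threshold]
    objects = [w for w in clause[v_index + 1:] if f(w, v) > threshold]
    return [(s, v, o) for s in subjects for o in objects]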

3.4. Dividing a Simple Sentence into Segments

Generally, a sentence $T$ can be divided into several segments as in Figure 5.

In Figure 5, the $L$'s are the segments between and around $S$, $V$, and $O$; a segment may serve as a prepositive attributive, a postpositive attributive, or an adverbial. A sketch of the segmentation is given below.
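Given one “SVO-group,” the segmentation of Figure 5 can be sketched as a split on the positions of $S$, $V$, and $O$ (the keys used below are descriptive names, not the authors' symbols, and the usual $S < V < O$ word order is assumed):

def split_segments(clause, s_index, v_index, o_index):
    """Split a simple clause into the segments around S, V, and O.

    Assumes s_index < v_index < o_index; each returned segment may act as
    a prepositive attributive, a postpositive attributive, or an adverbial.
    """
    return {
        "before_S": clause[:s_index],
        "S_to_V": clause[s_index + 1:v_index],
        "V_to_O": clause[v_index + 1:o_index],
        "after_O": clause[o_index + 1:],
    }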

3.5. Turning the Segments into Simple Semantic Units

A segment between S, V, and O could be turned into several simple semantic units in semantic logic as in Figure 6.

For any simple semantic unit, there are the following semantic features:
(i) for any word $W_i$ in the unit, the word $W_j$ that $W_i$ semantically describes is in the same simple semantic unit;
(ii) in the semantic analysis process, a simple semantic unit can be treated as a whole, and its internal grammatical structure has no effect on the rest of the analysis.

3.6. WSD for Simple Semantic Units

Most words in natural language are polysemous, so a semantic description relation graph (SDRG) for a simple semantic unit can be created for WSD. In the SDRG, all the senses of a polysemous word form a “Generalized Vertex,” and each sense is a vertex in the “Generalized Vertex” set.

As shown in Figure 7, $W_{i,p}$ is the $p$th sense of $W_i$ and $W_{j,q}$ is the $q$th sense of $W_j$; if $W_j$ is semantically described by $W_i$ under these senses, then a directed edge between $W_{i,p}$ and $W_{j,q}$ should be created, and, in the word list, a directed edge from $W_i$ to $W_j$ should be created at the same time.

In Figure 7, there are the following key features:
(i) except for the final “Generalized Vertex,” every “Generalized Vertex” describes exactly one target, so the outdegree of any “Generalized Vertex” is 1;
(ii) within each “Generalized Vertex,” all the edges of a spanning tree must connect to the same vertex (i.e., only one sense is selected);
(iii) an SDRG for a given parsing method must be a spanning tree of the complete graph of all the “Generalized Vertices.”

Therefore, the SDRG of the best parsing method for a simple semantic unit must be the maximum spanning tree (MST) of the complete semantic description relation graph over all the “Generalized Vertices.” The specific details are discussed in [11]; a brute-force sketch is given below.
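As a hedged illustration of this step (the exact MST algorithm belongs to [11]), the following brute-force sketch picks one sense per word of a small semantic unit so that the summed relevancy over the description edges is maximal; senses, edges, and rel are hypothetical inputs:

from itertools import product

def disambiguate_unit(senses, edges, rel):
    """Pick one sense per word so the total relevancy over the description
    edges is maximal (brute force; practical only for small semantic units).

    senses: {word: [sense, ...]}   -- the "Generalized Vertices"
    edges:  [(w_i, w_j), ...]      -- w_i semantically describes w_j
    rel(s_i, s_j): relevancy between two concrete senses
    """
    words = list(senses)
    best_score, best_choice = float("-inf"), None
    for combo in product(*(senses[w] for w in words)):
        chosen = dict(zip(words, combo))
        score = sum(rel(chosen[a], chosen[b]) for a, b in edges)
        if score > best_score:
            best_score, best_choice = score, chosen
    return best_choice

For realistic sense counts, the MST construction or the approximate algorithm of [11] should be used instead of this enumeration.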

3.7. Get the Best Simple Clause Resolution Sequence

According to formula (1), we can calculate the semantic relevancy of each simple clause and sum up all the values. Since there are many different resolution sequences, each resolution sequence is searched and evaluated exhaustively during the calculation (see the sketch below). The resolution sequence with the best total semantic relevancy gives the best parsing in semantics.
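The exhaustive search over resolution sequences can be sketched with itertools.permutations; score_sequence is a hypothetical helper that resolves the clauses in the given order and returns the summed relevancy of formula (1):

from itertools import permutations

def best_resolution_sequence(clauses, score_sequence):
    """Try every order in which the simple clauses can be resolved and
    keep the order with the highest total semantic relevancy."""
    best_score, best_order = float("-inf"), None
    for order in permutations(range(len(clauses))):
        score = score_sequence([clauses[i] for i in order])
        if score > best_score:
            best_score, best_order = score, order
    return best_order, best_score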

After Steps 1 to 7, the semantic model is solved and each polysemous word is disambiguated.

4. The Computational Complexity

The key difficulty is the computational complexity, because exhaustive methods are used in each step. Is the method too complex to be practical?

Suppose a sentence contains $n$ words and $n_V$ “V-Words,” and each word has $s$ senses on average; the time complexity of each step is analyzed as follows (Figure 9).

Step 1. Consider $O(2^{n_V})$; each “V-Word” has only 2 possible grammatical functions.

Step 2. Consider $O((n/n_V)^2)$; because the average length between two “V-Words” is about $n/n_V$, the average length of a simple clause is on the same order.

Step 3. Consider $O(k^2)$, where $k$ is a constant. In theory, the time complexity is quadratic in the clause length; however, there is little loss of accuracy if a top-$k$ method is adopted.

Step 4. Consider $O(1)$; a simple sentence is divided into at most 4 segments besides $S$, $V$, and $O$.

Step 5. Consider $O(1)$; a segment is turned into at most 3 simple semantic units.

Step 6. Consider the cost of solving each semantic unit; since a segment is turned into at most 3 simple semantic units, the average length of a semantic unit is only a fraction of the clause length, and an approximate algorithm for this MST problem is discussed in [11].

Step 7. Consider $O(n_V!)$; there are at most $n_V$ simple clauses, so the maximum number of simple clause resolution sequences is $n_V!$.

The Time Complexity. The overall complexity is essentially the product of the per-step costs above. On average, $n_V$ is less than 5 and $s$ is less than 5; only the value of $n$ can be large, so the time complexity is not too high for practical calculation.
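As a rough, hedged illustration of why the exhaustive factors stay manageable, the snippet below evaluates the two combinatorial terms tied to $n_V$ under the typical values above (it assumes the Step 1 count $2^{n_V}$ and the Step 7 count $n_V!$ as reconstructed in this section):

from math import factorial

# Typical values from the analysis: n_V < 5, s < 5; only n grows with the text.
for n_v in range(1, 5):
    v_sequences = 2 ** n_v           # Step 1: number of "V-sequences"
    resolutions = factorial(n_v)     # Step 7: number of resolution orders
    print(f"n_V = {n_v}: {v_sequences} V-sequences, {resolutions} resolution orders")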

5. Experimental Results and Analysis

In the experiments, 200 Chinese sentences were selected, and HowNet was used as the lexical semantic library when calculating the semantic relevancy between two words (Windows XP; CPU: Xeon E5-2403, 2 GHz; memory: 8 GB).

From the experimental results (Table 1), we can see the following:
(i) the correct rate decreases with the length of the clause;
(ii) the computational complexity increases with the length of the clause;
(iii) the time for solving the semantic model grows on the order predicted by the analysis in Section 4, which means the computational complexity is acceptable in practice.

Using the same 200 Chinese sentences and the method in [1], we carried out comparative experiments; the results are shown in Table 2.

In both theory and practice, the correct rate decreases with the length of the clause (Figure 8), but the authors of [1] did not treat clauses of different lengths separately.

6. Summary

In this paper, a word sense disambiguation method for Chinese based on semantics calculation was studied; WSD is achieved by solving the semantic relevancy calculation model, and the relations between accuracy and time complexity were explored through experiments. However, the experimental data set was small and the accuracy was not high enough. These problems will be addressed in future research.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the Project of the National Characteristic Specialty of Computer Science and Technology (Grant no. 2009TS11576) and the Science and Technology Research Key Project of Education Department of Henan Province (Grant no. 13B520894).