Abstract

With the rapid growth of user-generated content on the Web, sentiment analysis has become a very active research area in data mining and natural language processing. As the most important indicators of sentiment, sentiment words that convey positive or negative polarity are instrumental for sentiment analysis. However, most existing methods for identifying the polarity of sentiment words treat polarity as a crisp (Cantor) set with only positive and negative classes and pay no attention to the fuzziness of the polarity intensity of sentiment words. To improve performance, we propose a fuzzy computing model to identify the polarity of Chinese sentiment words in this paper. This paper makes three major contributions. Firstly, we propose a method to compute the polarity intensity of sentiment morphemes and sentiment words. Secondly, we construct a fuzzy sentiment classifier and propose two different methods to compute the parameter of the fuzzy classifier. Thirdly, we conduct extensive experiments on four sentiment word datasets and three review datasets, and the experimental results indicate that our model performs better than state-of-the-art methods.

1. Introduction

With the advent of Web 2.0, user-generated content such as product reviews, status updates on social networking services, and microblogs is exploding on the Internet. The growing availability of subjective content makes extracting useful sentiment information from it a hot topic in natural language processing, web mining, and data mining [1]. Since the early 2000s, sentiment classification, as a special case of text classification, has attracted an increasing number of researchers and become a very active research area [2]. Sentiment words play a more important role in mining subjective content than in mining objective content [3]. As a prerequisite of sentiment classification, identifying the polarity of sentiment words is a key issue.

At present, there are mainly two types of methods for identifying the polarity of English sentiment words: one is based on a corpus; the other is based on a thesaurus. These two types of methods are also widely used for identifying the polarity of Chinese sentiment words. Both follow three main steps. The first step is to compute the similarity between sentiment words and positive reference words. The second step is to compute the similarity between sentiment words and negative reference words. The third step is to compare the two similarities based on the Cantor set and thereby acquire the polarity of the sentiment words [4].

These two types of methods simply divide sentiment words into two classes, positive or negative, without considering the polarity intensity of sentiment words or the fuzziness of that intensity [5]. Actually, different sentiment words of the same polarity have different polarity intensities. For example, "laugh" has a larger intensity than "smile." In order to distinguish the polarity intensity of different sentiment words, researchers have proposed methods that identify the polarity of Chinese sentiment words based on Chinese morphemes [6–10]. This type of method hypothesizes that a word's polarity is a function of its component morphemes and can improve performance [8]. To a certain extent, it overcomes the shortcoming that polarity intensity is ignored when identifying the polarity of Chinese sentiment words, but the fuzziness of the polarity intensity of sentiment words is still not considered. Due to the fuzziness of natural language and of sentiment categories, we should adopt a fuzzy set to describe the polarity of sentiment words instead of the Cantor set [11].

In order to overcome the above shortcomings and to improve accuracy as much as possible, we propose a fuzzy computing model to identify the polarity of Chinese sentiment words. Our model mainly includes two parts: one is calculating the polarity intensity of sentiment morphemes and sentiment words; the other is constructing a fuzzy classifier and computing the parameter of the fuzzy classifier. The contribution of this paper is mainly embodied in three aspects.

Firstly, based on three existing Chinese sentiment lexicons, we construct an unambiguous key sentiment lexicon and a key sentiment morpheme set. Then, we propose a method to compute the sentiment intensity of sentiment morphemes and sentiment words using the constructed sentiment lexicon and sentiment morpheme set.

Secondly, considering the fuzziness of sentiment intensity, we construct a fuzzy sentiment classifier and a corresponding classification function by means of fuzzy set theory and the principle of maximum membership degree. In order to improve performance, we further propose two different methods to learn the parameter of the fuzzy sentiment classifier.

Thirdly, we construct four sentiment word datasets to demonstrate the performance of our model. At the same time, we show that our model performs better than several state-of-the-art methods by applying it to sentiment classification on three review datasets.

This paper is organized as follows. We introduce related work in Section 2. Section 3 introduces the fuzzy computing model and its two key parts. In Section 4, we first build a key sentiment lexicon, a key sentiment morpheme set, and four sentiment word datasets and then conduct experiments to verify the performance of the fuzzy computing model. Finally, we summarize this paper, draw conclusions, and point out future research directions in Section 5.

2. Related Work

Sentiment classification is a hot topic in natural language processing and web mining, and a large number of research papers on sentiment classification have appeared since 2002 [1, 2]. Existing methods are mainly divided into two categories: machine learning methods and semantic orientation aggregation [12]. The machine learning methods include many traditional text classification methods [13], such as naive Bayes [14], support vector machines [15], and neural networks [16]. The second strategy uses sentiment words to classify features into positive and negative categories and then aggregates the overall orientation of a document [17, 18].

As a basic requirement of sentiment classification, identifying the polarity of sentiment words has been a research focus for many years. There are mainly three types of methods for identifying the polarity of Chinese sentiment words. The first is the thesaurus-based method, which computes the similarity between reference words and the given sentiment words by their distance in a thesaurus. The second is the corpus-based method, which computes the similarity between reference words and the given sentiment words by statistical methods over a corpus. The third is the morpheme-based method, which computes the polarity of sentiment words by combining the polarities of Chinese morphemes.

The thesaurus-based method acquires sentiment words mainly through synonyms, antonyms, and hierarchies in a thesaurus such as WordNet or HowNet [19–22]. These methods use some seed sentiment words and bootstrap via synonym and antonym relations in the thesaurus. Kamps et al. computed the polarity of sentiment words according to the distance between sentiment words and reference seed words in WordNet [23, 24]. Esuli and Sebastiani used the glosses of words to generate feature vectors and computed the polarity of sentiment words with a supervised classifier over the thesaurus [25, 26]. Dragut et al. proposed a bootstrapping method that uses a set of inference rules to compute the sentiment polarity of words [27].

The kernel of the corpus-based method is to calculate the similarity between reference words and sentiment words in a corpus. This approach carries an implicit hypothesis that sentiment words share the polarity of the reference words with which they co-occur most frequently and have the opposite polarity of the reference words with which they co-occur least frequently. The polarity of sentiment words is assigned by computing cooccurrence statistics in the corpus [28–35].

The most classic method is the pointwise mutual information (PMI) method proposed by Turney [30, 31]. This method computes the polarity of a given sentiment word by subtracting the mutual information of its association with a set of negative sentiment words from the mutual information of its association with a set of positive sentiment words. The mutual information values depend on statistics computed over a given corpus.
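For reference, Turney's semantic orientation score can be written as follows; the set names Pwords and Nwords for the positive and negative reference words are our notation.

```latex
% Pointwise mutual information and Turney's semantic orientation score.
% Pwords / Nwords denote the positive / negative reference word sets.
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}, \qquad
\mathrm{SO}(w) = \mathrm{PMI}(w, Pwords) - \mathrm{PMI}(w, Nwords)
```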

Different from Turney, Yu and Hatzivassiloglou used more seed words and the log-likelihood ratio to compute the similarity [32]. Kanayama and Nasukawa used a set of intrasentential and intersentential linguistic rules to identify the polarity of sentiment words from a corpus [34]. Huang et al. proposed a method for automatically constructing a domain-specific sentiment lexicon [28]. Ghazi et al. used sentential context to identify context-sensitive sentiment words [29].

Some researchers calculated the polarity of sentiment words by combining a corpus with a thesaurus [36, 37]. Xu et al. presented a method to capture the polarity of sentiment words by using a graph-based algorithm and multiple resources [36]. Peng and Park computed the polarity of sentiment words by constrained symmetric nonnegative matrix factorization [37]. This method finds a set of candidate sentiment words by bootstrapping in a dictionary and then uses a large corpus to assign a sentiment polarity score to each word.

Taking into consideration the characteristics of Chinese characters, some researchers have proposed morpheme-based methods [6–10]. Building on Turney's work, Yuen et al. proposed a method that calculates the similarity between reference morphemes and sentiment words in a corpus to obtain the polarity of sentiment words [6]. Experimental results demonstrated better performance than Turney's method in identifying the polarity of Chinese sentiment words. Ku et al. proposed a bag-of-characters method, which computes the polarity intensity of sentiment words from morpheme statistics and then compares the polarity intensity with a single threshold of 0 to identify the polarity of sentiment words [10]. Ku et al. further considered eight types of morphemes and built a machine learning classifier for Chinese word-level sentiment classification [8]. They showed that using word structural features can improve performance in word-level sentiment classification.

The existing three types of methods share a common hypothesis that the polarity of sentiment words is certain. However, some studies have shown that the polarity of sentiment words is fuzzy to some extent [5]. Therefore, it is not suitable to identify the polarity of sentiment words with either-or methods. To this end, we propose a fuzzy computing model to identify the polarity of Chinese sentiment words.

Fuzzy set theory has already been applied to sentiment classification, mainly at the document and sentence levels [9, 12]. For example, Wang et al. proposed an ensemble learning method that predicts consumer sentiment with an online sequential extreme learning machine and intuitionistic fuzzy sets, which is a supervised method [12]. Fu and Wang proposed an unsupervised method using fuzzy sets for sentiment classification of Chinese sentences [9]. Different from the above methods, we focus on word-level sentiment classification and propose a fuzzy computing model, an unsupervised framework for identifying the polarity of Chinese sentiment words.

3. Fuzzy Computing Model

3.1. General Framework

Existing methods for identifying the polarity of sentiment words divide sentiment words into two crisp classes, positive or negative, using the Cantor set, and the fuzziness of the polarity intensity of sentiment words is not considered. In order to overcome these shortcomings and improve accuracy, we propose a fuzzy computing model (FCM) for identifying the polarity of Chinese sentiment words. The general framework of FCM is described in Figure 1.

The notations KSL, KMS, m_i, w_j, w_k, PI(m_i), PI(w_j), PI(w_k), λ, and F(w) used in Figure 1 are defined in the Notations section.

The general framework of FCM consists of three parts: the sentiment word datasets, a key sentiment lexicon (KSL) and a key sentiment morpheme set (KMS), and the central FCM. The sentiment word datasets are test datasets for verifying the performance of FCM in identifying the polarity of Chinese sentiment words. KSL and KMS are the basic thesauruses of FCM. KSL consists of a positive key sentiment word list (KSL_p) and a negative key sentiment word list (KSL_n). The central FCM is composed of two key parts.

The first part includes computing the polarity intensity PI(m_i) of each sentiment morpheme m_i in KMS, the polarity intensity PI(w_j) of each sentiment word w_j in KSL, and the polarity intensity PI(w_k) of each sentiment word w_k in the sentiment word datasets. We compute PI(m_i) based on the frequency with which the sentiment morpheme m_i appears in KSL_p and the frequency with which it appears in KSL_n. After getting PI(m_i), we divide each sentiment word into morphemes and compute PI(w_j) and PI(w_k) based on PI(m_i).

The second part is to construct a classification function F(w) of the fuzzy classifier and to compute the parameter λ in F(w). Firstly, we define fuzzy sets and membership functions for the positive and negative categories. Secondly, based on the principle of maximum membership degree, we construct F(w). Thirdly, we propose two different methods to determine λ, based on the average polarity intensity of sentiment words (APIOSW) in the different sentiment word datasets and the APIOSW in KSL. The two key components of FCM are described in detail below.

3.2. Computing Polarity Intensity of Sentiment Morphemes and Sentiment Words

Based on KSL and KMS, we calculate PI(m_i) for each morpheme in KMS. With PI(m_i) available, we compute PI(w_j) for words in KSL and PI(w_k) for words in the sentiment word datasets. The whole computational process consists of three main steps.

(1) Firstly, for each sentiment morpheme m_i in KMS, we compute the positive frequency f_p(m_i) of m_i appearing in KSL_p and the negative frequency f_n(m_i) of m_i appearing in KSL_n according to (1). Here f_p(m_i) is the frequency of m_i appearing in KSL_p and f_n(m_i) is the frequency of m_i appearing in KSL_n; N_p(m_i) is the number of positive sentiment words that contain the morpheme m_i and N_n(m_i) is the number of negative sentiment words that contain m_i; N_p is the number of sentiment words in KSL_p and N_n is the number of sentiment words in KSL_n.

(2) Secondly, for each sentiment morpheme m_i in KMS, we use the proportion of f_p(m_i) relative to f_p(m_i) and f_n(m_i) to compute the positive polarity intensity PI_p(m_i), the negative polarity intensity PI_n(m_i), and the polarity intensity PI(m_i) by (2).

(3) Thirdly, for each sentiment word w_j in KSL and w_k in the sentiment word datasets, we calculate PI(w_j) and PI(w_k) from the polarity intensities of their morphemes by (3). Here number(w_j) is the number of morphemes included in the sentiment word w_j and number(w_k) is the number of morphemes included in the sentiment word w_k.
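The display equations (1)–(3) are not reproduced in this version of the text. The following LaTeX is a plausible reconstruction based solely on the definitions above; in particular, the difference form of PI(m_i) and the averaging in (3) are our assumptions.

```latex
% (1) Positive and negative frequencies of morpheme m_i over KSL (reconstruction):
f_p(m_i) = \frac{N_p(m_i)}{N_p}, \qquad f_n(m_i) = \frac{N_n(m_i)}{N_n}

% (2) Positive, negative, and overall polarity intensity of m_i (reconstruction;
%     the difference form of PI(m_i) is an assumption):
PI_p(m_i) = \frac{f_p(m_i)}{f_p(m_i) + f_n(m_i)}, \quad
PI_n(m_i) = \frac{f_n(m_i)}{f_p(m_i) + f_n(m_i)}, \quad
PI(m_i) = PI_p(m_i) - PI_n(m_i)

% (3) Polarity intensity of a word as the average over its component morphemes:
PI(w_j) = \frac{1}{\mathrm{number}(w_j)} \sum_{m_i \in w_j} PI(m_i), \qquad
PI(w_k) = \frac{1}{\mathrm{number}(w_k)} \sum_{m_i \in w_k} PI(m_i)
```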

3.3. Constructing Fuzzy Classifier and Computing Parameter of Fuzzy Classifier

After getting PI(m_i), PI(w_j), and PI(w_k), we firstly define two membership functions of the fuzzy classifier for the positive and negative categories. Secondly, we build a classification function of the fuzzy classifier by the principle of maximum membership degree. Thirdly, we determine an optimal fixed parameter λ of the fuzzy classifier by experimenting on KSL. Fourthly, we compare the APIOSW of the different sentiment word datasets with the APIOSW of KSL to obtain a different optimal parameter λ_i of the fuzzy classifier for each sentiment word dataset.

3.3.1. Defining Membership Function of Fuzzy Classifier

In order to identify the polarity of sentiment words with FCM, we choose a semitrapezoidal distribution to define the membership functions of the positive category and the negative category in the fuzzy classifier, as given in (4).

Here w is a sentiment word, PI(w) is the polarity intensity of w, and a and b are adjustable parameters that decide the region and shape of the membership functions of the fuzzy classifier.

We would need to set values for the parameters a and b in the membership functions of the fuzzy classifier to identify the polarity of sentiment words. In FCM, however, we do not set values for a and b separately but instead simplify the two parameters to a single parameter λ.
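Equation (4) is likewise not shown in this version. A common semitrapezoidal choice consistent with the description is sketched below; the exact functional form, and hence the relation between a, b, and λ, is an assumption.

```latex
% (4) Assumed semitrapezoidal membership functions for the positive and
%     negative categories, with adjustable parameters a < b:
\mu_{pos}(w) =
\begin{cases}
0, & PI(w) \le a \\
\frac{PI(w) - a}{b - a}, & a < PI(w) < b \\
1, & PI(w) \ge b
\end{cases}
\qquad
\mu_{neg}(w) =
\begin{cases}
1, & PI(w) \le a \\
\frac{b - PI(w)}{b - a}, & a < PI(w) < b \\
0, & PI(w) \ge b
\end{cases}
```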

3.3.2. Constructing Classification Function of Fuzzy Classifier

After computing the polarity intensity of sentiment words and based on the membership functions in (4), we determine the polarity of sentiment words according to the principle of maximum membership degree. Finally, we obtain (5) as the classification function of the fuzzy classifier for identifying the polarity of sentiment words.

In (5), PI(w) is the polarity intensity of the sentiment word w. We can determine the polarity of sentiment words simply by setting a value for the parameter λ.
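Equation (5) is also omitted in this version. Under the semitrapezoidal forms assumed in the sketch above, the principle of maximum membership degree reduces to a single threshold test with λ = (a + b)/2; this reduction is our reconstruction rather than a verbatim restatement of the original formula.

```latex
% (5) Classification function implied by the maximum membership principle
%     (reconstruction): mu_pos(w) >= mu_neg(w)  iff  PI(w) >= (a + b)/2 =: lambda.
F(w) =
\begin{cases}
\text{positive}, & PI(w) \ge \lambda \\
\text{negative}, & PI(w) < \lambda
\end{cases}
\qquad \lambda = \frac{a + b}{2}
```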

3.3.3. Setting the Parameter λ in the Classification Function of the Fuzzy Classifier

In order to determine the parameter λ in the classification function of the fuzzy classifier, we propose two different methods. One is a fixed parameter method, where λ is selected by experimenting on KSL; the other is a variable parameter method, where λ is selected by comparing the APIOSW of the different sentiment word datasets with the APIOSW of KSL. The two methods are described as follows.

(1) Fixed Parameter Method. The parameter λ controls the threshold for identifying the polarity of sentiment words in the fuzzy classifier. We can choose λ by experimenting on KSL: we compare the performance of FCM with different values of λ and then choose the best-performing value as the fixed parameter in FCM. The experimental results and the specific parameter value are shown in Figure 2.

(2) Variable Parameter Method. By using the experimental results on KSL to estimate the value of λ, we can only obtain a locally optimal parameter for KSL. In order to obtain a globally optimal parameter for each of the different sentiment word datasets, we propose a variable parameter method, described as follows.

For each sentiment word dataset D_i, which consists of a positive sentiment word list D_i^p and a negative sentiment word list D_i^n, we define APIOSW(D_i) in (6).

Here N(D_i^p) is the number of sentiment words in D_i^p, N(D_i) is the number of sentiment words in D_i, and PI(w_k) is the polarity intensity of the sentiment word w_k in D_i. Similar to APIOSW(D_i), we define the APIOSW of KSL in (7).

Here N(KSL_p) is the number of sentiment words in KSL_p, N(KSL) is the number of sentiment words in KSL, and PI(w_j) is the polarity intensity of the sentiment word w_j in KSL.
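Equations (6) and (7) are not displayed in this version either. One plausible reading of APIOSW, a signed average of polarity intensities over all words in a dataset, is given below; because the text also refers to the size of the positive list, the original formula may split the sum by polarity, so this form is only an assumption.

```latex
% (6) Average polarity intensity over a sentiment word dataset D_i (assumed form):
\mathrm{APIOSW}(D_i) = \frac{1}{N(D_i)} \sum_{w_k \in D_i} PI(w_k)

% (7) The analogous quantity over the key sentiment lexicon:
\mathrm{APIOSW}(\mathrm{KSL}) = \frac{1}{N(\mathrm{KSL})} \sum_{w_j \in \mathrm{KSL}} PI(w_j)
```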

For each D_i, we calculate the difference in APIOSW between D_i and KSL. Based on this difference and the fixed parameter λ obtained on KSL, we adjust the parameter to obtain a different optimal parameter λ_i for each D_i. The specific adjustment is given in (8).

Here λ is the fixed parameter obtained by the fixed parameter method above.
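Equation (8) is omitted as well. An additive adjustment that is consistent with "based on the difference and the fixed parameter" would be the following; it is an assumption, not necessarily the authors' exact rule.

```latex
% (8) Dataset-specific parameter obtained by shifting the fixed parameter
%     by the APIOSW difference (assumed form):
\lambda_i = \lambda + \bigl(\mathrm{APIOSW}(D_i) - \mathrm{APIOSW}(\mathrm{KSL})\bigr)
```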

4. Performance Evaluation

To verify the performance of FCM experimentally, we firstly construct KSL and KMS. Secondly, we construct four sentiment word datasets as test datasets and choose the standard classification indicators precision, recall, F-measure, and accuracy as metrics to evaluate the performance of the baseline methods and FCM. Thirdly, we conduct experiments on KSL and compare the APIOSW of the different sentiment word datasets with the APIOSW of KSL to find the optimal parameter. Fourthly, we compare the performance of the different methods and demonstrate the effectiveness of FCM. Fifthly, we discuss the influence of the parameter λ on the accuracy of FCM. Sixthly, in order to demonstrate the effect of our methods in a real task, we apply our methods and the baseline methods to sentiment classification of reviews. The experimental results confirm the validity of our methods.

4.1. Constructing KSL and KMS

Based on three Chinese sentiment lexicons, the Tsinghua University sentiment lexicon (TUSL), the National Taiwan University sentiment lexicon (NTUSL), and HowNet, we construct KSL. When constructing KSL, we make the implicit assumption that there are some sentiment words whose polarity is ambiguous among the different sentiment lexicons. We denote TUSL by L_1, NTUSL by L_2, and HowNet by L_3. For each lexicon L_i, which consists of a positive sentiment word list PL_i and a negative sentiment word list NL_i, we find the sentiment words whose polarity is ambiguous. Table 1 presents the number of sentiment words whose polarity is inconsistent between L_i and L_j.

From Table 1, we can see that there are indeed some sentiment words whose polarity is ambiguous between L_i and L_j, which confirms the assumption that the polarity of sentiment words is not always consistent across different sentiment lexicons. Therefore, we delete these ambiguous sentiment words from L_i and L_j. Finally, we construct KSL by choosing the sentiment words that are contained in at least two sentiment lexicons and are unambiguous in polarity. The method of constructing KSL is given in (9).

Here i, j ∈ {1, 2, 3} with i ≠ j, PL_i is the positive sentiment word list of L_i, and NL_i is the negative sentiment word list of L_i. We compute KSL by (9). The numbers of sentiment words in L_1, L_2, L_3, and KSL are shown in Table 2.

From KSL, we delete the words whose length is greater than two characters and then split the remaining words into morphemes. Finally, we collect these morphemes to construct KMS.
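To make the construction procedure concrete, the following Python sketch builds KSL and KMS from three lexicons under the assumptions that each lexicon is given as a dict mapping a word to "pos" or "neg" and that every character of a one- or two-character word counts as one morpheme; the function and variable names are ours.

```python
# Sketch of KSL/KMS construction; input format and names are assumptions.
from itertools import combinations

def build_ksl(lexicons):
    """Keep words that occur in at least two lexicons with a consistent polarity."""
    ambiguous = set()
    for l1, l2 in combinations(lexicons, 2):
        for w in set(l1) & set(l2):
            if l1[w] != l2[w]:          # polarity disagrees between two lexicons
                ambiguous.add(w)
    ksl = {}
    for l1, l2 in combinations(lexicons, 2):
        for w in set(l1) & set(l2):
            if w not in ambiguous:      # shared by two lexicons and unambiguous
                ksl[w] = l1[w]
    return ksl

def build_kms(ksl):
    """Split the one- and two-character KSL words into component morphemes."""
    morphemes = set()
    for w in ksl:
        if len(w) <= 2:                 # drop words longer than two characters
            morphemes.update(w)         # each character is treated as one morpheme
    return morphemes
```

Note that a word shared by two lexicons but contradicted by the third is still excluded through the ambiguous set, which matches the requirement that KSL contain only words that are unambiguous in polarity.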

4.2. Experimental Setting

In order to verify the performance of our model, we build four sentiment word datasets based on TUSL, NTUSL, and HowNet. To ensure that the polarity of sentiment words in the four datasets is unambiguous, we delete the sentiment words whose polarity is ambiguous among the three sentiment lexicons. To ensure that the sentiment words in the four datasets are independent of the sentiment words in KSL, we also delete the sentiment words of KSL from the three sentiment lexicons. The specific procedure is described as follows.

For each L_i, which consists of PL_i and NL_i, we build sentiment word dataset1, dataset2, and dataset3 according to (10).

With the same method, we construct a much larger dataset4 in (11).

Each sentiment word dataset consists of a positive sentiment word list and a negative sentiment word list. In the end, we obtain four sentiment word datasets (http://203.91.121.76/Datasets/), which are summarized in Table 3.
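In set notation, the construction described in (10) and (11) amounts to the following, where Amb denotes the words with inconsistent polarity across the three lexicons (our notation); the exact original formulas are not reproduced here, so this is a reconstruction.

```latex
% (10) Test datasets: each lexicon minus KSL words and ambiguous words (reconstruction):
\mathrm{dataset}_i = L_i \setminus (\mathrm{KSL} \cup \mathrm{Amb}), \qquad i = 1, 2, 3

% (11) The larger combined dataset:
\mathrm{dataset}_4 = \mathrm{dataset}_1 \cup \mathrm{dataset}_2 \cup \mathrm{dataset}_3
```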

Since our task is identifying the polarity of sentiment words, we choose the standard classification indicators precision (P), recall (R), F-measure (F), and accuracy (Acc) as metrics. The four indicators are defined as follows.

Here TP, FP, FN, and TN are defined in Table 4.
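The four indicators have their standard definitions in terms of the counts TP, FP, FN, and TN of Table 4:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}, \qquad
Acc = \frac{TP + TN}{TP + TN + FP + FN}
```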

To evaluate the overall performance of our model in identifying the polarity of Chinese sentiment words, we compare our model with the thesaurus-based method and the morpheme-based method on four different sentiment word datasets. Our model and the baseline methods are described as follows:

MBOT: the method based on thesaurus [27];
MBOM: the method based on morpheme [10];
FCMWFP: the fuzzy computing model with fixed parameter, described in Section 3.3.3;
FCMWVP: the fuzzy computing model with variable parameter, described in Section 3.3.3.

In order to further demonstrate the effect of our methods and highlight the contribution of our work, we design the following experiments. Firstly, for each sentiment word dataset, we construct four different sentiment lexicons in which the polarities of sentiment words are assigned by the four different methods. Secondly, we choose three Chinese review datasets, which are provided by Songbo Tan (http://203.91.121.76/Datasets/). Each review dataset R_i consists of both positive reviews and negative reviews. The basic statistics of these three review datasets are summarized in Table 5. Thirdly, for each sentiment word dataset, we compare the sentiment classification results on the three review datasets based on the four different sentiment lexicons, which correspond to the four different methods. These sentiment lexicons are described as follows:

SL-MBOT_i: the sentiment lexicon corresponding to MBOT and sentiment word dataset i;
SL-MBOM_i: the sentiment lexicon corresponding to MBOM and sentiment word dataset i;
SL-FCMWFP_i: the sentiment lexicon corresponding to FCMWFP and sentiment word dataset i;
SL-FCMWVP_i: the sentiment lexicon corresponding to FCMWVP and sentiment word dataset i.
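The paper does not spell out here how a lexicon is turned into a review-level decision. The Python sketch below shows one simple lexicon-based scheme, summing the polarity intensities of matched words with no negation or intensifier handling; it is only an illustration of how the four lexicons could be applied, not the authors' exact procedure.

```python
# Hypothetical lexicon-based review classifier; the aggregation rule is an assumption.
def classify_review(tokens, lexicon, threshold=0.0):
    """tokens: segmented Chinese words of one review; lexicon: word -> polarity intensity."""
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return "positive" if score >= threshold else "negative"

def accuracy(reviews, labels, lexicon):
    """Accuracy over (tokenized review, gold label) pairs, labels in {'positive', 'negative'}."""
    correct = sum(classify_review(toks, lexicon) == gold
                  for toks, gold in zip(reviews, labels))
    return correct / len(reviews)
```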

We conduct extensive experiments on the four sentiment word datasets and the three review datasets to address four problems:

(1) discuss how to set the parameter λ in the classification function of the fuzzy classifier;
(2) study the performance of our model in identifying the polarity of Chinese sentiment words;
(3) analyse the effect of different values of the parameter λ on the accuracy of our model;
(4) validate the effect of the sentiment lexicons created by our methods in sentiment classification of documents.

4.3. Setting the Parameter λ in the Classification Function of the Fuzzy Classifier

In FCM, the parameter λ in the classification function of the fuzzy classifier needs to be set. We conducted experiments on KSL to find the optimal value of λ. Figure 2 shows the performance of MBOM and FCMWFP for different values of λ.

From Figure 2, we can see that the performance of FCMWFP is best when λ is selected near 0.05, so we choose λ = 0.05 in FCMWFP. After fixing the value of λ, following Section 3.3.3, we calculate the APIOSW of the different datasets by (6) and (7). Finally, we compute the value of the parameter λ_i by (8). Table 6 summarizes the APIOSW and λ_i of KSL and the four sentiment word datasets.

4.4. Performance of Different Methods in Identifying Polarity of Chinese Sentiment Words

In order to verify the performance of FCM, we choose MBOT and MBOM as baselines. FCM comprises FCMWFP and FCMWVP. We compare FCM with MBOT and MBOM on the four sentiment word datasets. The experimental results are shown in Table 7 and Figure 3.

From Table 7 and Figure 3, we can see that the accuracy of MBOT is slightly higher than that of MBOM. The average accuracies of MBOT and MBOM over the four datasets are 0.7351 and 0.7327, while the average accuracies of FCMWFP and FCMWVP are 0.7661 and 0.7835, respectively. FCMWFP improves accuracy over MBOM by about 4.6% (relative), and FCMWVP improves accuracy over MBOM by about 6.9% (relative) in identifying the polarity of Chinese sentiment words. This demonstrates that our model is more effective than MBOM and MBOT in identifying the polarity of Chinese sentiment words.

At the same time, we can see that FCMWVP performs better than FCMWFP, which supports our assumption that FCMWFP only yields a locally optimal parameter, whereas FCMWVP obtains an approximately globally optimal one.

4.5. Accuracy of FCM with Different Values of the Parameter λ

In order to explore the effect of different values of the parameter λ on the accuracy of FCM in identifying the polarity of Chinese sentiment words, we conduct experiments with different values of λ on the four sentiment word datasets. The results are shown in Figure 4.

From Figure 4, we can see that different values of the parameter λ have different effects on the accuracy of FCM. When a suitable λ is chosen, FCM always achieves higher accuracy than MBOT and MBOM in identifying the polarity of Chinese sentiment words.

4.6. Performance of Different Sentiment Lexicons in Sentiment Classification of Chinese Reviews

In order to further verify the feasibility of our methods, we apply the four different sentiment lexicons to sentiment classification of Chinese reviews. For each sentiment word dataset, we compare the sentiment classification results on the three review datasets based on the four different sentiment lexicons. The experimental results are shown in Tables 8, 9, 10, and 11.

From Tables 8, 9, 10, and 11, we can see that the accuracies of SL-FCMWFP_i and SL-FCMWVP_i are higher than the accuracies of SL-MBOT_i and SL-MBOM_i in sentiment classification of Chinese reviews. The results show that methods which consider fuzzy sentiment are more effective than methods that consider only either-or sentiment.

At the same time, we can see that SL-FCMWVP_i performs better than SL-FCMWFP_i in sentiment classification of Chinese reviews, which shows that our variable parameter method is more effective than our fixed parameter method.

5. Conclusion

In this paper, we propose a fuzzy computing model for identifying the polarity of Chinese sentiment words by combining the polarity intensity of Chinese morphemes with fuzzy set theory. Based on the assumption that the polarity of a Chinese sentiment word is a function of its Chinese morphemes, we compute the polarity intensity of sentiment words from the known polarity intensities of morphemes. After studying three existing sentiment lexicons, we find that there is fuzziness among some of the sentiment words; that is to say, some sentiment words have different sentiment polarities in different lexicons. We define the polarity of sentiment words as a fuzzy set and identify the polarity of sentiment words by the principle of maximum membership degree. In order to verify the performance of our model, we build four sentiment word datasets and compare our model with the baseline methods on them. The experimental results show that our model performs better than the state-of-the-art methods.

Our methods suggest several possible research directions. Due to the fuzziness of sentiment polarity in natural language, sentiment analysis problems can be addressed with fuzzy set theory, and our model demonstrates the effectiveness of fuzzy computing in sentiment word classification. Next, we plan to apply fuzzy set theory to sentence-level and document-level sentiment classification.

Notations

KMS: Key sentiment morpheme set
KSL: Key sentiment lexicon
m_i: Sentiment morpheme in KMS
w_j: Sentiment word in KSL
PI(m_i): Polarity intensity of sentiment morpheme m_i in KMS
PI(w_j): Polarity intensity of sentiment word w_j in KSL
w_k: Sentiment word in the sentiment word datasets
PI(w_k): Polarity intensity of sentiment word w_k in the sentiment word datasets
λ: Parameter in the classification function of the fuzzy classifier
F(w): Classification function of the fuzzy classifier.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank Qingqing Zheng, Qing Cheng, and the anonymous reviewers for helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grants U1405254, 61472092, 61402115, and 61271392.