Special Issue: Smart Data: Where the Big Data Meets the Semantics
Automatic Construction and Global Optimization of a Multisentiment Lexicon
Manual annotation of sentiment lexicons is labor-intensive and time-consuming, and it is difficult to quantify emotional intensity accurately by hand. Moreover, the excessive emphasis on one specific field has greatly limited the applicability of domain sentiment lexicons (Wang et al., 2010). This paper trains a neural network language model on a large-scale Chinese corpus and proposes an automatic method of constructing a multidimensional sentiment lexicon based on constraints on coordinate offsets. To distinguish the sentiment polarities of words that may express either positive or negative meanings in different contexts, we further present a sentiment disambiguation algorithm that increases the flexibility of our lexicon. Lastly, we present a global optimization framework that provides a unified way to combine several human-annotated resources for learning our 10-dimensional sentiment lexicon SentiRuc. Experiments show the superior performance of the SentiRuc lexicon in the category labeling test, the intensity labeling test, and sentiment classification tasks. Notably, in the intensity labeling test, SentiRuc outperforms the second-best lexicon by 21 percent.
1. Introduction
Opinion mining and sentiment analysis of online text have become a hot research area in recent years, which includes opinion summarization and sentiment classification. Most of these tasks would benefit from a high quality sentiment lexicon which could provide excellent sentiment features when no training data is available.
The primary form of sentiment lexicons is binary annotation with positive and negative labels, such as SentiWordNet developed by the Italian Information Technology Research Institute [1, 2], the Chinese general sentiment lexicon (NTUSD) annotated by Taiwan University, the Chinese emotion dictionary from the Chinese Academy of Sciences, and English Xsimilarity. Multiple-sentiment lexicons that also assign sentiment strength have been constructed as well, such as the Affective Lexicon Ontology of Dalian University of Technology (DUT Ontology). To determine the word-level strength of sentiment, manual methods, supervised methods employing WordNet or other semantic resources, and unsupervised approaches based on large-scale corpora have been proposed. However, few works have evaluated and optimized the accuracy of intensity annotation by introducing all possible linguistic heuristics.
In recent years, driven by diverse tasks in different fields, both the polarity word and its related target are included as a sentiment item. However, the application areas of such 2-tuple lexicons as 〈polarity word, target〉 are strictly limited to one specific field, and also the size of such lexicons could easily explode with the growth of training data, which causes the problem of sparseness of features. Massive online text makes the limitations of domain sentiment lexicons increasingly apparent, especially when the sentiment classification tasks vary in different areas. Thus, a general and adaptable lexicon is important for sentiment analysis to avoid this problem.
This paper presents a method for the automatic construction and optimization of a multisentiment lexicon through statistical analysis of a massive online corpus. The main content of this paper is as follows. First, we use a neural network language model to obtain distributed representations of words from a massive online corpus (Sogou News Corpus, 3.17 GB). Second, we study the categorization of sentiments and select seed words for each category. After that, polarity words are selected and the semantic distances between polarity words and seed words are calculated from the distributed representations. The distance values are then converted into sentiment intensities through appropriate constraints. Finally, we evaluate the lexicon by combining linguistic heuristics in an optimization framework. In addition, we study a sentiment tendency disambiguation method to improve the semantic description capability of our lexicon.
The remainder of this paper is organized as follows. In Section 2, we introduce related work. The principle of automatic construction of SentiRuc is proposed in Section 3. Section 4 introduces a unified optimization framework. Experiments and evaluations are reported in Section 5. We conclude the paper in Section 6 and outline future research.
2. Related Work
Many Chinese sentiment lexicons, such as NTUSD, HowNet, and the DUT Affective Lexicon Ontology, are manually annotated to ensure the lexicon’s coverage and effectiveness. But manual methods usually cost too much labor and time and tend to be subjective; coverage is also a concern. To provide finer granularity, it is necessary to introduce a statistical language model to annotate sentiment category and intensity automatically.
To label the sentiments, we should first study the sentiment categorization. As early as 1957, Osgood divided human emotion into three aspects: strong and weak, good and bad, and active and passive. In 2012, Liu et al. presented the DUT Affective Lexicon Ontology, which contains 7 sentiments: happiness, liking, anger, sadness, hate, fear, and surprise. Quan Changqin constructed Ren-CECps with 8 kinds of sentiments: expectation, joy, love, surprise, anxiety, sorrow, angry, and hate. But existing categorizations of emotions are asymmetric. For example, there is no opposite emotion of “surprise” or “fear,” which can cause inconvenience in feature extraction and selection in supervised methods of sentiment analysis. Besides, there is coupling between emotion categories, such as “praise” and “like.” Therefore, sentiment categorization needs to be investigated so that it suits both psychology and computational linguistics.
In addition to qualitative labeling, the sentiment intensity needs to be annotated quantitatively. Many existing lexicons are manually annotated, including WordNet, General Inquirer, and HowNet. To avoid the low efficiency and subjectivity of manual work, bootstrapping methods have been widely used. It is usually assumed that several seed words of known polarities are provided, and different heuristics are adopted as the propagation strategy to infer the unknown sentiment polarities of other words. He sent the entries of HowNet into Google search and selected seed words according to the count of search results. Li et al. introduced PageRank to determine the polarity of words. Each word is taken as a node in a graph, and HowNet is used to calculate the semantic similarity between seed words and candidate words as edge weights. The performance of these supervised methods is dependent on, or limited by, the accuracy of the third-party tools or data. One possible way to solve this problem is to use unsupervised methods to obtain sentiment intensity from other corpora or semantic resources. Colace et al. construct the Mixed Graph of Terms by extracting 2-tuples like 〈side, evaluation〉 from comment text, and each item’s intensity is inferred according to the domain knowledge in the Mixed Graph. Mukkamala et al. define the expression of emotion as a fuzzy set of 4 elements 〈topic, keyword, object, tendency〉. The relation strength of every 2 sets is determined through a membership function based on set theory and fuzzy logic. Turney and Littman propose a semantic classification method based on emotional phrases. They first extract the adjective or adverb phrases according to several defined templates and then calculate the mutual information between words and phrases to determine the tendency and intensity of sentiment words.
These unsupervised methods provide useful experience, but they still depend heavily on the accuracy of choosing, recognizing, and extracting various kinds of relationships among sentiment items.
Therefore, it is important to optimize the collection of sentiment lexicon entries and the intensity labeling. Chen et al. construct a polar square error function to decide whether two entries have the same sentiment tendency and present an iterative expansion method. Turney and Littman try to rationalize the intensity assignment by comparing the cooccurrence parameters of entries and seed words. Wang et al. and Jo and Oh both take tendency annotation as a by-product of the sentiment classification task and do not evaluate the annotation’s quality. Some scholars try to introduce synonymous or antonymous relationships into the evaluation framework to optimize the intensity labeling [19, 20]. Compared with our work, the optimization frameworks of the works mentioned above are relatively simple and do not take into consideration multisentiment words, which may express different tendencies in different contexts.
Considering the above points, this paper presents an unsupervised model of automatic construction of a multisentiment lexicon based on the WLI neural network language model and a global optimization framework. The main contributions of this paper are as follows:
(1) We propose a new categorization of human emotions, which makes the linguistic features more suitable for computational analysis.
(2) We define the converting constraint set of distance and sentiment intensity and present an automatic construction model based on the WLI language model.
(3) We present a global optimization framework based on several manually annotated semantic resources, to improve the semantic description of our lexicon SentiRuc.
3. Automatic Construction of SentiRuc
In this section, we present the “5 pairs with 10 polarities” categorization of human emotions and automatically annotate the multisentiment lexicon SentiRuc by defining the converting constraint set of distance and sentiment intensity. We also investigate the emotional disambiguation of multiple affective words in this section.
We integrated the entries of NTUSD dictionary, HowNet lexicon, and the DUT Ontology as the entries of our SentiRuc lexicon, which contains a total of 14250 emotional words.
3.1. WLI Language Model and the Categorization of Human Emotions
Traditional binary sentiment labeling can no longer meet the needs of current sentiment analysis tasks. The primary work of multiple sentiment labeling is the categorization of human emotions. Section 2 has discussed relevant achievements and existing problems. This paper takes the achievements of psychology, linguistic theory, and computational characteristics into consideration and divides human emotions into 10 categories arranged in 5 pairs: happy-sad, like-hate, believable-unexpected, gratitude-angry, and complimentary-critical. Each pair contains 2 opposite sentiment polarities. Our goal is to annotate each sentiment word W with a 10-dimensional sentiment vector Senti(W), in which the value of each dimension represents the intensity of the corresponding sentiment tendency. In the following research, the 10 words are directly adopted as seed words.
Words carry very rich meanings, and statistical language models are used to extract those semantic features. Given a corpus, a neural network language model can map words into a high-dimensional continuous space. Word2Vec, a deep-learning-based tool released by Google in 2013, adopts two main language models: the continuous bag-of-words model and the continuous skip-gram model. Mikolov et al. also found that the representations have very good linear semantic characteristics, so, in 2015, the WLI neural network language model was presented to decrease the model complexity. We accumulate the offsets of corresponding dimensions of two word representations as the linear semantic distance between them and further investigate how coordinate offset affects word similarity. In this paper, we use the Sogou News Corpus as the training set, which contains about 1.1 million different words.
3.2. Converting the Distances of Words into Word Similarities
All word representations are located in a high-dimensional vector space, in which we determine an entry’s polarity and intensity by computing the distance between the entry and the seed words. However, many words could express, for example, happiness, and it is difficult to choose one of them as the only seed of “happy.” To decrease the deviation caused by subjectivity, we use the coordinate offset of word representations to list the 50 nearest neighbors of “happy” and then manually choose several of them as the seed set of “happy.” For example, we collect all distances between “bittersweet” and the “happy” seeds and take the average as the distance between “bittersweet” and the “happy” emotion. For any word W, we can obtain a 10-dimensional distance vector Dis(W), whose dimensions represent, respectively, the distance between W and happy, like, believable, gratitude, complimentary, sad, hate, unexpected, angry, and critical.
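The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the "accumulated coordinate offset" is the sum of absolute coordinate differences (an L1 distance), and the names `offset_distance`, `category_distance`, `distance_vector`, and the toy vectors are hypothetical.

```python
from statistics import mean

# Category order follows the order in which the text lists the dimensions of Dis(W).
CATEGORIES = ["happy", "like", "believable", "gratitude", "complimentary",
              "sad", "hate", "unexpected", "angry", "critical"]


def offset_distance(u, v):
    """Accumulated coordinate offset, assumed here to be the L1 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def category_distance(word_vec, seed_vecs):
    """Average distance from one word to every seed of one sentiment category."""
    return mean(offset_distance(word_vec, s) for s in seed_vecs)


def distance_vector(word_vec, seeds_by_category):
    """Dis(W): one average distance per category, in the fixed category order."""
    return [category_distance(word_vec, seeds_by_category[c]) for c in CATEGORIES]


# Toy 3-dimensional vectors, for illustration only.
word = [0.0, 1.0, 2.0]
happy_seeds = [[0.0, 1.0, 3.0], [1.0, 1.0, 4.0]]  # distances 1.0 and 3.0
print(category_distance(word, happy_seeds))  # 2.0
```

Averaging over a seed set rather than using a single seed word reduces the influence of any one manually chosen seed on the resulting distance.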
Previous research pointed out that a word generally carries only one or two main emotions, so we preserve the smallest one or two distances in Dis(W) as effective distances. Larger distances are discarded, which means that sentiments with lower similarities are eliminated. For example, if the threshold is set to 3.00 for the word “bittersweet,” only 2 distances in Dis(W), happy 1.13 and sad 1.34, are retained, because the sum of these 2 distances has not reached the threshold. This can be interpreted as “happy” and “sad” being the main sentiments contained in “bittersweet.” Only these 2 distances are retained and used as “effective distances” in the follow-up work.
Previous work points out that the linear coordinate offset between word representations is directly associated with the words’ semantic similarity. Therefore, we can annotate a word W’s polarity intensity according to the coordinate offset between W and the seed words in the vector space of word representations. Considering that there may be more than one effective distance in Dis(W), it is necessary to investigate how different distributions of those distances affect word similarity. To solve the problem of converting the distance vector Dis(W) into the sentiment vector Senti(W), we define 3 converting constraints.
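The selection of effective distances can be illustrated with a short sketch. The exact selection rule is not spelled out in the text, so the reading used here is an assumption: the nearest category is always kept, and the second nearest is kept only if the running sum stays below the threshold. The function name `effective_distances` and all distance values other than 1.13 and 1.34 are hypothetical.

```python
def effective_distances(dis, threshold, max_keep=2):
    """Keep the smallest one or two distances; set all others to 0.

    Assumed rule: the smallest distance is always kept, and the next one is
    kept only while the accumulated sum stays below the threshold.
    """
    if not dis:
        return []
    order = sorted(range(len(dis)), key=lambda i: dis[i])
    kept = [order[0]]
    total = dis[order[0]]
    for i in order[1:max_keep]:
        if total + dis[i] >= threshold:
            break
        kept.append(i)
        total += dis[i]
    return [d if i in kept else 0.0 for i, d in enumerate(dis)]


# "bittersweet": happy 1.13 and sad 1.34 come from the text; the other eight
# distances are made-up placeholders.
bittersweet = [1.13, 4.2, 5.0, 4.8, 4.5, 1.34, 5.1, 4.9, 5.3, 4.7]
print(effective_distances(bittersweet, threshold=3.00))
```

With these inputs only the "happy" and "sad" dimensions survive, since 1.13 + 1.34 = 2.47 is below the threshold of 3.00.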
Constraint 1 (diversity constraint). Each dimension of Senti(W) is denoted as $\mathrm{Senti}_i(W)$ ($i$ is an integer ranging from 1 to 10), indicating word W’s intensity of each sentiment category. $\mathrm{Senti}_i(W)$ shall be negatively correlated with the count of effective distances Count(Dis(W)), because it is observed that words with more effective distances usually lie farther away from each sentiment category, which can be interpreted as the sentiment intensities being “distracted” by various polarities. For example, “rage” is only 1.92 away from the “angry” category, while “unfair” is 3.38 away from “angry” and 5.05 away from “critical.”
Constraint 2 (self constraint). Each dimension of Dis(W) is denoted as $\mathrm{Dis}_i(W)$ ($i$ is an integer ranging from 1 to 10), indicating the distance between word W and each sentiment category. The intensity of a certain sentiment $\mathrm{Senti}_i(W)$ shall be negatively correlated with the corresponding distance $\mathrm{Dis}_i(W)$, because, in word representations, a smaller distance indicates greater semantic or pragmatic similarity.
Constraint 3 (global contrast constraint). The intensity of a certain sentiment $\mathrm{Senti}_i(W)$ shall be negatively correlated with the ratio of $\mathrm{Dis}_i(W)$ to the average effective distance Avg(Dis(W)). In a language, human habits cause large differences in word frequencies, and the collocation of words also divides words into various clusters. Both effects influence the quantized word representations. For example, the effective distance vector of “enjoy” is (2.09, 1.11, 0, 0, 0, 0, 0, 0, 0, 0) and that of “enchanted” is (5.26, 3.87, 0, 0, 0, 0, 0, 0, 0, 0). The global contrast constraint is used to eliminate this disparity.
From the converting constraints set we could derive the generating formula of W’s sentiment vector Senti(W) as follows:
Senti(W) indicates word W’s intensity of each sentiment category. The formula contains three factors: the factor of diversity constraint Diverge, the factor of self constraint Self, and the factor of global contrast constraint Contrast. These factors can be, respectively, expressed as follows:
In formulas (3), (4), and (5), Count(Dis(W)) represents the count of effective distances and Avg(Dis(W)) is the average value of the effective distances. The positive or negative correlations required by constraints 1, 2, and 3 are reflected by the positions of these terms in the numerators or denominators of formulas (3), (4), and (5). The formulas also contain three constants, whose assignments are introduced with the experiments in Section 5. The 3 parameters α, β, and γ, respectively, determine the effect of each constraint. The optimal parameters can be trained through the optimization framework (Section 4).
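To make the role of the three constraints concrete, the sketch below uses one hypothetical functional form that satisfies their monotonicity requirements. It is not a reproduction of formulas (2) to (5): the power-law factors, the omission of the three constants, and the name `senti_vector` are all assumptions made only for illustration. The exponents are chosen so that setting a parameter to zero turns the corresponding factor into 1, that is, the constraint is ignored.

```python
def senti_vector(eff, alpha=1.0, beta=1.0, gamma=1.0):
    """Convert effective distances into intensities (hypothetical form).

    eff: 10-dimensional vector in which discarded distances are 0.
    Each retained dimension gets Diverge * Self * Contrast, where every
    factor decreases as its constraint requires.
    """
    active = [d for d in eff if d > 0]
    if not active:
        return [0.0] * len(eff)
    count = len(active)
    avg = sum(active) / count
    senti = []
    for d in eff:
        if d <= 0:
            senti.append(0.0)
            continue
        diverge = (1.0 / count) ** alpha   # fewer effective distances, higher intensity
        self_factor = (1.0 / d) ** beta    # smaller distance, higher intensity
        contrast = (avg / d) ** gamma      # smaller distance relative to the average
        senti.append(diverge * self_factor * contrast)
    return senti


bittersweet = [1.13, 0.0, 0.0, 0.0, 0.0, 1.34, 0.0, 0.0, 0.0, 0.0]
print(senti_vector(bittersweet))
```

Under this assumed form, the "happy" dimension of "bittersweet" receives a higher intensity than the "sad" dimension because its distance is smaller, and a word with a single effective distance scores higher than a word whose intensity is split across two categories.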
Finally, to every sentiment word W, we annotate it in our sentiment lexicon with a 10-dimensional vector Senti(W). The value in each dimension represents the similarity between W and this sentiment, that is, W’s intensity of this sentiment.
3.3. Sentiment Tendency Disambiguation Based on Word Distribution Density
In Section 3.2 we introduced an automatic method to identify a word’s polarity and intensity. But some words convey different sentiment polarities in different contexts. It would be inappropriate to annotate such words with only one sentiment vector, so we investigate sentiment disambiguation in this section. Chen et al. pointed out that “sentiment disambiguation is different from word sense disambiguation” because, in a general sentiment lexicon, a word’s sentiment tendency is not directly correlated with its meaning.
We use a hybrid approach to screen multisentiment words from our lexicon’s vocabulary. So far there is no effective method for the automatic selection of multisentiment words. We attempted to extract words that appear in different synonym sets in “HIT Tongyicicilin” and “Synonym Lexicon for Pupils” and to take these words as the candidate set of multisentiment words. However, this candidate set shows good precision but poor recall. For example, “naive” could convey both positive and negative senses but is not covered by the candidate set. We finally decided to select multisentiment words manually from “HIT Tongyicicilin” and “Synonym Lexicon for Pupils” and to take these words as the multisentiment word set, which includes 148 entries altogether.
Then, 113,694 sentences containing words in this set are selected from the Sogou News Corpus, and the sentiment tendency of these words is annotated with a positive or negative label. Within a context window of size 16, the distribution density of each context word is extracted and used as a feature of the SVM classifier. The distribution density of a context word CW can be obtained by
Count(CW_positive) represents the count of context word CW in all sentences that have a positive W. Count(CW_negative) is the count of CW in all sentences that have a negative W. Count(CW) is the total count of CW in all sentences where W appears.
After the tendency disambiguation, a multisentiment word W is split and segmented as two independent cases, one positive and one negative. The word representations are then trained again, and the sentiment vectors of the two cases can be generated through formula (2).
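The three counts defined above can be collected as follows. This is a hedged sketch: the window of size 16 is interpreted as 8 tokens on each side of W, counts are taken inside that window, and the `density` function (a normalized difference of the positive and negative counts) is only an assumed stand-in for the density formula, which is not reproduced here. All function names are hypothetical.

```python
from collections import Counter


def context_counts(sentences, target, window=16):
    """Count context words around `target` in positive and negative sentences.

    sentences: iterable of (tokens, label), label in {"positive", "negative"}.
    Assumes the window of size 16 means 8 tokens on each side of the target.
    """
    pos, neg = Counter(), Counter()
    half = window // 2
    for tokens, label in sentences:
        for idx, tok in enumerate(tokens):
            if tok != target:
                continue
            context = tokens[max(0, idx - half):idx] + tokens[idx + 1:idx + 1 + half]
            (pos if label == "positive" else neg).update(context)
    return pos, neg


def density(cw, pos, neg):
    """Assumed density: (Count(CW_positive) - Count(CW_negative)) / Count(CW)."""
    total = pos[cw] + neg[cw]  # every sentence is labeled, so this is Count(CW)
    return (pos[cw] - neg[cw]) / total if total else 0.0


sentences = [
    (["a", "naive", "b"], "positive"),
    (["a", "naive", "c"], "negative"),
    (["b", "naive"], "positive"),
]
pos, neg = context_counts(sentences, "naive")
print(density("a", pos, neg), density("b", pos, neg), density("c", pos, neg))
```

A context word that appears equally often with both tendencies gets a density near 0 under this assumption, while a word seen only with one tendency gets +1 or -1, which is what makes it useful as a classifier feature.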
4. Global Optimization Framework
Section 3 presented a converting constraint set, with which our lexicon SentiRuc is preliminarily generated. This section establishes a unified form of evaluation function to study the effects of the various constraints. We collected data from HIT Tongyicicilin, Synonym Lexicon for Pupils, Antonym Lexicon for Pupils, and the datasets of the NLPCC 2013 and NLPCC 2014 Competitions. These datasets are all manually constructed resources and thus can be regarded as gold standards. The error function is used to evaluate the deviation between SentiRuc and those gold standards, and our goal is to find a set of parameters that minimizes the deviation.
4.1. Synonymous Relationship
If two words $W_1$ and $W_2$ are annotated as a pair of synonyms in HIT Tongyicicilin or Synonym Lexicon for Pupils, we can infer that their sentiment polarities and intensities tend to be similar. To formalize this intuition, we accumulate the deviation of the sentiment intensities of $W_1$ and $W_2$ on corresponding dimensions in SentiRuc. The error function is shown as follows:
Senti(W) indicates word W’s intensity of each sentiment category. In formula (7), $W_1$ and $W_2$ are a pair of words labeled as synonyms, and the accumulated deviation is normalized by the count of synonym pairs. The synonym error thus represents the average deviation of the sentiment intensities of $W_1$ and $W_2$ on corresponding dimensions, taken over pairs in which both words appear in SentiRuc and in the synonym resources. It varies with the parameters α, β, and γ in formulas (3), (4), and (5).
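The synonym error can be sketched as an average per-pair deviation. This is an illustration under assumptions: the deviation on each dimension is taken as an absolute difference (a squared difference would be an equally plausible reading), and the name `synonym_error` and the toy lexicon entries are hypothetical.

```python
def synonym_error(pairs, lexicon):
    """Average deviation between synonym pairs on corresponding dimensions.

    pairs: iterable of (word_a, word_b) labeled as synonyms.
    lexicon: dict mapping a word to its 10-dimensional sentiment vector.
    Pairs with a word missing from the lexicon are skipped.
    """
    total, n = 0.0, 0
    for a, b in pairs:
        if a in lexicon and b in lexicon:
            total += sum(abs(x - y) for x, y in zip(lexicon[a], lexicon[b]))
            n += 1
    return total / n if n else 0.0


# Toy sentiment vectors; dimension 0 is "happy".
lexicon = {
    "glad": [0.5] + [0.0] * 9,
    "joyful": [0.25] + [0.0] * 9,
}
print(synonym_error([("glad", "joyful")], lexicon))  # 0.25
```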
4.2. Antonymous Relationship
If two words $W_1$ and $W_2$ are annotated as a pair of antonyms in Antonym Lexicon for Pupils, we can infer that they have opposite sentiment polarities with similar intensities. In accordance with this intuition, we accumulate the deviation of the sentiment intensities of $W_1$ and $W_2$ on opposite dimensions in SentiRuc. The error function is shown as formula (8).
Senti(W) indicates word W’s intensity of each sentiment category. In formula (8), $W_1$ and $W_2$ are a pair of words labeled as antonyms, the two dimension indices denote opposite sentiments in the sentiment vectors of $W_1$ and $W_2$, and the accumulated deviation is normalized by the count of antonym pairs. The antonym error thus represents the average deviation of the sentiment intensities of $W_1$ and $W_2$ on opposite dimensions, taken over pairs in which both words appear in SentiRuc and in the antonym resources. It varies with the parameters α, β, and γ in formulas (3), (4), and (5).
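The antonym error differs from the synonym error only in which dimensions are compared. The sketch below assumes the dimension order given for Dis(W) (happy, like, believable, gratitude, complimentary, sad, hate, unexpected, angry, critical), so that dimension i and dimension i + 5 form an opposite pair, and it assumes an absolute difference as the deviation. The names `opposite` and `antonym_error` are hypothetical.

```python
def opposite(i):
    """Index of the opposite sentiment, assuming positives occupy 0-4 and negatives 5-9."""
    return (i + 5) % 10


def antonym_error(pairs, lexicon):
    """Average deviation between antonym pairs on opposite dimensions."""
    total, n = 0.0, 0
    for a, b in pairs:
        if a in lexicon and b in lexicon:
            total += sum(abs(lexicon[a][i] - lexicon[b][opposite(i)])
                         for i in range(10))
            n += 1
    return total / n if n else 0.0


# Toy vectors: "glad" is happy (dimension 0); "gloomy" and "blue" are sad (dimension 5).
lexicon = {
    "glad": [0.5] + [0.0] * 9,
    "gloomy": [0.0] * 5 + [0.5] + [0.0] * 4,
    "blue": [0.0] * 5 + [0.25] + [0.0] * 4,
}
print(antonym_error([("glad", "gloomy")], lexicon))  # 0.0
print(antonym_error([("glad", "blue")], lexicon))    # 0.25
```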
4.3. Sentiment Ratings at the Sentence Level
If the annotation of SentiRuc contributes more to relevant tasks, the sentiment classification result obtained with SentiRuc shall be closer to human judgment than that obtained with other lexicons. We select 6000 sentences from the datasets of the NLPCC 2013 and NLPCC 2014 Competitions and label each sentence with a “main sentiment” and an optional “subsentiment,” both of which belong to the 10 sentiment categories of SentiRuc. For a given set of parameters in formulas (3), (4), and (5), we generate an individual annotation of SentiRuc for the sentiment classification task. The sentence-level error function is constructed from the Jaccard similarity between the classification result and the labeled result. In this function, each sample has an identifier that ranges over the number of sentences contained in the dataset, Label(·) represents the labeled sentiment vector of a sentence, and Sentence(·) is the sentiment vector classified using SentiRuc. The function is based on the average Jaccard similarity between each sentence’s labeled result and classification result, and it varies with the parameters α, β, and γ.
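The Jaccard-based comparison can be sketched as follows. Each sentence's labels are represented here as a set of category names (the main sentiment plus the optional subsentiment). Turning the average similarity into an error as one minus the mean is an assumption, as are the function names.

```python
def jaccard(a, b):
    """Jaccard similarity of two label sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_jaccard(labeled, predicted):
    """Average Jaccard similarity over aligned lists of label sets."""
    sims = [jaccard(l, p) for l, p in zip(labeled, predicted)]
    return sum(sims) / len(sims) if sims else 0.0


def sentence_error(labeled, predicted):
    """Assumed error form: 1 minus the average similarity (lower is better)."""
    return 1.0 - mean_jaccard(labeled, predicted)


labeled = [{"happy", "like"}, {"sad"}]
predicted = [{"happy"}, {"sad"}]
print(mean_jaccard(labeled, predicted))    # 0.75
print(sentence_error(labeled, predicted))  # 0.25
```

Using a set-overlap measure gives partial credit when the classifier recovers the main sentiment but misses the subsentiment, as in the first sentence of the example.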
4.4. The Global Error Function
By combining the above three evaluation methods we obtain the global optimization framework based on manually constructed resources, as shown in Figure 1.
The global error function is
5. Experiments
We first evaluate the generating process of SentiRuc and then verify the usability of SentiRuc. To evaluate the generating process, we design a parameter tuning experiment to show the rationality of the constraint set (Section 5.1) and verify the validity of the sentiment tendency disambiguation method (Section 5.2). To test the usability of SentiRuc, we compare the qualitative and quantitative annotation of SentiRuc with that of other lexicons (Section 5.3) and investigate the performance of different lexicons in sentiment classification tasks (Section 5.4). The NTUSD lexicon of Taiwan University, the HowNet sentiment lexicon, and the DUT Ontology are all involved in the experiments.
In all experiments, the threshold distance value is set to Avg(Dis(W)) so that only one or two kinds of sentiments remain for each word. The constant in formula (3) is set to 10, the number of sentiment categories. The second constant is set to 8, that is, the number of sentiment categories minus the maximum number of remaining sentiments (10 − 2). The third constant is set to 3.38, the average coordinate offset between every two words included in SentiRuc. The dimension number of the word representations is 60. Using either more or fewer dimensions increases the value of the error function of the intensity annotation result.
The Sogou News Corpus (3.17 GB) is used as the training text set. After segmentation by ICTCLAS 5.0, developed by the Chinese Academy of Sciences, this corpus contains about 0.83 billion words, and the vocabulary size is 1,104,914. We apply no other preprocessing to the data, which ensures that every n-gram sample is a real Chinese word sequence and that the word representations reflect the actual semantic distribution of each word.
5.1. Evaluation of the Generating Constraint Set
Section 3 introduced how to annotate the multisentiment lexicon SentiRuc automatically by defining the converting constraint set of distance and sentiment intensity. Section 4 presented a global optimization framework to optimize the parameters. We first set α, β, and γ to 1 as the baseline experiment. Setting a parameter to zero is equivalent to ignoring the corresponding constraint. We therefore conduct contrast experiments in which one parameter is dropped in each experiment. The global error obtained under each parameter setting is shown in Table 1.
It can be seen that dropping any constraint increases the global error, which indicates that all constraints are useful in computing the intensity of SentiRuc. When β is dropped, the global error increases the most, which suggests that the self constraint contributes the most. This means that the intensity of a certain sentiment is strongly negatively correlated with the corresponding distance, which supports the rationality of our method based on word representations. We then search for the optimal parameter set through the variable-controlling method, adjusting one parameter at a time while the others are held fixed. As shown in the bottom three rows, the global error is further decreased, and the optimal parameter set is listed in the bottom line.
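The variable-controlling search can be pictured as a coordinate-wise sweep: one parameter is varied over a grid while the other two are held fixed, and a change is kept only if it lowers the global error. The sketch below is a generic illustration under that reading; `coordinate_search`, the grid values, and the quadratic `toy_error` standing in for the global error function are all hypothetical and unrelated to the values in Table 1.

```python
def coordinate_search(error_fn, start, grid, sweeps=3):
    """Vary one parameter at a time over `grid`, keeping any improvement."""
    params = dict(start)
    names = list(start)
    best = error_fn(**params)
    for _ in range(sweeps):
        improved = False
        for name in names:
            for value in grid:
                trial = dict(params)
                trial[name] = value
                err = error_fn(**trial)
                if err < best:
                    best, params, improved = err, trial, True
        if not improved:
            break  # no single-parameter change helps any more
    return params, best


def toy_error(alpha, beta, gamma):
    """Stand-in for the global error; its minimum is at (0.5, 2.0, 1.0)."""
    return (alpha - 0.5) ** 2 + (beta - 2.0) ** 2 + (gamma - 1.0) ** 2


baseline = {"alpha": 1.0, "beta": 1.0, "gamma": 1.0}
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
print(coordinate_search(toy_error, baseline, grid))
```

Such a search evaluates the error far fewer times than a full grid over all three parameters, at the cost of possibly stopping at a setting that is only optimal one coordinate at a time.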
5.2. Evaluation of Tendency Disambiguation
Section 3.3 introduced how the 148 multisentiment words are selected. From the Sogou News Corpus we collect sentences that contain these multisentiment words and label a multisentiment word W with “1” when W expresses a positive tendency and with “2” when W expresses a negative tendency. To ensure the quality of the labels, eight native Chinese speakers participated in the annotation work. Every annotator independently labeled about 50 thousand sentences, and each sentence was annotated by 4 annotators. If the labels of a sentence conflicted, the final label was decided through a panel discussion. In total, 113,694 sentences are annotated with a positive or negative tag.
According to the labeled result, the tendency disambiguation algorithm based on word distribution density, which is introduced in Section 3.3, is used in the experiment. The experimental results of tenfold cross validation are shown in Table 2.
The overall disambiguation accuracy of all 148 words in the 113,694 sentences reaches 95.52%. The entries “epigone” and “yes-man” get the lowest accuracy, mainly due to the limited training data caused by their low occurrence. Generally, this experiment result shows that our disambiguation algorithm can effectively distinguish different tendencies of a word.
5.3. Evaluation of the Annotation’s Quality of SentiRuc
The sentiment polarity and intensity of SentiRuc are both drawn from a gigabyte-scale Chinese corpus; therefore, its semantic description should be closer to the actual semantic distribution than that of manually constructed lexicons. We evaluate the annotation quality of several existing lexicons by analyzing their sentiment category consistency (qualitative evaluation) and sentiment intensity consistency (quantitative evaluation). Sentiment category consistency examines the similarity of the tendency annotations of synonyms (or antonyms) in a lexicon. Sentiment intensity consistency refers to the similarity of the intensity annotations of synonyms (or antonyms) in a lexicon.
HIT Tongyicicilin and Synonym Lexicon for Pupils contain 55,265 synonyms, from which we selected 2,500 synonyms as the synonym test dataset. The 1,774 antonyms in Antonym Lexicon for Pupils are taken as the antonym test dataset. Multisentiment words are not included in either test dataset. The category consistency and intensity consistency can be expressed as
$\mathrm{Senti}_i(W)$ ($i$ is an integer ranging from 1 to 5) indicates word W’s intensity of each positive sentiment category. The two count terms represent the numbers of corresponding dimensions that are annotated with nonzero values.
Tables 3 and 4 indicate that, in SentiRuc, the tendency annotations of both synonyms and antonyms are closer to the manually labeled resources than those of the other sentiment lexicons. Although each word’s sentiment intensity is computed independently, the intensity consistency of synonyms and antonyms in SentiRuc reaches 92% and 91%, respectively. This score is up to 20 percentage points higher than that of the manual intensity annotation of the DUT Ontology, which is far more than expected. The results support the effectiveness of the converting constraint set and formula (2) and also indicate that SentiRuc has better semantic descriptiveness.
5.4. Evaluation of SentiRuc in Sentiment Analysis Tasks
This experiment investigates the performance of sentiment analysis task using different lexicons. 3,100 sentences are selected from NLPCC 2013 Competition and NLPCC 2014 Competition and 3,700 sentences containing one of the 148 multisentiment words are selected from Sina Microblog. All 6,800 sentences are labeled with a “main sentiment” and an optional “subsentiment” tag. We define 2-gram part of speech (2-POS) and 3-gram part of speech (3-POS) for every labeled sample and extract sentiment tendency features with the help of SentiRuc. SVM is used in the multivariate classification experiments. Compared with human annotation result, the accuracy of the multivariate classification reaches 62.0%.
To facilitate the comparison of different lexicons, we also conduct binary classification experiments (positive or negative). Each of the 6,800 sentences is labeled with a “positive” or “negative” tag by four native Chinese speakers. Another 3,200 objective sentences without affection are labeled as “neutral” and added to the test dataset. For each sentence, we extract the 2-POS and 3-POS features and identify sentiment features with the help of SentiRuc. We use an SVM classifier with tenfold cross validation. In addition, we investigate the performance of SentiRuc before and after the tendency disambiguation. The results are evaluated by precision, recall, and F-measure:
$$\mathrm{Precision} = \frac{\mathrm{Result\_Correct}}{\mathrm{Result\_Proposed}}, \qquad \mathrm{Recall} = \frac{\mathrm{Result\_Correct}}{\mathrm{Result\_Labeled}}, \qquad F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Result_Correct is the number of sentences that are correctly labeled with “positive” (or “negative”). Result_Proposed is the number of sentences labeled with “positive” (or “negative”) by SVM model. Result_Labeled is the number of sentences manually labeled with “positive” (or “negative”). The result is shown in Table 5.
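Assuming the standard definitions of precision, recall, and F-measure in terms of the three counts defined above, the evaluation can be computed as in the following sketch. The function name `prf` and the example counts are hypothetical and are not taken from Table 5.

```python
def prf(result_correct, result_proposed, result_labeled):
    """Precision, recall, and F-measure for one class (positive or negative)."""
    precision = result_correct / result_proposed if result_proposed else 0.0
    recall = result_correct / result_labeled if result_labeled else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts: 100 sentences proposed as positive, 80 of them correct,
# 160 sentences manually labeled as positive.
print(prf(80, 100, 160))
```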
Table 5 indicates that the F-measure of positive and negative classification using SentiRuc is clearly higher than that obtained with the other lexicons. Among the 6,800 subjective sentences, those containing multisentiment words account for 54.4%, and such a high percentage results in an apparent difference before and after disambiguation. It also affects the overall F-measure on general-domain text, which is 0.726 for positive text and 0.627 for negative text. By comparison, on the 6,300 sentences that do not contain any multisentiment word, the F-measure of positive text and negative text is 0.817 and 0.742, respectively.
6. Conclusion
This paper presented an automatic construction and global optimization framework for a multisentiment lexicon, SentiRuc. The main contributions include the categorization of human emotions, an automatic construction model based on the WLI language model, a global optimization framework based on several manually annotated semantic resources, and the disambiguation of multisentiment words. The experiments in Section 5 indicate that SentiRuc performs well on general datasets. In particular, in the intensity labeling test, SentiRuc outperforms the second-best lexicon by 21 percent, which indicates that statistical language modeling performs very well in the semantic representation of sentiments. Our lexicon is now available online (https://pan.baidu.com/s/1jHAInlG).
It is difficult to compare existing lexicons directly because they use different sentiment categorizations. We will investigate appropriate evaluation methods for multiclass sentiment classification tasks.
Although Section 5 has shown the strong performance of word representations in the construction of sentiment lexicons, the particular properties of word representations still cause problems for text mining tasks. Firstly, statistical language models depend heavily on the correspondence between inner semantics and outer grammar; thus, it is important to study how to understand and distinguish “similar” words and “related” words, and how both relate to the models that generate word representations. Secondly, the vectors of similar words differ markedly in only a few specific dimensions, and this characteristic needs further research. We will study weighted statistical language models and investigate the feasibility and effect of introducing various vector operations into the estimation of semantic distances.
The authors declare that there are no competing interests.
This research was supported by the National Science Foundation for Young Scientists of China under Grant 61601371, the National Natural Science Foundation of China under Grant 71271209, Beijing Municipal Natural Science Foundation under Grant 4132052, and Humanity and Social Science Youth Foundation of Ministry of Education of China under Grant 11YJC630268.
A. Esuli and F. Sebastiani, “Sentiwordnet: a publicly available lexical resource for opinion mining,” in Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC '06), pp. 417–422, Genoa, Italy, 2006.
S. Baccianella, A. Esuli, and F. Sebastiani, “Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining,” in Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC '10), pp. 2200–2204, May 2010.
L. Xu, H. Lin, and Y. Pan, “Constructing the affective lexicon ontology,” Journal of the China Society for Scientific and Technical Information, vol. 27, pp. 180–185, 2008.
C. Quan and F. Ren, “Construction of a blog emotion corpus for Chinese emotional expression analysis,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 3, pp. 1446–1454, Association for Computational Linguistics, Singapore, 2009.
C. Fellbaum, WordNet: An Electronic Lexical Database, Bradford Book, 1998.
General Inquirer(GI)[EB/OL], 2012, http://www.wjh.harvard.edu/~inquirer/.
F. He, “Orientation analysis for Chinese blog text based on semantic comprehension,” Journal of Computer Applications, vol. 31, pp. 2130–2133, 2011.
R.-J. Li, X.-J. Wang, and Y.-Q. Zhou, “Semantic orientation computing using PageRank model,” Journal of Beijing University of Posts and Telecommunications, vol. 33, no. 5, pp. 141–144, 2010.
L. Chen, W. Wang, M. Nagarajan, S. Wang, and A. P. Sheth, “Extracting diverse sentiment expressions with target-dependent polarity from twitter,” in Proceedings of the 6th International AAAI Conference on Weblogs and Social Media, Kno.e.sis, Dublin, Ireland, 2012.
H. Wang, Y. Lu, and C. Zhai, “Latent aspect rating analysis on review text data: a rating regression approach,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), pp. 783–792, San Francisco, Calif, USA, July 2010.
A. Neviarouskaya, H. Prendinger, and M. Ishizuka, “SentiFul: generating a reliable lexicon for sentiment analysis,” in Proceedings of the 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII '09), pp. 1–6, IEEE, Amsterdam, The Netherlands, September 2009.
S. Mohammad, C. Dunne, and B. Dorr, “Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), pp. 599–608, August 2009.
Z. Zhang, X. Yang, Q. Ma, and C. Xu, “Learning continuous word representations from large-scale corpus through linear approach,” in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC '15), pp. 2678–2683, IEEE, Hong Kong, October 2015.
T. Mikolov, Statistical language models based on neural networks [Ph.D. thesis], Brno University of Technology, 2012.
T. Mikolov, M. Karafiát, L. Burget et al., “Recurrent neural network based language model,” in Proceedings of the 11th Annual Conference of the International Speech Communication Association, pp. 1045–1048, Chiba, Japan, September 2010.
J. Chen, H. Lin, and Z. Yang, “Word emotion disambiguation based on Bayesian model,” in Proceedings of the 9th China National Conference on Computational Linguistics (CCL '07), 2007.