Mathematical Problems in Engineering

Special Issue

Artificial Intelligence and Its Applications

Research Article | Open Access

Volume 2014 | Article ID 429629 | 19 pages | https://doi.org/10.1155/2014/429629

Towards a Unified Sentiment Lexicon Based on Graphics Processing Units

Academic Editor: Yudong Zhang
Received 16 Jul 2013
Accepted 11 Oct 2013
Published 13 Mar 2014

Abstract

This paper presents an approach to create what we have called a Unified Sentiment Lexicon (USL). This approach aims at aligning, unifying, and expanding the set of sentiment lexicons that are available on the web in order to increase their robustness of coverage. One problem in automatically unifying the scores of different sentiment lexicons is that, for many lexical entries, the classification as positive, negative, or neutral depends on the unit of measurement used in the annotation methodology of the source sentiment lexicon. Our USL approach computes the unified strength of polarity of each lexical entry based on the Pearson correlation coefficient, which measures how correlated lexical entries are with a value between 1 and −1, where 1 indicates that the lexical entries are perfectly correlated, 0 indicates no correlation, and −1 means they are perfectly inversely correlated; this computation is carried out by the UnifiedMetrics procedure, which is implemented for both CPU and GPU. Another problem is the high processing time required for computing all the lexical entries in the unification task. Thus, the USL approach computes a subset of lexical entries in each of the 1344 GPU cores and uses parallel processing in order to unify 155802 lexical entries. The results of the analysis conducted using the USL approach show that the USL has 95,430 lexical entries, of which 35,201 are considered positive, 22,029 negative, and 38,200 neutral. Finally, the runtime was 10 minutes for 95,430 lexical entries, which reduces the computing time of UnifiedMetrics by a factor of 3.

1. Introduction

Written language has been the preferred medium of communication in order to express facts, assumptions, and opinions. The web has facilitated the connection of speakers overcoming the barriers imposed by location, language used, customs, context, culture, and so forth through electronic devices. Content producers are increasing their activity in blogs, web pages, portals, social networks, e-mails, chats, and so forth.

Surprisingly, the speaker boundaries of nationality are no longer distinguished, even using the Global Positioning System, due to the growth of a global migration and the merging of languages by multilingual professional communities.

In many countries, multilingual features occur in many families where parents have several mother tongues. They use a common language at home while living in a country where another language is spoken. The children switch between an average of four languages simultaneously, excluding the extra languages that they learn in school. For instance, according to the United States Census in 2007 [1], 80.3% of individuals aged five years and older speak only English at home and 19.7% use a language other than English at home; among the latter, Spanish accounts for 62.3%, followed by Asian and Pacific Island languages with 15.0%.

On the other hand, the global economy is based on digital applications such as e-commerce and online entertainment, social media including social enterprise, digital media advertising, the Internet, and cloud computing. For that reason, knowing the opinion of citizens becomes essential, because they are increasingly active in producing content that enterprises, researchers, governments, and intelligence services consider worth monitoring.

To analyse the web to discover sentiment is a daunting task due to the difficulty of getting accurate results. Nevertheless, machine learning algorithms have obtained good results classifying sentiment within specialized domains and using controlled corpora.

Most existing studies have been conducted using cluster or statistic analysis in order to classify sentences, paragraphs, and documents. Furthermore, several rates and user profiles have been used in the collaborative assessment of services.

In summary, a collective interest is to understand the thought of a global society. In this context, structured linguistic resources are vital and they should be supported by a group of linguists working on a global level.

Lexicons are atomic linguistic resources necessary for processing information automatically. The web and the information explosion are making the available lexicons insufficient because the web is heterogeneous, multilingual, and dynamic. Even though there are approaches for automatically creating sentiment lexicons, they should be improved in the areas of verification and expert assessment.

Our research focuses on the four languages with the greatest number of first-language speakers: Chinese with 1,197 million, Spanish with 406 million, English with 335 million, and Portuguese with 202 million (based on Lewis et al. [2]). These languages contain 15,000,000, 350,000, 88,000, and X words, respectively, according to the Cambridge, RAE, Oxford, and Grande Dicionário Houaiss da Língua Portuguesa dictionaries in 2013. Producing a 100% robust sentiment lexicon in our research would be a titanic task. However, automatically increasing the robustness of existing lexicons while improving their quality is possible.

Another important challenge is the representation of lexically encoded knowledge, for which researchers are suggesting new approaches. Moreover, the structures of these resources differ from each other, which makes them difficult to reuse and turns them into an interoperability problem.

In addition, the same lexical entry can be found in many sources with distinct assessments. In this case, the unification task is key, since one of our goals is to compare the strength of polarity between sources of information and their symbols in several languages. However, the problem of unifying the strength of polarity is primarily a problem of processing power due to the size of the sentiment lexicon, which makes a manual analysis simply not feasible.

If we want to know the correlation of a lexical entry to the rest of the Unified Sentiment Lexicon, then the number of possibilities is 9.08 × 10^9. Computing time is therefore a problem, as the calculation involved is huge.

Lexical resources, especially those semantically annotated, are time consuming to build and require a lot of effort; thus, we tried to reuse as much existing work as possible in our effort to build a Unified Sentiment Lexicon.

The sentiment lexicons used in our research are as follows:
SL1: SentiWordNet, developed by the Istituto di Scienza e Tecnologie dell’Informazione;
SL2: Bing Liu Sentiment Lexicon, developed by Illinois University;
SL3: MPQA lexicon, developed by Pittsburgh University;
SL4: NTU Sentiment Dictionary (NTUSD), developed by the Institute of Information Science of Taiwan;
SL5: PanAmerican sentiment lexicon, developed by the Polytechnic University of Madrid;
SL6: Spanish Travel Subjective Lexicon (STSL), developed by the Polytechnic University of Madrid.

Our research questions are as follows:
Q1: Is it possible to unify the sentiment lexicons available on the web and align and expand them automatically?
Q2: Is it possible to transform a Unified Sentiment Lexicon into a generative lexicon based on a core ontology?

The following set of hypotheses covers the main features of the proposed solutions:
H1: the unification of sentiment lexicons allows for a robust linguistic resource;
H2: given different strengths of polarity of the same lexical entry, it is possible to compute a unified value;
H3: computing the unification of each lexical entry with the GPUs’ local and global memory allows the reduction of hard disk access and increases processing speed.

Compared with previous work, the major contributions of this paper are as follows:
C1: a cluster of sentiment lexicons has been unified automatically and validated by experts;
C2: the Unified Sentiment Lexicon has been expanded with two more sentiment lexicons that were developed by our research group, Communication in Specialized Domains;
C3: the unification task uses parallel processing on GPUs for computing each lexical entry;
C4: the USL computation was accelerated by a factor of 3 in data processing;
C5: robustness of coverage of the Unified Sentiment Lexicon;
C6: a uniform representation of lexically encoded knowledge.

In summary, this paper describes the Unified Sentiment Lexicon (USL) approach for aligning, unifying, and expanding the sentiment lexicons available in an automatic way in order to increase their robustness of coverage obtaining as a result a large-scale Unified Sentiment Lexicon based on GPUs.

The remainder of the paper is structured as follows. Section 2 briefly presents the background of our work. Section 3 describes the USL approach. Section 4 describes how our USL approach was implemented and the different subtasks of the algorithm in detail. Section 5 presents details of our data sets, evaluation metrics, and the result. Finally, Section 6 presents our conclusions and future research.

The related work considers the following: Section 2.1 data structures such as lexicons, specialized lexicons, and ontologies; Section 2.2 the methodologies available for building lexicons; and Section 2.3 the techniques of data processing focused on parallel processing and the kind of memory used for the TESLA architecture. Finally, we will present a summary table with the main features of the sentiment lexicons that are part of our study.

2.1. Data Structures

The lexical representation is founded on several data structures that form the basis of the linguistic resource, which is atomic. First, we will examine the lexicons; second, the generative lexicons; and finally, the ontologies.

2.1.1. Lexicons

According to Graeme et al., a lexicon “is a list of words in a language along with some knowledge of how each word is used.” A lexicon can have monolingual or multilingual properties. Lexicons are created manually [3–5], semiautomatically [6, 7], or automatically [8–11]. When a lexicon is built manually, a group of experts can annotate all the words in a specific corpus; the assessment is performed by consensus and each lexical entry is checked in order to achieve excellence.

On the other hand, automatic lexicons can be produced, based on a specific corpus, where the lexical entries included far exceed the total number that can be compiled manually. However, to assess their quality is not an easy task.

In state-of-the-art research [3, 4, 8–10], each group examines a collection of documents and produces its own lexicon. As a result, there are a number of lexicons, some of which are available on the web and others not; we describe them in Section 2.1.3. In fact, we believe that all work carried out by universities and research groups should be used in a homogeneous way. Furthermore, the available lexicons should be reused by means of algorithms that facilitate data processing.

The generative lexicon was introduced by Pustejovsky [12, 13] in 1995 with the aim of encoding selectional knowledge in language. Unlike a generative lexicon, an enumerative lexicon only includes the different senses for each lexical entry. In Pustejovsky’s approach, there are four elements to encode: lexical typing structure, argument structure, event structure, and qualia structure. Pustejovsky’s model therefore deals with (a) the knowledge representation of the lexicon and the relation between an object and its constituents, (b) the formal role that distinguishes the object within a larger domain, (c) the purpose and function of the object, and (d) the factors involved in the origin of an object; all these constitute the qualia structure.

Bergler [14] said that there are significant efforts involved in building and sharing a big generative lexicon that will be a standard in the scientific community.

2.1.2. Ontology

According to Gruber [15] “an ontology is an explicit specification of a conceptualization.” In this sense, Graeme Hirst [16] said that “an ontology, as a nonlinguistic object that more directly represents the world, may provide an interpretation or grounding of word senses.”

The following sentiment ontologies are available: Ontology-Supported Polarity Mining [17], introduced in 2005, which is based on an ontology for movie reviews, with the positive or negative polarity determined from a collection of texts; and the Chinese Emotion Ontology based on HowNet [18], which contains just under 5,500 verb concepts covering 113 different emotion categories.

2.1.3. Sentiment Lexicons

Our research has focused on four sentiment lexicons that are available on the web: the National Taiwan University Sentiment Dictionary (NTUSD), SentiWordNet, Bing Liu Sentiment Lexicon, and the Subjectivity Lexicon of Theresa Wilson et al. (MPQA). These are explained below.

The National Taiwan University Sentiment Dictionary (NTUSD) [19] was developed by Lun-Wei Ku et al. It is based on the Chinese Network Sentiment Dictionary and the web. There are 11,088 terms that qualify for the simplified version, of which 2,812 are positive and 8,276 are negative. The reason for having the NTUSD was to identify positive and negative sentiment words and their weights in a corpus of blogs and news on the basis of Chinese word structures. Lun-Wei Ku et al. suggested a method for annotating 192 documents in order to tag them as positive, negative, or neutral on three levels: words, sentences, and documents. The results were that the F-measure was 73.18% and 63.75% for verbs and nouns, respectively. When the sentiment words were mined together with topical words, they achieved an F-measure of 62.16% at the sentence level and 74.37% at the document level.

SentiWordNet [20] is a lexical resource produced by the Istituto di Scienza e Tecnologie dell’Informazione, Italy. Its main objective is to automatically classify all entries of WordNet [21] as positive, negative, or neutral, assigning to each a weight between zero and one according to its value. SentiWordNet was created automatically because WordNet has more than 155,287 entries and annotating them manually would be a time-consuming task. The result is a sentiment lexicon with 117,659 terms classified into the same four lexical categories as WordNet: adjectives, nouns, verbs, and adverbs.

The Bing Liu Sentiment Lexicon [22, 23] is a lexicon where a small list of adjectives was manually created and tagged with positive or negative labels. It is domain-independent, and its author proposed a technique to grow this list using WordNet. He used a web-mining method to obtain a set of adjectives in the same way that the speakers wrote them. Thus, the lexicon has entries that are not in the English dictionary. The results obtained are 2,006 positive words and 4,764 negative words.

The Subjectivity Lexicon of Theresa Wilson et al. [3] is a lexical resource where the words that are subjective in most contexts are marked as being strongly subjective and those that may only have subjective usages in certain contexts are marked as weakly subjective. The process of building it was manual. The words were also classified according to their categories as nouns, verbs, adjectives, and adverbs. The results are 8,221 clues (as she calls them) where 4,913 are negative, 2,721 are positive, and 571 are neutral.

The Spanish Travel Subjective Lexicon (STSL) [4] was built ad hoc based on a web-based corpus of blogs that were analysed within the framework of appraisal theory [24]. The blogs were analysed to create a subcorpus of sentences annotated according to appraisal and these sentences were classified as positive or negative considering some contextual rules that could influence the strength of the polarity. These sentences were used to build the lexicon. The words were classified according to their categories as nouns, verbs, adjectives, and adverbs; multiword units were also included. The result was 1,610 terms of which 857 are positive and 753 are negative.

The PanAmerican Sentiment Lexicon approach aims to classify according to polarity a set of internet resources focused on an event. The approach is based on four components: a crawler, a filter, a synthesizer, and a polarity analyzer. The main function of the crawler component is to search for and find data from internet resources related to the event of interest. After locating the data, the filter component processes the data in order to remove noise. The filter component only cleans internet resources that are associated with the event. At this point, the corpus consists of large posts containing large amounts of data from many countries and in many languages. The synthesizer component groups the data into clusters of similar expressions using unsupervised learning. Finally, the polarity analyzer component classifies each lexical entry as positive, neutral, or negative. The lexical categories are noun, adjective, verb, and adverb. The result was 6083 positive, 5300 negative, and 5000 neutral entries.

2.2. Methodology

We have identified several kinds of methodology for building sentiment lexicons and have classified them as follows: automatic, semiautomatic, and manual.

(1) Automatic Methodology. First, the crawler is used to obtain a set of lexical terms in a controlled domain. Next, data preprocessing is performed and terms are assessed and classified as positive, negative, or neutral. The evaluation task involves using a subset of annotated lexical entries created manually by experts in order to measure the accuracy of the results. However, one of the limitations is the quality of the results because undertaking the evaluation task manually is not feasible. One of the advantages of this methodology is the higher number of terms produced compared to the results of the manual methodology.

(2) Semiautomatic Methodology. When linguistic resources use this methodology both manual and automatic annotations are used. An initial lexicon is annotated manually and this subset is used for training the algorithm, which will predict the level of matching for automatically classifying each new lexical category in the lexicon.

(3) Manual Methodology. Experts manually annotate each lexical entry in the appropriate lexical category. The lexicon quality is high, but the depth of the lexicon is less than that obtained with other methodologies.

2.3. Parallel Processing

Parallel processing [25] allows several jobs to run at the same time and accelerates the process by producing answers concurrently. Graphics Processing Units (GPUs) [26] allow the implementation of several algorithms [27].

These algorithms have been proved to provide acceleration from 2x to nx for specific problems [28].

Previous research has focused on imaging systems, fluid simulations, and molecular simulations with GPUs. Imaging systems [29] such as TechniScan use ultrasonic waves to image the patient’s chest in 20 minutes. Some other examples of the use of this technology are the following. The University of Cambridge was able to accelerate computational simulations of fluid dynamics [30] in order to perform rapid experimentation. Temple University performs molecular simulations [31] in order to reduce the environmental impact of detergents and cleaning agents. The duration of these simulations was reduced from several weeks to a few hours.

We explored the use of GPU technology [32] in order to accelerate our data preprocessing.

TESLA [33] is a GPU architecture produced in 2003 by NVIDIA. It consists of a shared memory, constant cache, register file, double precision, Special Function Unit (SFU), Streaming Processor (SP), and a Warp Scheduler.

Memory is an essential component of high-performance computing. CUDA uses several types of memory [34] depending on the problem. The host memory resides in the CPU system. CUDA provides APIs that enable faster access to the host memory by page-locking it and mapping its addresses into the GPU address space.

The device memory is in the GPU. It can be accessed by the dedicated memory controller. Data must be explicitly copied between the host memory and the device memory. This memory can be organized and accessed in different ways [35]:
Global memory can be static or dynamic. It is accessed by the CUDA cores through pointers. Its main function is to translate the addresses.
Constant memory is read only. It is accessed through a hierarchical cache optimized for transmission to several threads.
Local memory is stored in the stack: local variables which cannot be stored in registers, parameters, and the return addresses of subroutines.
Texture memory is accessed by instructions for loading and storing. As with constant memory, an independent cache is used in order to execute read-only operations.

The GPU CUDA device is a multicore coprocessor. All of the device memory can be accessed without constraints. However, the runtime varies according to the type of memory targeted.

2.4. Summary Table

Table 1 summarizes the sentiment lexicons of our study according to several features such as name of the university where it was developed; depth of the lexical entries, which is the sum of all the positive, negative, or neutral entries; coverage of language; type of elaboration methodology; lexical categories; and ordering procedure.


Name | University | Positive | Negative | Neutral | Language | Methodology | Category | Order
Bing Liu | Illinois | 2006 | 4783 | 0 | English | Automatic | No | Alphabetic
MPQA | Pittsburgh | 2721 | 4913 | 571 | English | Manual | N, V, Ad, Adv | Alphabetic
NTUSD | SINICA | 2812 | 8276 | 0 | Chinese | Semiautomatic | No | Phonetic
Pan American | UPM | 6083 | 5300 | 5000 | English, Spanish, Portuguese | Manual | N, V, Ad, Adv | Alphabetic
SentiWordNet | ISTI | 857 | 753 | 0 | English | Automatic | N, V, Ad, Adv | No
STSL | UPM | 19619 | 29792 | 89135 | Spanish | Manual | N, V, Ad, Adv, interjections, diminutives, phrases | Alphabetic

3. The Approach Proposed

Our approach aims at aligning, unifying, and expanding the set of sentiment lexicons that are available on the web in order to increase their robustness of coverage. It is composed of ten components: FocusCrawlerEngine, SelectorLanguages, MetricSearcher, MetricTransformer, SentimentLexiconIntersection, LexicalEntriesSubstracter, LexicalEntriesDivisor, UnifiedMetrics, UnionSentimentLexiconEngine, and Lexicon2OntologyConverter.

The USL approach takes as input the sentiment lexicons that are supported by universities and available on the web. First, the SelectorLanguages component creates a group of sentiment lexicons according to their language and stores them in different knowledge bases. The MetricSearcher component inspects each of the elements of the sentiment lexicons in order to identify whether they have associated metrics. Then it saves the results in two knowledge bases: (a) MetricsLexicon and (b) NoMetricsLexicon. Next, the MetricTransformer component checks whether the metrics are nonnumerical and, if so, transforms them into real values based on the original assessment. Subsequently, our USL approach performs the intersection between all the sentiment lexicons. The common lexical entries are extracted together with their word, strength of polarity, and sentiment lexicon values. Two knowledge bases are obtained as a partial result: IntersectionLexicalEntries and NoIntersectionLexicalEntries.

The MetricsLexicon knowledge base is the input for the LexicalEntriesSubstracter, whose main function is to exclude all the IntersectionLexicalEntries. The USL approach is able to calculate a unified strength of polarity between all of the lexical entries of several sentiment lexicons. However, calculating the unified strength of polarity demands a high processing time because of the number of lexical entries. The LexicalEntriesDivisor therefore splits the calculation of the unified strength of polarity into jobs with balanced loads. The coprocessors compute the unified degree of subjectivity for each lexical entry. This calculation is based on the previous assessments in the source sentiment lexicons and on incomplete information. Each coprocessor produces a lexical knowledge base with the score of the unified metric. The UnionSentimentLexiconEngine unifies all the knowledge bases into one. As a result we have the Unified Sentiment Lexicon (USL). Finally, the Lexicon2OntologyConverter transforms the data into the Web Ontology Language (OWL).

The ten components of the USL approach are described in detail below.

3.1. FocusCrawlerEngine

Its main function is to find the sentiment lexicons available on the web that are supported by universities. We can then define the following: (1) MPQA Lexicon, (2) Bing Liu Sentiment Lexicon, (3) SentiWordNet, (4) NTU Sentiment Dictionary, (5) Spanish Travel Subjective Lexicon, and (6) PanAmerican Games Sentiment Lexicon.

Consider

where
Type = {x | x is a string which measures the subjectivity degree of each clue},
Length = {x | x is an integer number that indicates the length of the clue},
Word = {x | x is a string with the token or stem of the clue},
Position = {x | x is a natural number that identifies a lexical},
Stemed = {x | x is a boolean that indicates whether the clue should match all unstemmed variants},
PriorPolarity = {x | x is a string with the prior polarity of the clue}.

Consider

where
Words = {x | x is a string with a word that is positive or negative}.

Consider

where the sets are
{x | x is a string with a word that is a noun and positive},
{x | x is a string with a word that is a noun and negative},
{x | x is a string with a word that is an adjective and positive},
{x | x is a string with a word that is an adjective and negative},
{x | x is a string with a word that is a verb and positive},
{x | x is a string with a word that is a verb and negative},
{x | x is a string with a word that is an adverb and positive},
{x | x is a string with a word that is an adverb and negative}.

Consider

where
Words = {x | x is a string with a word that is positive or negative}.

Consider

where
{x | x is a character of WordNet},
{x | x is a character of WordNet},
PosScore = {x | x is a real number with the positive score assigned by SentiWordNet},
NegScore = {x | x is a real number with the negative score assigned by SentiWordNet},
SynsetTerms = {x | x is a string with the terms}.

Consider

where
ID = {x | x is an integer number that identifies a word},
TimeStamp = {x | x is the date of assessment},
Word = {x | x is a string with the word},
Positive = {x | x is a boolean with 1 if it is positive},
Negative = {x | x is a boolean with 1 if it is negative},
Neutral = {x | x is a boolean with 1 if it is neutral},
Noun = {x | x is a boolean with 1 if it is a noun},
Adjective = {x | x is a boolean with 1 if it is an adjective},
Verb = {x | x is a boolean with 1 if it is a verb},
Adverb = {x | x is a boolean with 1 if it is an adverb},
Language = {x | x is a string with the name of the language}.

3.2. SelectorLanguages

This identifies the language in a subset of lexical entries in order to search for specific words in four languages: Chinese, Spanish, English, and Portuguese. The result is the cluster of sentiment lexicons arranged by language, as shown in (1).
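The grouping step can be illustrated with a minimal Python sketch. The function name `select_languages`, the short lexicon keys, and the dictionary layout are assumptions made for illustration only; the language assignments follow Table 1.

```python
from collections import defaultdict


def select_languages(lexicons):
    """Group sentiment lexicons into clusters keyed by language.

    `lexicons` maps a lexicon name to the list of languages it covers
    (a hypothetical layout, not the paper's data structure).
    """
    clusters = defaultdict(list)
    for name, languages in lexicons.items():
        for language in languages:
            clusters[language].append(name)
    return dict(clusters)


# Language coverage as listed in Table 1.
lexicons = {
    "SentiWordNet": ["English"],
    "BingLiu": ["English"],
    "MPQA": ["English"],
    "NTUSD": ["Chinese"],
    "PanAmerican": ["English", "Spanish", "Portuguese"],
    "STSL": ["Spanish"],
}

clusters = select_languages(lexicons)
```

In this sketch a multilingual lexicon such as the PanAmerican one simply appears in several clusters, so the English cluster ends up with four members.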

Consider

3.3. MetricSearcher

It is responsible for selecting the strength of polarity label of each of the sentiment lexicon clusters, for example, “PriorPolarity,” “PosScore,” “NegScore,” and “ScoreSubjectivity,” among others. In addition, it searches for the strength of polarity and indicates whether the values are numerical or nominal. It splits its result into two knowledge bases: MetricsLexicon and NoMetricsLexicon.
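As a rough illustration of this split, the following Python sketch scans each entry for one of the metric labels named above and records whether the value is numerical or nominal. The record layout, the function names, and the sample rows are hypothetical and are not taken from the source lexicons.

```python
# Metric labels mentioned in the text; a real lexicon may use others.
METRIC_LABELS = ("PriorPolarity", "PosScore", "NegScore", "ScoreSubjectivity")


def metric_kind(value):
    """Classify a metric value as 'numerical' or 'nominal'."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "nominal"


def search_metrics(entries):
    """Split entries into those with and without an associated metric."""
    metrics_lexicon, no_metrics_lexicon = [], []
    for entry in entries:
        found = {
            label: metric_kind(entry[label])
            for label in METRIC_LABELS
            if label in entry
        }
        if found:
            metrics_lexicon.append((entry["Word"], found))
        else:
            no_metrics_lexicon.append(entry["Word"])
    return metrics_lexicon, no_metrics_lexicon


# Invented sample rows, for illustration only.
entries = [
    {"Word": "abandon", "PriorPolarity": "negative"},
    {"Word": "able", "PosScore": 0.125, "NegScore": 0.0},
    {"Word": "good"},
]

metrics_lexicon, no_metrics_lexicon = search_metrics(entries)
```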

Consider

3.4. MetricTransformer

The MetricTransformer works by transforming the strength of polarity nominal value into the real value of each sentiment lexicon. It has two variables: type and pos. Type can take two values: and . Pos can take four values: , , , and . The new strength of polarity is the multiplication between the two variables.
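The multiplication of the two variables can be sketched in Python as follows. The numeric weights are placeholders chosen for illustration, because the values assigned by the authors are not available in this text. The labels are the `type` and prior-polarity labels of the MPQA lexicon, and treating the four-valued variable as the prior polarity is an assumption of this sketch.

```python
# Hypothetical weights: NOT the values used in the paper.
TYPE_WEIGHT = {"strongsubj": 1.0, "weaksubj": 0.5}
POLARITY_SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0, "both": 0.0}


def transform_metric(type_label, polarity_label):
    """Turn a nominal (type, polarity) pair into a real-valued strength."""
    return TYPE_WEIGHT[type_label] * POLARITY_SIGN[polarity_label]


strong_negative = transform_metric("strongsubj", "negative")
weak_positive = transform_metric("weaksubj", "positive")
```

Under these placeholder weights a strongly subjective negative clue maps to the lower end of the scale and a weakly subjective positive clue to a value halfway between neutral and the upper end.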

Consider

3.5. SentimentLexiconIntersection

This component computes the intersection of all the lexical entries of each sentiment lexicon cluster. It aims to identify which lexical entries appear more than once in order to select them for processing.

Therefore, the two knowledge bases are IntersectionLexicalEntries and NoIntersectionLexicalEntries. For example, the cluster of sentiment lexicons grouped by the English language has four elements: EnglishCluster = {PanAmericanSentimentLexicon, BingLiuSentimentLexicon, SentiWordNet, MPQALexicon}. These intersections are shown in (10).
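A minimal Python sketch of this step, assuming each lexicon is held as a word-to-score dictionary (a hypothetical layout), counts in how many lexicons of the cluster each word occurs and keeps those that occur at least twice. The words and scores below are invented for illustration.

```python
from collections import Counter


def intersect_lexicons(cluster):
    """Return (entries found in >= 2 lexicons, entries found in only one)."""
    counts = Counter()
    for entries in cluster.values():
        counts.update(set(entries))  # each lexicon counts a word once
    intersection = {word for word, n in counts.items() if n >= 2}
    no_intersection = set(counts) - intersection
    return intersection, no_intersection


# Invented toy data; the lexicon names follow EnglishCluster in the text.
english_cluster = {
    "PanAmericanSentimentLexicon": {"win": 1.0, "lose": -1.0},
    "BingLiuSentimentLexicon": {"win": 1.0, "awful": -1.0},
    "SentiWordNet": {"win": 0.5, "awful": -0.75, "table": 0.0},
    "MPQALexicon": {"lose": -0.5},
}

intersection, no_intersection = intersect_lexicons(english_cluster)
```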

Consider

3.6. LexicalEntriesSubstracter

This obtains the remaining elements that have been assessed by each university. It subtracts IntersectionLexicalEntries from MetricsLexicon.

Consider

3.7. LexicalEntriesDivisor

It has as its input the intersection of all the lexical entries. It divides the knowledge base into equal parts for processing.
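One simple way to obtain balanced loads is to give every worker the same base number of entries and hand the remainder out one entry at a time. The Python sketch below shows that scheme; the function name and the choice of contiguous slices are assumptions, not the authors' implementation.

```python
def divide_entries(entries, n_workers):
    """Split `entries` into `n_workers` contiguous chunks whose sizes
    differ by at most one."""
    base, extra = divmod(len(entries), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(entries[start:start + size])
        start += size
    return chunks


small = divide_entries(list(range(10)), 3)

# Entry and core counts quoted in the abstract.
chunks = divide_entries(list(range(155802)), 1344)
sizes = {len(chunk) for chunk in chunks}
```

With the 155802 lexical entries and 1344 cores quoted in the abstract, this scheme yields chunks of 115 or 116 entries per core.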

Consider

3.8. UnifiedMetrics

This performs an estimate of each lexical entry of the IntersectionLexicon in order to predict its value. There are two procedures: (1) and (2) .

uses a Pearson correlation formula as shown in (13) applied between the Unified Sentiment Lexicon and each of the sentiment Lexicons by cluster.

algorithm in detail is explained in Section 4.
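For reference, the standard Pearson correlation coefficient between two equally long score vectors can be computed as in the Python sketch below. How the authors pair the scores of a sentiment lexicon with those of the USL is not spelled out here, so the input vectors are left generic and the function name is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


perfect = pearson([1, 2, 3], [2, 4, 6])
inverse = pearson([1, 2, 3], [3, 2, 1])
```

The two calls illustrate the extremes described in the abstract: perfectly correlated and perfectly inversely correlated inputs.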

Consider

3.9. UnionSentimentLexiconEngine

Its function is to join together all the resulting knowledge bases of the coprocessors; the output is the Unified Sentiment Lexicon.

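Consider
\[ \text{UnifiedSentimentLexicon} = \text{UnifiedSentimentLexicon}_1 \cup \cdots \cup \text{UnifiedSentimentLexicon}_n, \]
where
\[ \text{UnifiedSentimentLexicon}_1 \cup \cdots \cup \text{UnifiedSentimentLexicon}_n = \{x \mid x \in \text{UnifiedSentimentLexicon}_1 \text{ or } \ldots \text{ or } x \in \text{UnifiedSentimentLexicon}_n\}. \]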

3.10. Lexicon2Ontology Converter

Its main function is to transform the Unified Sentiment Lexicon into a domain ontology, OntoLexicon, defined as follows.

3.10.1. OntoLexicon

The OntoLexicon Ontology is a conceptual description based on a lexicon of the subjective words in Natural Language as shown in (15). The OntoLexicon Ontology consists of four disjoint sets , , , and , where means concept identifiers (16), means relation identifiers (17) and (18), means attribute identifiers (19), and means data types (20).

Consider

The set of concepts is

The set of relations is where the relation hierarchy defines that Lexical has the relation entry_of that belongs to SentimentLexicon. Corpora has the relation document of that belongs to documents, following the same logic where the rest of the relations are defined as

The set of attribute identifiers is

The set of datatypes, which contains only one element, a string, is

The first axiom defines the concept NegativeAdverbs as equivalent to saying that there is a negative adverb which stands in a relation with the corresponding sentence, following the same logic where the rest of the axioms are defined as

4. Algorithm in Detail

The input of a Unified Sentiment Lexicon (USL) approach consists of the sentiment lexicons that are available on the web and supported by universities. The USL approach then processes all of them. The result is the Unified Sentiment Lexicon (USL). Here, we will describe the algorithm in detail.

The first step is to group the sentiment lexicons into clusters by language as follows:

The second step is to search lexical entries that have been assessed by each sentiment lexicon. In the assessment task, some authors and their methods have used nominal values, while others have used real values. If they are linguistic values, then the USL approach transforms them into real values. There must then be an intersection of lexical entries in at least two sentiment lexicons in order to unify the strength of polarity of several sentiment lexicons into one.

Following this, our approach calculates the Pearson correlation score between each sentiment lexicon and the USL by obtaining as many constants as there are sentiment lexicons in the cluster. For example, if the cluster belongs to the English language, then there are four constants that fall into each sentiment lexicon, as shown in the Pearson correlation set . This calculation is performed only once and executed by the CPU.

Since the number of lexical entries is high, the computation of the USL score should be divided into several coprocessors (cores) in order to accelerate the process. In fact, each coprocessor of the GPU has as an input: (a) the strength of polarity of n lexical entries and (b) the vector with Pearson values. Each coprocessor computes the strength of polarity of every lexical entry until there are no lexical entries left. The score for each lexical entry is multiplied by the Pearson correlation between all the sentiment lexicons, as shown in (23) and Table 2.


[Table 2: columns are Words, Sentiment (one column per sentiment lexicon in the cluster, four in total), and USL.]


Consider

In addition, the USL approach computes the total sum of subjectivity, as shown in (24) and Table 2.

Consider

The USL score is normalized by dividing the total subjectivity of each lexical entry by the sum of the Pearson correlations for the lexical entries that were assessed, as follows:
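Read literally, the steps described in this section amount to a Pearson-weighted average over the lexicons that actually assessed the entry. The following Python sketch implements that reading; the function name, the use of `None` for a missing assessment, and the toy numbers are assumptions for illustration and do not reproduce the authors' data.

```python
def unified_score(scores, pearson):
    """Pearson-weighted average of the available polarity scores.

    scores  -- lexicon name -> strength of polarity, or None if the
               lexicon did not assess this entry
    pearson -- lexicon name -> Pearson correlation of that lexicon
    """
    weighted_sum = 0.0   # sum of score * correlation over assessed lexicons
    weight_total = 0.0   # sum of the correlations of those same lexicons
    for lexicon, score in scores.items():
        if score is None:
            continue
        weighted_sum += score * pearson[lexicon]
        weight_total += pearson[lexicon]
    if weight_total == 0:
        return None  # no assessment available, or the weights cancel out
    return weighted_sum / weight_total


# Toy values only.
pearson = {"A": 0.75, "B": 0.25, "C": 0.5}
score = unified_score({"A": 1.0, "B": 0.0, "C": None}, pearson)
```

Here lexicon C is skipped because it did not assess the entry, so only the correlations of A and B enter the denominator.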

The GPU returns the lexical entries together with their USL scores, which the CPU takes as input. The main function of the CPU at this stage is to join all the partial results into the USL.
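The division of work between the coprocessors and the CPU can be mimicked on a CPU with a thread pool, as in the Python sketch below. Threads stand in for GPU cores purely for illustration; the function names, the chunk layout, and the toy data are assumptions, and this is not the authors' CUDA implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(chunk, pearson):
    """Worker: compute {word: unified score} for one chunk of entries."""
    partial = {}
    for word, scores in chunk:
        assessed = {lex: s for lex, s in scores.items() if s is not None}
        total = sum(pearson[lex] for lex in assessed)
        if total != 0:
            weighted = sum(s * pearson[lex] for lex, s in assessed.items())
            partial[word] = weighted / total
    return partial


def unify(chunks, pearson, max_workers=4):
    """Host side: dispatch the chunks, then join the partial results."""
    usl = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(lambda c: score_chunk(c, pearson), chunks):
            usl.update(partial)
    return usl


# Toy values only.
pearson = {"A": 0.75, "B": 0.25}
chunks = [
    [("win", {"A": 1.0, "B": 1.0})],
    [("lose", {"A": -1.0, "B": None}), ("table", {"A": 0.0, "B": 0.0})],
]
usl = unify(chunks, pearson)
```

Each worker receives its chunk of entries and the shared vector of Pearson values, which mirrors the two inputs of each coprocessor described above.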

Finally, the CPU transforms the USL into an ontology called OntoLexicon in OWL language.

The pseudocode of the main USL approach is shown in Algorithm 1, and some of the procedures of the USL approach are shown in Algorithm 2.

(1) procedure UnifiedSentimentLexicon( )
(2)   ;
(3)   ;
(4)  for     do
(5)   for     do
(6)    for   ,    do
(7)     if     then
(8)       ;
(9)     else
(10)       ;
(11)     end if
(12)     if     then
(13)       ;
(14)     end if
(15)    end for
(16)   end for
(17)  end for
(18)  if     then
(19)    ;
(20)  else
(21)    ;
(22)  end if
(23)  
(24)  ( );
(25)   ;
(26)  for   ,     do
(27)   for     do
(28)     ;
(29)     ;
(30)     ;
(31)   end for
(32)  end for
(33)   (UnifiedSentimentLexicon);
(34) end procedure

(1) procedure FocusCrawlerEngine
(2)  for     do
(3)    ;
(4)  end for
(5) end procedure
(6) procedure  SelectorLanguages( )
(7)  for     do
(8)   switch     do
(9)   case  
(10)     assert( )
(11)    case  
(12)    assert(