Abstract

With the rapid growth of websites and web forums, a large number of product reviews are available online. An opinion mining system is needed to help people evaluate the emotions, opinions, attitudes, and behavior of others and to make decisions based on user preferences. In this paper, we propose an optimized feature reduction that incorporates an ensemble of machine learning approaches and uses information gain and a genetic algorithm as feature reduction techniques. We conducted comparative experiments on a multidomain review dataset and a movie review dataset in opinion mining. The effectiveness of the single classifiers Naïve Bayes, logistic regression, and support vector machine and of the ensemble technique for opinion mining is compared on five datasets. The proposed hybrid method is evaluated, and the experimental results using information gain and the genetic algorithm with the ensemble technique perform better in terms of various measures for the multidomain and movie reviews. The classification algorithms are evaluated using McNemar's test to compare the level of significance of the classifiers.

1. Introduction

A basic task in sentiment classification is classifying the polarity of a given text at the document, sentence, or feature level, that is, whether the opinion expressed in a review document, a sentence, or an entity feature is positive, negative, or neutral. The WWW is a frequently used medium for exchanging user review opinions about products, movies, and music. It provides review text containing consumer opinions, emotions, and service opinions stored in websites, blogs, and web forums. Nowadays, the number of review websites, web forums, and blogs is growing rapidly. Blogs are used to store important text, and users express their emotions, feelings, and opinions through blogs [1].

Sentiment analysis is an application of natural language processing and text analytics that identifies and extracts subjective information from source materials. It aims to determine the attitude of a writer with respect to some topic or the overall polarity of a document [13]. The attitude may be his/her judgment, affective state, or intended emotional communication. Websites are used to express end-user opinions, emotions, and sentiments about product reviews, movie reviews, and music and story reviews. Sentiment analysis, or opinion mining, plays an important role because it is difficult to analyze large amounts of information individually.

Sentiment analysis not only helps individuals but also helps businesses and organizations evaluate sentiments or opinions. Opinions about a product, based on customer behavior, help organizations during the decision-making process.

To automate sentiment classification, several approaches have been applied to review documents. These include natural language processing and machine learning algorithms, such as maximum entropy, support vector machine, Naïve Bayes, k-nearest neighbor, and decision tree algorithms, combined with feature selection methods to predict the polarity (positive, negative, or neutral) of user reviews, opinions, and emotions [4–11].

In this paper, we apply several supervised machine learning algorithms to opinion mining. A genetic algorithm is a search and optimization algorithm; here it is used as an optimized feature selection method and integrated with ensemble methods to improve performance and overcome the limitations of traditional methods. Optimization is the process of finding the best, or an optimal, solution for sentiment classification.

The proposed approach is based on machine learning and uses the information gain feature reduction technique and an optimized feature selection, a genetic algorithm, that incorporates bagging with the TF-IDF weighting scheme. The proposed method is evaluated, and the experimental results using information gain and the genetic algorithm with the ensemble technique indicate higher performance. The feature reduction method based on information gain decides the importance of a feature in the movie review and multidomain datasets. Its disadvantage is that it selects a large number of features and does not account for redundancy among them; this can be mitigated by using an optimized feature selection. Our main objective is to design and develop a new classification algorithm that improves the performance of sentiment classification.

The rest of the paper is organized as follows. Section 2 presents the state of the art related to this study. Section 3 gives the problem outline. Section 4 presents our proposed feature reduction methods. Section 5 gives the methodology. Section 6 presents the evaluation models used in this study. In Section 7 we discuss the empirical results, and Section 8 gives the conclusion of the study and future research directions.

2. Related Work

Several techniques have been used for opinion mining tasks. The following works relate to this study. The field of machine learning has provided many models that are used to solve various sentiment classification problems.

Among them are support vector machines, Naïve Bayes, decision trees, maximum entropy, and hidden Markov models. So far, the most popular machine learning approaches used as baselines are the support vector machine (SVM) and Naïve Bayes (NB) [2].

In the Pang et al. [2] study, several machine learning algorithms were analyzed on a movie review dataset, together with different feature selection techniques. They used a binary unigram representation of patterns and directly applied the machine learning techniques: training patterns are represented by the presence or absence of words instead of counts of word occurrences in the documents. They utilized Naïve Bayes (NB), maximum entropy (ME), and support vector machines (SVM) and reported the best performance with the SVM method using a unigram text representation on the movie review dataset, achieving 82.9% accuracy, while the NB method gave lower accuracy.

In a later study, Pang and Lee [3] propose first separating subjective sentences from the rest of the text. They assume that two consecutive sentences will have similar subjectivity labels, as an author is inclined not to change sentence subjectivity too often. Thus, labeling all sentences as objective or subjective is reformulated as the task of finding the minimum s-t cut in a graph. They carried out experiments on movie reviews and movie plot summaries mined from the Internet Movie Database (IMDb), achieving an accuracy of around 85%.

To use prior knowledge beyond a document, Mullen and Collier [5] attempted to use the semantic orientation of words defined by Pang et al. [2] and several kinds of information from the Internet and a thesaurus. They evaluated the same dataset used in the Pang et al. [2] study and achieved 75% accuracy with lemmatized word unigrams and the semantic orientation of words.

Wiebe et al. [6] used review data for automobiles, banks, movies, and travel destinations. They classified words into two classes (positive or negative) and counted the overall positive or negative score for the text: if a document contains more positive than negative terms, it is assumed to be a positive document; otherwise, it is negative. These classifications operate at the document and sentence level. They are useful and improve the effectiveness of sentiment classification but cannot find what the opinion holder liked or disliked about each feature.

Zhang et al. [7] use customer feedback reviews and product reviews, applying a decision tree learning method for sentiment classification. Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be represented as sets of if-then rules to improve human readability. These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.

Chen and Chiu [12] proposed a Neural Network (NN) based index, which combines the advantages of machine learning techniques and semantic orientation indices to effectively classify sentiment. Tao and Tan [13] used emotional function words instead of emotional keywords to evaluate emotional states. Hu and Liu [14] used adjective synonym sets and antonym sets in WordNet to judge the semantic orientations of adjectives.

Ye et al. [10] report an evaluation of three supervised machine learning algorithms, Naïve Bayes, SVM, and a character-based N-gram model, for sentiment classification of reviews; they reported that all three approaches reached accuracies of at least 80% and that the SVM and N-gram approaches outperformed the Naïve Bayes approach.

Zhang et al. [15] proposed a lexicon-enhanced method for sentiment classification by combining machine learning and semantic orientation approaches into one framework. Specifically, they used the words with semantic orientations as an additional dimension of features for the machine learning classifiers. In general, sentiment analysis is concerned with the analysis of direction-based text, that is, text containing opinions and emotions. Sentiment classification studies attempt to determine whether a text is objective or subjective or whether a subjective text contains positive or negative sentiments. The common two-class problem involves classifying sentiments as positive or negative [3, 16]. Additional variations include classifying sentiments as opinionated/subjective or factual/objective [6]. Some studies have attempted to classify emotions, including happiness, sadness, anger, and horror, instead of sentiments.

In Xia et al. [17], an ensemble framework is applied to sentiment classification tasks with the aim of integrating different feature sets and different classification algorithms to produce a more accurate classification procedure. The authors applied two types of feature sets for opinion mining and three well-known text classification algorithms, namely, Naïve Bayes, maximum entropy, and support vector machines, which are employed as base classifiers for each of the feature sets; three types of ensemble methods, namely, the fixed combination, the weighted combination, and the meta-classifier combination, are evaluated as ensemble strategies.

Liu et al. [18] designed and developed a movie rating and review summarization system in a mobile environment. They used a sentiment classification approach based on Latent Semantic Analysis (LSA) to identify product features.

Hai et al. [19] proposed a method to identify opinion features from online reviews using one domain-specific corpus as well as one domain-independent corpus. They used a measure called domain relevance, which characterizes the relevance of a term for a text collection. They used syntactic dependency rules to extract a list of candidate opinion features from the domain review corpus and then estimated each candidate's intrinsic and extrinsic domain relevance scores on the domain-dependent and domain-independent corpora. Candidate features that are less generic are taken as opinion features.

Kalaivani and Shunmuganathan [20, 21] examine how a classifier works with various sizes of feature sets. In this study, the information gain feature reduction method is applied to reduce the original feature set by removing irrelevant features for sentiment classification of movie reviews and to select the top k% ranked attributes for training the classifier.

The method also evaluated the accuracy on movie domain datasets and used different feature weighting schemes along with the information gain feature selection method. They compared three supervised machine learning approaches, SVM, Naïve Bayes, and KNN, for sentiment classification of movie reviews.

2.1. Motivation

Automatic classification of sentiment is important for numerous applications such as opinion mining, opinion summarization, contextual advertising, and market analysis. Sentiment classification has been modeled as the problem of training a binary classifier using reviews labeled positive or negative. It is a growing field of research, driven by both commercial applications and academic interest. Sentiment analysis is used for identifying the rate of accuracy in positive, negative, and neutral reviews.

Various studies show sentiment classification on product reviews using machine learning algorithms [3, 17, 22–24]. This motivates us to conduct opinion mining on multidomain product reviews and movie reviews. Information gain is a popular feature reduction technique and is used in opinion mining [12]. The sentiment classification literature does not report any work combining an optimized feature reduction technique, information gain, and an ensemble method. In this study, we used information gain, an optimized feature reduction technique (genetic algorithm), and the ensemble methods bagged SVM and Bayesian boosting NB to perform the opinion mining task.

The main contribution of this study is to find the effect of unigram features and joint features. To build our opinion mining model, we used unigram and bigram features. In Test I only unigrams are used as features, and in Test II unigrams and bigrams are used as features for classification.

For each test, various machine learning algorithms, NB, LR, SVM, and the ensemble methods bagged SVM and Bayesian NB, are used to conduct the experiment. The accuracy results and overall error rates are compared. The comparative results show that the hybrid model gives better results than a single classifier.

3. Problem Outline

In this work, various machine learning algorithms are applied to classify the documents and to label opinions as positive or negative. To overcome drawbacks such as unstable outcomes in unigram feature selection, IG is integrated with a genetic optimized feature selection for the supervised classification algorithms. Information gain feature reduction is applied to the dataset to extract the features relevant to the domain. The reduced attributes are further analyzed to eliminate irrelevant attributes using the optimized feature selection based on the attribute weights. The attribute weight relation is set to the top k%, and the value is set to 0.7. This section describes the opinion mining problem. The prediction model is as follows.

Input is as follows:  The review dataset D, a set of training reviews; a classifier is used as the learning scheme. In this work, we used three machine learning classifiers, NB, LR, and SVM, and a hybrid model.

Output is as follows:  A predicted model.

Method is as follows:
(a) To prepare the review documents, we performed tokenization, transformed all characters to lower case, applied stemming, and filtered stop words. The tokenization operation splits the review documents into a sequence of words. The filter-stopword operation removes from the review documents every word that appears in a predefined stop word list.
(b) The TF-IDF feature measure scheme converts the text into a representation vector (see the sketch after this list):
(i) Test I uses unigrams with TF-IDF.
(ii) Test II uses unigrams and bigrams with TF-IDF.
(c) Stratified sampling creates a random review subset of the whole document collection.
(d) Evaluate the performance of the SVM classifier with and without IG and optimized selection.
(e) Each test uses the IG feature reduction technique and an optimized feature selection, a genetic algorithm, that incorporates bagging and Bayesian boosting with the TF-IDF weighting scheme.
(f) Calculate the relevance of attributes based on information gain and assign attribute weights to them accordingly.
(g) Select attributes from the input words whose weight satisfies the specified condition (highest weight, top 7%) with respect to the input weight.
(h) Remove useless attributes.
(i) The resulting dataset is used as a training dataset for the learning models:
(1) Develop a model using Naïve Bayes.
(2) Develop a model using logistic regression.
(3) Develop a model using support vector machine.
(4) Develop a model using IG, optimized feature selection (GA), and an ensemble bagging technique incorporating support vector machine.
(5) Develop a model using IG, optimized feature selection (GA), and an ensemble Bayesian boosting technique incorporating Naïve Bayes.
(j) The effectiveness of each model is evaluated, and the prediction model is compared with the baseline method.
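A minimal sketch of steps (a)–(c) is given below, assuming scikit-learn and NLTK are available; the Porter stemmer, the English stop word list, and the 90/10 stratified split are illustrative choices, not settings taken from the paper.

```python
# Sketch of preprocessing and vectorization (steps (a)-(c)).
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()

def stem_tokens(doc):
    # Step (a): the vectorizer lower-cases the text first; here we drop
    # stop words and stem each remaining token.
    return [stemmer.stem(t) for t in doc.split() if t not in ENGLISH_STOP_WORDS]

# Step (b): Test I uses unigrams; Test II uses unigrams + bigrams.
vectorizer_test1 = TfidfVectorizer(lowercase=True, tokenizer=stem_tokens,
                                   ngram_range=(1, 1))
vectorizer_test2 = TfidfVectorizer(lowercase=True, tokenizer=stem_tokens,
                                   ngram_range=(1, 2))

# reviews: list of raw review strings; labels: 1 = positive, 0 = negative
# (placeholders for the movie-review / multidomain corpora).
def vectorize(reviews, labels, vectorizer):
    X = vectorizer.fit_transform(reviews)
    # Step (c): stratified sampling keeps the class balance in both splits.
    return train_test_split(X, labels, test_size=0.1, stratify=labels)
```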

3.1. Corpora Description

Users' opinions are valuable sources of data that help improve the quality of the service rendered. Blogs, review sites, and microblogs are some of the platforms where users express their opinions.

To conduct the study, movie review and multidomain datasets are considered. The Cornell movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data) contains 1000 positive reviews and 1000 negative reviews. The multidomain dataset (http://www.cs.jhu.edu/~mdredze/datasets/sentiment) contains product reviews for books, DVDs, electronics, and kitchen appliances; each of these domains contains 1000 positive and 1000 negative reviews.

In order to obtain a reduced feature set for our problem, we applied stratified sampling to each domain. The number of samples, the numbers of positive and negative reviews, the total number of attributes, the attributes remaining after applying information gain, and the attributes remaining after applying the optimized feature reduction for each classification algorithm are given in Tables 1 and 2. Word vector models are developed for the unigram and joint feature representations of the data source: Test I uses only unigrams, while Test II represents each review as a word vector that uses unigram and bigram attributes.

4. Proposed Sentiment Classification Using Genetic Algorithm

The main objectives of feature selection are to reduce the number of features and the computational cost and to improve classification performance. It has been shown that feature reduction removes irrelevant and redundant features and speeds up the learning task, thereby improving the efficiency of sentiment classification.

In this study, we use the movie review dataset and the multidomain dataset for evaluation, which involves splitting the available dataset into a training set and a testing set. We used a genetic algorithm that incorporates various machine learning algorithms to improve the performance of feature selection. Generally, we applied the NB, LR, and SVM algorithms to the training set and evaluated the resulting model on the test set. Most of the existing work shows that support vector machines and Naïve Bayes are effective methods for sentiment classification [2, 3, 22, 25–27], so the SVM and NB classifiers are used as base classifiers in our approaches.

The accuracy is measured using the SVM classification algorithm with and without information gain and the optimized feature reduction (GA). Figure 1 shows the performance of the SVM algorithm; the accuracy is better with the IG feature reduction and the optimized selection. Most of the related work shows that SVM outperformed the other machine learning algorithms [25, 22, 26], and in our work SVM is one of the base classifiers.

4.1. Sentiment Analysis with Different Learning Tests

Generally, before building the sentiment analysis system, we first need to decide which learning model should be used to construct the feature selection model. We propose the use of a genetic algorithm to improve the performance of opinion mining and to address the problems in sentiment analysis. The framework consists of two steps: learning type evaluation and sentiment analysis.

In the first test, the learning type evaluation stage, the performance of the learning types is evaluated on the multidomain review dataset and the movie review dataset to decide which learning type performs better for sentiment analysis; we need to select the best learning type from a set of different learning types. Ten different learning types were considered, according to the feature selection and machine learning algorithms:
(a) Two feature selection methods are as follows:
(i) Information gain is one of the important feature selection measures used in sentiment classification and has outperformed other feature selection methods [4, 10, 16]. It is based on the value or weight of the information contained in reviews and selects important features with respect to the class attribute. The weight of each attribute with respect to the class is calculated using information gain and varies from 0 to 1; the higher the weight of an attribute, the greater the information gain (see the sketch after this list).
(ii) In this study, the genetic algorithm uses a heuristic method to assign weights to the sentiment words or attributes [25, 28].
(b) Three machine learning algorithms are as follows: NB, LR, and SVM.
(c) Two ensemble methods are as follows: bagging and Bayesian boosting.
(d) Ten learning types are as follows. The combination of the IG and GA feature selection, the three machine learning algorithms, and the ensemble techniques gives a total of 10 different learning tests. In Model I the word vector model is represented by unigrams, and Model II uses unigram and bigram attributes.
(i) Model I:
unigram + IG + GA + NB;
unigram + IG + GA + LR;
unigram + IG + GA + SVM;
unigram + IG + GA + BSVM;
unigram + IG + GA + BNB.
(ii) Model II:
joint feature + IG + GA + NB;
joint feature + IG + GA + LR;
joint feature + IG + GA + SVM;
joint feature + IG + GA + BSVM;
joint feature + IG + GA + BNB.
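The information gain weighting in item (a)(i) can be sketched as follows; scikit-learn's mutual_info_classif is used here as a stand-in for information gain (the two measures agree up to the estimation method), and the 0.7 threshold on normalized weights mirrors the top-k% setting described in Section 3.

```python
# Hedged sketch of information-gain feature weighting (item (a)(i)).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def ig_select(X, y, weight_threshold=0.7):
    weights = mutual_info_classif(X, y)    # weight of each attribute w.r.t. class
    if weights.max() > 0:
        weights = weights / weights.max()  # rescale to [0, 1], as in the text
    keep = np.where(weights >= weight_threshold)[0]
    return X[:, keep], keep                # reduced matrix + kept column indices
```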

4.2. Improving the Efficiency of Hybrid Genetic Algorithm

Incorporating a local search into a genetic algorithm can increase the efficiency of the algorithm. The efficiency of the searching process is increased in terms of the time required to reach a global optimal solution and memory needed to process the population. The major steps in this study are as follows.

4.2.1. Initial Population

In a GA, an initial population of n strings is randomly generated; the collection of such strings is called the initial population [23]. The information gain feature weights are used to seed the solution strings in the initial population. Each solution is represented as a binary string: 1 represents a selected attribute or feature and 0 represents a discarded one. A random population of individuals is generated, in which each attribute is switched on with probability Pi. In this study, the population size is set to 50 and the Pi value is set to 0.1.

4.2.2. Selection

To evaluate the quality of each solution, classification accuracy is used as the fitness function. For each solution in the population, tenfold cross validation with the classification algorithm is used to assess the fitness of that particular solution. Solutions for the next iteration are selected probabilistically; in this study, tournament selection is used as the selection scheme. The size of the tournament specifies the fraction of the current population that should be used as tournament members and is set to 0.05. There are several population replacement methods, such as the generational replacement method and the steady-state method. In the generational replacement method, the entire population is replaced in every iteration, whereas in the steady-state method only a fraction of the population is replaced in every iteration.

4.2.3. Crossover

Crossover is the process of exchanging information between two parents to produce new offspring. Two individuals are chosen from the population, and crossover is performed based on a crossover probability Pc, which is set to 0.6. Different crossover types, such as single-point, uniform, and shuffle crossover, can be used. We use uniform crossover, selecting two individuals and swapping substrings at a randomly determined crossover point x. If the mixing ratio is 0.5, then half of the genes in the offspring come from parent 1 and half come from parent 2. Mutation randomly flips individual feature characters in a solution string based on a fixed probability Pm. The mutation probability is set to 0.01.
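The three GA steps above can be combined into one loop; the following is a minimal sketch under the stated settings (population size 50, Pi = 0.1, tournament fraction 0.05, Pc = 0.6, Pm = 0.01, at most 150 generations, as reported in Section 7). The linear SVM fitness evaluator and generational replacement are illustrative assumptions.

```python
# Sketch of the GA-based feature selection of Sections 4.2.1-4.2.3.
# Fitness = 10-fold cross-validation accuracy of a linear SVM on the
# columns selected by the candidate bit mask.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    if not mask.any():
        return 0.0
    return cross_val_score(LinearSVC(), X[:, mask.nonzero()[0]], y, cv=10).mean()

def tournament(pop, fits, frac=0.05):
    # Tournament selection over a 5% fraction of the population.
    k = max(2, int(frac * len(pop)))
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmax(fits[idx])]]

def ga_select(X, y, n_features, pop_size=50, p_init=0.1,
              pc=0.6, pm=0.01, generations=150):
    # Initial population: each attribute switched on with probability Pi.
    pop = rng.random((pop_size, n_features)) < p_init
    for _ in range(generations):
        fits = np.array([fitness(ind, X, y) for ind in pop])
        children = []
        while len(children) < pop_size:
            a, b = tournament(pop, fits), tournament(pop, fits)
            if rng.random() < pc:                  # uniform crossover
                swap = rng.random(n_features) < 0.5
                a = np.where(swap, b, a)
            a = a ^ (rng.random(n_features) < pm)  # bit-flip mutation
            children.append(a)
        pop = np.array(children)                   # generational replacement
    fits = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(fits)]                    # best feature mask
```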

5. Methodology

Opinion mining is conducted at any of three levels: the document level, the sentence level, and the attribute level. In this study, we applied supervised machine learning models for sentiment classification of the selected movie reviews and product reviews. The models are the NB, LR, and SVM algorithms together with the genetic algorithm, which uses information gain as a feature reduction technique. In this work, NB, LR, SVM, and the hybrid model are applied to classify the documents and label opinions as positive or negative.

5.1. Naïve Bayes

The basic idea is to find the probabilities of the categories given a review document by using the joint probabilities of words and categories. It is based on the assumption of words being conditionally independent.

The starting point is Bayes' theorem for conditional probability. Let D be the training dataset of reviews and their associated class labels, and let each review be represented by an attribute vector X = (x1, x2, ..., xn). The classification derives the maximum a posteriori class, that is, the class Ci maximizing

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)},$$

where class C1 is positive and C2 is negative. Assuming that the probability of each attribute occurring in a given class is independent of the other attributes, we estimate P(X | Ci) as follows:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i).$$

Training a Naïve Bayes classifier, therefore, requires calculating the conditional probabilities of each attribute occurring in the classes, which can be estimated from the training dataset.
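As a concrete illustration, these prior and conditional probabilities can be estimated with a multinomial Naïve Bayes model; X_train, y_train, and X_test are assumed to come from the earlier vectorization step, and the Laplace smoothing value is scikit-learn's default, not a setting from the paper.

```python
# Multinomial Naive Bayes: P(C_i) and P(x_k | C_i) are estimated from
# the training counts with Laplace smoothing (alpha = 1).
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)     # learns log P(C_i) and log P(x_k | C_i)
pred = nb.predict(X_test)    # argmax_i P(C_i) * prod_k P(x_k | C_i)
```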

5.2. Logistic Regression

Logistic regression is one of the standard techniques of applied statistics and discrete data analysis. It is based on maximum likelihood estimation and is used to predict the positive or negative class response from the positive and negative attributes, that is, to predict the outcome of the class label based on those attributes.

5.3. Support Vector Machine

In this work, the support vector machine classification algorithm is applied to classify the review documents and label opinions as positive or negative. It has been shown to be an effective classification algorithm, is widely used in sentiment analysis, and has outperformed other classification algorithms [29]. The SVM finds a separating hyperplane using support vectors. This approach was developed by Vladimir Vapnik, Bernhard Boser, and Isabelle Guyon in 1992.
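The two base learners of Sections 5.2 and 5.3 can be trained in the same way as the Naïve Bayes model above; the linear kernel and the iteration cap in this sketch are assumptions, since the text does not specify them.

```python
# Logistic regression (maximum-likelihood) and an SVM base learner
# trained on the same reduced feature matrix.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
svm = SVC(kernel="linear").fit(X_train, y_train)
```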

5.4. Bagging Technique

The bagging technique is used to improve the classification model in terms of classification accuracy. The basic idea of this technique is to construct ensemble members from the training dataset: bootstrap aggregating splits the training dataset into several new training datasets by sampling, and a model is built on each new training dataset. For a training dataset of size N, bagging generates new training datasets of size N by sampling with replacement. A classifier is trained on each new training dataset; each new training dataset is equal in size to the original training dataset, and the bagging technique produces better results than a single model [17, 24, 30]. We obtained the best accuracy using unigrams, the TF-IDF feature weighting scheme, and information gain feature selection with 10-fold cross validation. The idea of improving supervised classification with randomly generated training datasets was proposed by Leo Breiman in 1994; bagging is also referred to as bootstrap aggregation.

We used 10-fold cross validation to measure the performance of sentiment classification. It has two subprocesses: a training subprocess and a testing subprocess. We considered the movie review dataset and the multidomain dataset D of N documents. In the training subprocess, each iteration (i = 1, 2, ..., k) draws a training dataset Di of N document samples with replacement, so some of the original dataset may not be included in Di. This method generates a set of classifier models M1, M2, ..., Mk. The bagging method separates the training dataset into several new training datasets by random sampling. The training subprocess is used for training a model, and the trained model is applied in the testing phase. For cross validation, in the first iteration, folds 2 through 10 jointly serve as the training set in order to obtain the first model, which is tested on fold 1; in the second iteration, the model is trained on the remaining folds and tested on fold 2; and so on. During the testing subprocess, the performance of the model is generated.

The bagged classifier counts the votes and assigns the class with the most votes to each review in the testing dataset.

The bagging algorithm is as follows.
Input is as follows:
(i) The review dataset D, a set of training reviews.
(ii) k, the number of models in the classifier.
(iii) A classifier used as the learning scheme (here we used the SVM machine learning classifier).
Output is as follows:
(i) A composite model M*.
Method is as follows:
(i) Training subprocess:
For i = 1 to k do // create k models
  Create a bootstrap sample Di by sampling the original review dataset D with replacement; use Di to derive a model Mi.
End for
(ii) Testing subprocess:
If classification, then let each of the k models classify the testing dataset and return the majority vote.
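The algorithm above maps directly onto scikit-learn's BaggingClassifier; the sketch below uses k = 10 bootstrap models as an illustrative setting (the text leaves k open), with SVM as the base classifier and the 10-fold cross validation used throughout the experiments.

```python
# Bagged SVM: k bootstrap samples, one SVM per sample, majority vote.
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

bagged_svm = BaggingClassifier(estimator=SVC(kernel="linear"),
                               n_estimators=10,    # k models (assumed k = 10)
                               bootstrap=True)     # sample with replacement
scores = cross_val_score(bagged_svm, X, y, cv=10)  # 10-fold cross validation
print("bagged SVM accuracy: %.4f" % scores.mean())
```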

5.5. Bayesian Boosting

The Bayesian boosting algorithm is an iterative machine learning algorithm based on Bayes' theorem that is used to improve performance accuracy. This method builds an ensemble of classifiers for the product review attributes. At each iteration the training dataset is reweighted; the Naïve Bayes algorithm is applied several times, and all the models are combined into a single model. In this process, the number of iterations is set to 10 [17, 31].
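Scikit-learn has no Bayesian boosting operator; as a rough stand-in, the iterative reweighting described above can be approximated with AdaBoost over a Naïve Bayes base learner, which is an assumption of this sketch rather than the paper's exact method.

```python
# Stand-in for Bayesian boosting: AdaBoost reweights the training set at
# each of 10 iterations and combines the Naive Bayes models into one.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB

boosted_nb = AdaBoostClassifier(estimator=MultinomialNB(),
                                n_estimators=10)  # 10 iterations, as in the text
boosted_nb.fit(X_train, y_train)
```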

6. Performance Evaluations

In this study, we use the movie review dataset and the multidomain dataset, and the evaluation involves splitting the available dataset into a training set and a testing set. We use a genetic algorithm that incorporates the hybrid model to improve the performance of feature selection. Generally, we apply the NB, LR, and SVM algorithms to the training set and evaluate the resulting model on the test set.

The cross validation method involves partitioning the dataset randomly into 10 folds. We use one partition as the testing set and the remaining partitions form the training set; we repeat this process 10 times, using each of the partitions in turn as the testing dataset and the remaining partitions as the training set. In this work, the evaluation measures accuracy, overall error rate, type I error, type II error, sensitivity, and specificity are used to test the effectiveness of opinion mining. Once we have selected an algorithm and an evaluation methodology, we need to select a performance metric. For two-class problems, a test case will be either positive or negative. This yields four quantities that we can compute by applying a model to a set of test cases, as shown in Table 6.

For a set of test cases, let TP be the number of times the model predicted positive when the review's label is positive, let FN be the number of times the model predicted negative when the review's label is positive, let FP be the number of times the model predicted positive when the review's label is negative, and let TN be the number of times the model predicted negative when the review's label is negative. Given these counts, we can define a variety of common performance metrics. The accuracy or recognition rate of a classifier on a test review set is the percentage of the test dataset that is correctly classified by the classifier, given by

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

The overall error rate or misclassification rate refers to the number of wrongly classified reviews divided by the total number of sample reviews. A type I error refers to negative sample reviews that were wrongly classified as positive reviews. A type II error refers to positive sample reviews that were wrongly classified as negative reviews:
Overall error rate (%) = (FP + FN)/total number of samples.
Type I error rate (%) = FP/total number of negative samples.
Type II error rate (%) = FN/total number of positive samples.
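These measures follow directly from the confusion matrix; a small sketch, assuming y_test and pred come from one of the trained models above:

```python
# Evaluation measures of Section 6 computed from the confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
accuracy   = (tp + tn) / (tp + tn + fp + fn)
error_rate = (fp + fn) / (tp + tn + fp + fn)
type1_rate = fp / (fp + tn)   # negatives wrongly classified as positive
type2_rate = fn / (fn + tp)   # positives wrongly classified as negative
```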

7. Results and Discussion

To evaluate our model, we used the Cornell movie review dataset and the multidomain dataset, which are frequently used in sentiment classification. Each domain of the multidomain dataset contains 1000 positive and 1000 negative documents, and the movie review dataset contains 1000 positive reviews and 1000 negative reviews. Classification is a challenging task because reviewers use many comparisons and sometimes use unclear language. The performance results are shown in Tables 3, 4, 5, 7, and 8.

7.1. Comparison of Classifiers

Many classification algorithms are available for sentiment classification, such as SVM, NB, KNN, maximum entropy, and decision trees. In this study, we used three classification algorithms, NB, LR, and SVM, and ensemble methods along with IG and the optimized feature reduction method; among all these methods, bagged SVM is shown to perform better. The performance of each classifier is shown in Tables 3–5, 7, and 8. The best accuracy value compared to the baseline accuracy is shown with an up arrow.

7.2. Performance of Individual Classifier

In this study, we use accuracy and overall error rate to evaluate our proposed approach on the movie review dataset and the multidomain dataset. Information gain feature selection is used to reduce the feature vector space, TF-IDF feature weighting schemes are utilized, and the top k% of attributes with the highest weights are selected for training the classifier, where the value is set to 0.7. All the experiments were validated using 10-fold cross validation. Tables 3–5, 7, and 8 show the experimental results when using each classifier together with the genetic algorithm, an optimized feature reduction technique. Information gain is selected as the feature reduction method because it outperforms other feature reduction methods [23]. The classification results of NB show that its accuracy is comparatively lower than all other individual classifiers and the hybrid model, and its overall error rate is higher than all other results; NB is not an efficient algorithm on unigram and joint features. The reason for the higher error rate is the assumption that all features are independent. The type I error of NB is higher than that of the other classifiers, which shows that this model incorrectly classifies negative reviews as positive reviews for unigram and joint features. Its type II error is lower than that of LR but higher than the other classifiers.

The classification results of LR show that its accuracy is higher than the NB model but lower than the other classifiers. Its type I error rate is comparatively lower than all other classifiers, which indicates that negative reviews were rarely misclassified as positive, but its type II error rates are higher than all other classifiers, which indicates that positive reviews were often incorrectly classified as negative reviews. Tables 9 and 10 show the results of type I error and type II error in percentages. The classification results obtained from book reviews are given in Table 3. In Table 3, the accuracy results of bagged SVM are comparatively higher than the other classifiers, and its overall misclassification rate is comparatively lower than the other four classifiers. This indicates that the bagged SVM model predicts positive reviews more accurately than negative reviews for the unigram feature and the joint feature.

The performance results are compared, and the bagged SVM classifier result is much better than the other classifiers. The bagged SVM achieves the best accuracy value of 83.21% for book reviews using the joint feature. The hybrid models achieve best accuracy values of 85.39% for DVD reviews, 87.00% for electronics reviews, 81.58% for kitchen reviews (with bagged SVM), and 89.50% for movie reviews using unigrams. We give the results of five classifiers (NB, LR, SVM, bagged SVM, and Bayesian NB) and observe the performance of the different classifiers.

7.3. Statistical Significance Test

We applied McNemar's statistical test to compare the performance of the classifiers [31–33]. The comparison of the statistical test results shows that the bagging method performs better than the other classifiers. Table 11 summarizes the four counts used by the test.

n00 denotes the count of the number of times that both classifiers failed; that is, both classifiers predict positive reviews as negative reviews and vice versa. n10 denotes the count of the number of times that classifier A succeeded but classifier B failed; that is, classifier A predicts positive reviews as positive reviews and negative reviews as negative reviews, but classifier B predicts positive reviews as negative reviews and vice versa. n01 denotes the count of the number of times that classifier B succeeded but classifier A failed; that is, classifier B predicts positive reviews as positive reviews and negative reviews as negative reviews, but classifier A predicts positive reviews as negative reviews and vice versa. n11 denotes the count of the number of times that both classifiers succeeded; that is, both classifiers predict positive reviews as positive reviews and negative reviews as negative reviews.

The null hypothesis and the alternative hypothesis are  H0: both classifiers perform similarly;  H1: one of the classifiers performs differently.

The McNemar test statistic is

$$\chi^2 = \frac{\left(\lvert n_{01} - n_{10} \rvert - 1\right)^2}{n_{01} + n_{10}}.$$

When the χ² value is zero, the two classifiers perform similarly; as the value increases, one of the classifiers performs differently. Based on this value, we accept or reject H0. We decide which classifier performed better based on the n01 and n10 values of the two classifiers: if n10 is smaller than n01, classifier B is said to perform better than classifier A.
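A small sketch of the test computation, assuming boolean per-review correctness vectors for the two classifiers being compared:

```python
# McNemar's test from the paired correctness of two classifiers.
import numpy as np
from scipy.stats import chi2

def mcnemar(correct_a, correct_b, alpha=0.05):
    n01 = int(np.sum(~correct_a & correct_b))  # A failed, B succeeded
    n10 = int(np.sum(correct_a & ~correct_b))  # A succeeded, B failed
    if n01 + n10 == 0:
        return 0.0, 1.0, False                 # identical behavior
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)  # continuity corrected
    p = chi2.sf(stat, df=1)
    return stat, p, p < alpha                  # reject H0 at the 5% level
```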

In Tables 12–16, one symbol denotes that classifier A performed better than classifier B because the n01 value is smaller than the n10 value, and a second symbol denotes that classifier B performed better than classifier A because the n10 value is smaller than n01.

Looking at McNemar's test results for the multidomain reviews and movie reviews (see Tables 12–16), it can be observed that BSVM produced significantly better results than NB, LR, SVM, and BNB; H1 is accepted at the 5% level of significance (right-tailed test). The SVM and LR classifiers performed better than the NB classifier. The BNB classifier performed better than the NB, LR, and SVM classifiers. For DVD, electronics, and movie reviews, the LR classifier works better than SVM. For book reviews and movie reviews, the BNB classifier works better than BSVM.

To compare the effectiveness of the ensemble technique and the other classifiers for opinion mining, the performance of SVM was considered as the baseline. The improvement of a model over the baseline was calculated as

$$\text{improvement (\%)} = \frac{\text{performance of model} - \text{performance of SVM}}{\text{performance of SVM}} \times 100.$$

As shown in Figure 2, the hybrid model gives the best result using SVM as the base classifier. A positive value indicates that a hybrid model has an increased average accuracy with respect to SVM. As shown in Figures 3–5, a negative value indicates that a hybrid model has a decreased error rate with respect to the baseline SVM model.

The confusion matrices of the sentiment classification of the two-class multidomain dataset and movie reviews using genetic bagging are tabulated in Table 17, and the accuracy is estimated by means of these confusion matrices. The accuracy and the overall error rate are calculated for genetic bagging with different attribute weight values for the two-class review dataset; in this sentiment classification exercise, feature weight relations greater than 0.100, 0.200, 0.300, 0.400, and 0.500 are used. Here, the number of individuals in the population pool for the GA is 50, and the accuracy value converges within 150 generations, so the maximum number of generations is set to 150. The multidomain review classification became more accurate as the attribute weight value increased. The proposed hybrid method, using information gain and a genetic-algorithm-incorporated bagging technique with SVM as the base classifier, performs better in terms of accuracy.

To compare the effectiveness of the genetic bagging technique for opinion mining, the performance of the genetic algorithm without bagging was considered as the baseline. As shown in Figure 6, the hybrid model gives the best result using SVM as the base classifier; a positive value indicates that the hybrid model has an increased average accuracy with respect to GA.

7.4. Threats to Validity

In this work, we used only bag-of-words features, information gain, and an optimized feature reduction that incorporates ensemble methods of machine learning approaches to improve classification performance. The study considers only positive and negative reviews and does not consider neutral reviews for sentiment classification. The multidomain and movie review datasets are imbalanced; larger datasets should be considered to further validate the results of the study. In future work, attribute construction based on other feature reduction methods should be considered.

8. Conclusions

The main aim of the study is to improve the performance of opinion mining. We proposed an optimized feature reduction that incorporates ensemble methods of machine learning approaches and uses information gain and a genetic algorithm as feature reduction techniques to improve classification performance. The results show that feature selection based on the genetic algorithm along with an ensemble approach outperformed the other approaches. We conducted comparative experiments on the multidomain dataset and the movie review dataset in opinion mining. The effectiveness of the single classifiers Naïve Bayes, logistic regression, and support vector machine and of the ensemble techniques for opinion mining is compared on five datasets. The proposed hybrid method is evaluated, and the experimental results using information gain and the genetic algorithm with the ensemble technique perform better in terms of various measures for movie, book, DVD, electronics, and kitchen reviews. The five classification algorithms are evaluated using McNemar's test to compare the level of significance of the classifiers. A direction for future work is to study the performance of feature selection methods on different machine learning classifiers and to evaluate the model for sentiment analysis with reviews from other domains.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.