Abstract

Feature selection (FS) is a fundamental task for text classification problems. Text feature selection aims to represent documents using the most relevant features. This process can reduce the size of datasets and improve the performance of the machine learning algorithms. Many researchers have focused on elaborating efficient FS techniques. However, most of the proposed approaches are evaluated for small datasets and validated using single machines. As textual data dimensionality becomes higher, traditional FS methods must be improved and parallelized to handle textual big data. This paper proposes a distributed approach for feature selection based on mutual information (MI) method, which is widely applied in pattern recognition and machine learning. A drawback of MI is that it ignores the frequency of the terms during the selection of features. The proposal introduces a distributed FS method, namely, Maximum Term Frequency-Mutual Information (MTF-MI), based on term frequency and mutual information techniques to improve the quality of the selected features. The proposed approach is implemented on Hadoop using the MapReduce programming model. The effectiveness of MTF-MI is demonstrated through several text classification experiments using the multinomial Naïve Bayes classifier on three datasets. Through a series of tests, the results reveal that the proposed MTF-MI method improves the classification results compared with four state-of-the-art methods in terms of macro-F1 and micro-F1 measures.

1. Introduction

Feature selection (FS) plays a key role in data mining [1], especially in text classification task that suffers from large dimensionality [2] in many application domains such as sentiment analysis [3], emotion identification [4, 5], and spam filtering [6]. Feature selection aims to select relevant and informative features (words) from large datasets [7]. Therefore, FS can reduce space dimensionality, decrease the running time in the classification process, and improve the efficiency of machine learning algorithms [8]. For this aim, FS is considered as a critical technique because it directly affects the accuracy of classification.

The FS methods can be divided into two major categories, namely, filter and wrapper methods [9]. Filter approach methods perform a statistical analysis of the feature space to select a distinguishing subset of features. Wrapper methods employ a search strategy to determine the goodness of a feature subset by providing it to the classifier and evaluating the performance. These two steps are repeated until reaching a suitable quality feature subset for a specific classifier. Wrapper methods primarily achieve better classification results than filter methods; however, they have a very high computational complexity [10] and are only efficient when the number of features is relatively small [11]. In contrast, the filter methods are efficient, scalable, and independent of any classifier interaction during the construction of the feature set. The need for classifier interaction may increase the execution time and make the FS method valuable only to a specific learning algorithm. Thus, filter methods are more suitable for large datasets [12].

Moreover, although most available FS methods for text classification are filter-based, these methods do not work when the datasets are large because they are based on the serial programming model. More precisely, classical FS algorithms need to read data into memory for analysis, but a limited memory cannot deal with the storage and processing of large datasets. Thus, FS methods are needed for distributed environments, such as Hadoop, a powerful tool for distributed storage and distributed processing of large datasets [13]. Figure 1 presents a general overview of the distributed process of the filter FS approach for text classification.

Motivated by the above challenges, we introduce a parallel filter-based FS method for textual big data implemented on Hadoop. To this end, the proposed method focuses on the reduction of features using the term frequency (TF) [1] and mutual information (MI) techniques [14]. The MI technique is one of the most used filter FS techniques. However, the drawback of MI is that it chooses terms with high document frequency (DF) and low TF for features, which amplifies the importance of the low-frequency terms. Therefore, terms with low DF and high TF are not selected, which decreases the classifier performance because these terms are discriminative in classification.(1)Documents are labeled and loaded into the Hadoop framework.(2)An algorithm is introduced to calculate the TF values of features. Then, the average and maximum values of the TF for each feature are estimated based on the category under the Hadoop framework.(3)An algorithm is proposed to calculate the MI value to evaluate the relationship between features and categories under the Hadoop framework.

In this paper, we present a hybrid distributed FS approach using the MapReduce paradigm to improve classification of textual big data. The proposal aims to select features with both high frequency and high feature-category dependency. Besides its independence from the classifiers, the proposed method is scalable and efficient for textual big data. The performance of the proposed method was compared with several state-of-the-art methods using three datasets, 20-Newsgroups, Reuters-21578, and WebKB, using multinomial NB as a classifier. According to the reported results, we can show that the proposal is outperforming standard methods.

The remainder of this paper is structured as follows. Section 2 introduces a brief literature review highlighting related work. In Section 3, the technical background in this work is discussed. The proposed method is explained in Section 4. Section 5 describes the experimental results, including the datasets, classifier, and performance measures used in the experiments. Finally, Section 6 presents the conclusion and future work.

This work is focused on MI and parallel FS methods. Therefore, in the following context, we briefly present some related works on these two aspects.

Hadoop is the most used open-source MapReduce software to handle big data [15]. In [16], the authors presented a parallel FS method using MapReduce for text classification. Moreover, MI based on Renyi entropy was used to measure the correlation between features and classes. Then, the maximum MI theory was used to generate the most distinguishing feature subset. In [17], the authors investigated the design and scalability of an MI-based algorithm, which is the minimum redundancy maximum relevance algorithm in MapReduce, and examined its performance in dense and sparse data.

In [18], the authors proposed a high-dimensional FS algorithm based on a variance study. The algorithm selects features by estimating their capacities to justify data variance. In [19], the authors explored a parallel FS method based on MI. However, the mentioned method is only applied to process discrete variables. In [20], the authors implemented a set of FS techniques based on a statistical test. All methods were parallelized using MapReduce on the Hadoop platform, and each feature was estimated independently. In [21], the authors introduced a MapReduce approach to derive a subset of features from large datasets. The proposed method was evaluated using classifiers, such as Support Vector Machine, Naïve Bayes, and Logistic Regression. The measurements revealed that the spark implemented framework was useful to perform evolutionary FS on massive datasets with improved classification precision and execution time.

In [22], the authors proposed a parallel FS algorithm, namely, the parallel forward-backward with pruning algorithm, for large datasets. The experimental study established increased scalability with running time. In [23], the authors proposed using MI to reduce dimensionality and improve accuracy for online streams. The proposed study focused on presenting a methodology to address the computational cost, the stability of the generated results, and the size of the final subset of selected features. In [24], the authors introduced a hybrid FS algorithm for a gene dataset by combining the MI maximization and adaptive genetic algorithm (MIMAGA) to improve the competence of the MIMAGA algorithm. In [25], the authors proposed an evaluation of the MI-based FS methods.

In [26], the authors considered MI-based FS to increase the searching ability of the relevant subset of features. Based on MI, many studies have recently focused on maximizing the relevance of variables while minimizing variable redundancy to improve the quality of the selected features and reduce the space dimensionality [2729].

In most of the works on FS, researchers have worked on binary classification rather than textual datasets. Selecting the most relevant features from a large volume of data has become the most significant challenge in many applications, especially in text classification [30]. As the amount of the data continues to grow, conventional algorithms cannot adapt in terms of memory requirements, execution time, and efficiency of the results. Thus, to address these large-dimensional problems, this work proposes selecting characteristics for text classification using the multicluster environment of Hadoop.

3. Technical Background

This section presents some basic concepts associated with the proposed FS approach, MTF-MI, and the parallelization technology used in our implementation (MapReduce).

3.1. Representation Phase

In this section, we denote as the set of categories. Broadly, the documents from dataset are represented using word vectors. This representation is generated by the vector space model that uses the bag-of-words approach [31]. Thus, a text document of a category is represented by a vector of features in this document. The document is denoted by vector , where is the number of terms in document .

3.2. Mutual Information (MI)

The MI is an essential concept in information theory. It is used to measure the degree of correlation between two random events [32]. In FS, MI is often used to represent the relationship between a feature and category. The MI between a feature and a category is defined as follows:

The approximate formula is the following:where A is the number of documents in containing , B is the number of documents not in containing , C is the number of documents in not containing , and is the total number of training documents.

Because MI does not consider the frequency of features in a text document, if two features appear in a document, their MI value is the same regardless of how often they occur. Thus, it is also necessary to consider the feature frequency in each document of the training dataset.

3.3. Hadoop Parallel Distributed Architecture

Faced with the continuous growth of data, traditional data analysis systems cannot store and process such a large volume of data. Thus, the best solution to manage the abundant data is to store it in the Hadoop distributed file system (HDFS). Due to its fault tolerance mechanism, the HDFS allows Hadoop to operate reliably and very efficiently. The HDFS can be viewed just like a regular file system; the only difference is that it handles larger datasets. This system splits data into 64 MB blocks by default, making it more efficient for large datasets. The data in HDFS are stored in two forms: the actual data and its metadata, such as file location and file size. Application data are stored in the data nodes of the HDFS, and the metadata are stored in the name node. The architecture of the parallel HDFS is illustrated in Figure 2.

The HDFS is the storage unit of Hadoop, and it follows the master-slave architecture. The master node includes three elements: the job tracker, name node, and secondary name node, whereas the slave node includes the task tracker and data node. The name node in the parallel HDFS architecture interacts with different data nodes residing in the slave nodes, whereas the job tracker in the master node organizes the task trackers on the slave nodes.

3.4. MapReduce

MapReduce is a programming model used in a distributed and parallel environment for processing large datasets [33]. The data processing in MapReduce is based on input data distribution; several tasks across many nodes execute the distributed data. A MapReduce program is divided into two main phases, map and reduce, and is executed in three steps: map, shuffle, and reduce. Figure 3 depicts the architecture of MapReduce. In the map step, input data are partitioned among nodes, and each partition of data is given as an input to a job that performs the map function. Each job processes the data and outputs key-value pairs. In the shuffle step, key-value pairs are grouped by key, and each group is sent to the reducer. The map and reduce functions are defined depending on the purpose of the application. The input and output of these functions are based on the key-value scheme. Thus, the MapReduce model allows the user to focus on the application without concern about issues, such as the program execution process on the distributed nodes, memory management, and fault tolerance. Apache Hadoop is a widely used open-source implementation of the MapReduce model.

4. Proposed Method

To deal with the problems described above, we introduced an improved MI FS approach called MTF-MI. This method introduces TF and term distribution to the classical MI method. The entire process of the proposed approach is described in Figure 4.

After the preprocessing step, including removing stop words, tokenization, and stemming, the documents are loaded into the HDFS and are represented as described in Section 3. To redress the drawback of the traditional MI method, the TF is introduced first. represents the TF of a term in a document . Hence, the average term frequency and the maximum term frequency for a specific category can be calculated as follows:where is the number of documents belonging to category . As the MI method is based on DF, according to its classical formula, if a term occurs many times in a document of a particular category when this type of document is rare, this term is not considered discriminative. Therefore, in this work, the TF is introduced in the MI formula. The term distribution is used to select more discriminative features. For a particular category, a feature has more discriminating power if it is regularly distributed. For this, the sample variance is used to evaluate the difference in term distributions. Sample variance is a commonly used statistics metric that measures the dispersion degree of a dataset. The sample variance is given as follows:

The variable denotes a very small real number. Finally, we introduce our approach based on the TF and term distribution to evaluate the feature in category as follows:

In the proposed method, to select terms with high discriminability power, as the TF is high and the DF is relatively low, we use the maximum TF , instead of the average. Based on the basic theory of MI, the greater the value of MI is, the more category information the feature has. Hence, the formula is defined as follows:where is the total number of categories in the dataset.

Finally, the proposed method is implemented using MapReduce. The parallel implementation of the overall MTF-MI includes three process stages: job 1, job 2, and job 3.

Job 1 is achieved using Algorithm 1. Job 1 reads the incoming batch of training data and calculates the number of documents in each category. The results are used in job 2, which is achieved using Algorithm 2. For each term belonging to category , the total and maximum TF of are calculated and stored in and , respectively. Then, using the value of , the average is calculated. Next, the difference in the term distributions is calculated for each term in category . Finally, the proposed approach calculates the value of term in category . Job 3 is achieved using Algorithm 3. Job 3 takes the values emitted by job 2 and assigns each term to the category with the maximum score. Then, all features are sorted in descending order, and the terms whose values are maximal are selected as the relevant features.

(i)Map
(ii)Input:
(iii)key: document name
(iv)value: document content
(v)
(vi)EndMap
(vii)Reduce
(viii)Input:
(ix)key:
(x)values:
(xi) //total number of documents in the category
(xii)for each value in values do
(xiii)
(xiv)
(xv)EndReduce.
(i)Map
(ii)Input:
(iii)key: Offset
(iv)value: line of document
(v)
(vi)EndMap
(vii)Reduce
(viii)Input:
(ix)key:
(x)values:
(xi)for each value in values do
(xii)
(xiii)
(xiv) for each value in values do
(xv)
(xvi)-
(xvii)
(xviii)EndReduce.
(i)Map
(ii)Input:
(iii)key:
(iv)value: MTF-MI ()
(v)
(vi)EndMap
(vii)Reduce
(viii)Input:
(ix)key:
(x)values:
(xi)
(xii)
(xiii)EndReduce.

5. Experiments

The multinomial NB classifier [34] is used on three different datasets with different characteristics to validate the performance of the proposed MTF-MI. Two different commonly used measures, micro-F1 and macro-F1, were applied to observe the effectiveness of the MTF-MI method. The datasets and evaluation measures are briefly described in the following sections, and the experimental results are also presented.

5.1. Datasets

We achieved experiments with the Reuters-21578 [35], 20-Newsgroups [36], and WebKB [37] datasets. The Reuters-21578 dataset contains the news that appeared on the Reuters newswire in 1987 and belong to one out of eight possible categories. The 20-Newsgroups dataset contains around 20,000 documents that are taken from the Usenet newsgroup collection, and all documents were assigned uniformly to 20 different categories. The WebKB dataset is a subset of web documents, which contains 877 webpages from the computer science departments of four universities.

5.2. Naïve Bayes Classifier

The Naïve Bayes (NB) classifier is a simple probabilistic algorithm based on the Bayes theorem with a strong independence assumption [38]. The NB model is based on simplifying conditional independence assumption, which consists of, given a category, the words which are conditionally independent of each other. This assumption does not affect the accuracy of text classification and makes fast classification algorithms applicable to the problem. For this, NB is widely used in classification problems in real-world applications.

5.3. Performance Measures

In this study, two commonly used measures are employed, which are the macro-F1 and the micro-F1 [39]. In macro-F1, F-measure is calculated for each category within the dataset and then the average over all classes is obtained. Hence, the same weight is assigned to each category without regarding the class frequency. Macro-F1 can be formulated as follows:where couple of corresponds to precision and recall values of class , respectively.

However, in micro-F1, the F-measure is computed globally without class discrimination. In this way, all classification decisions in the whole dataset are considered. If the classes in a dataset are biased, large classes dominate small ones in microaveraging. Micro-F1 can be formulated as follows:where pair represents the precision and recall values, respectively, over all the classification decisions within the whole dataset, rather than the individual classes.

5.4. Results and Discussion

The macro-F1 and micro-F1 performances of MTF-MI are compared to four widely used feature selection techniques using Naïve Bayes classifier applied on three datasets (20 Newsgroups, Reuters-21578, and WebKB). The four feature selection techniques used for comparison are the classical MI, Chi-square (CHI), Term Frequency-Inverse Document Frequency (TF-IDF), and Information Gain (IG).

Figures 57 show the classification performance of the different feature selection methods for the three datasets. In all figures, the horizontal and vertical axes present the number of selected features and the corresponding classification performance, respectively.

Figure 5 presents the F1 classification performance based on 5 term weighting methods using NB classifier with different feature dimensionalities. It is noticeable that the proposed approach outperforms all other standard methods in terms of micro-F1 and macro-F1. Figure 5 shows that macro-F1 and micro-F1 of MTF-MI are close to those of CHI when 500 and 1000 features are selected. It is noticeable that the IG and MI techniques showed the lowest performance. The micro-F1 results are noticed to be high (more than 83%) using 3500 features, and the highest classification F1 value (82%) is achieved by the MTF-MI method. Moreover, it is noticeable that proposed method performs well for less than 1500 features as its F1 values range between 54% and 64%, while the performances of other methods were very weak on the same range of features. Although the categorical documents distribution in the Reuters-21578 dataset is highly skewed, the results show that NB classifier performs better on the representation of the proposed method MTF-MI. In the Reuters-21578 dataset, the boundaries between categories are apparent. Therefore, good classification performance can be achieved with a small number of features (3500). However, when the number of selected features increases, the classification performance decreases.

Figure 6 depicts the NB classification performance on the 20-Newsgroups dataset in terms of F1 measure, where it can be seen that the trend of the micro-F1 and macro-F1 performance is similar to that in Figure 5. Similar to the results of Reuters-21578 dataset, the proposed method outperforms other standard methods in micro-F1 and macro-F1. For instance, the best three micro-F1 and macro-F1 values (90%, 91%, and 92%) are reached by the MTF-MI method based on 4000 features. In contrast to the results in Figure 5, the performance of CHI method, as seen in Figure 6, is not competing with the performance of the proposed MTF-MI for features up to 3000. For example, the micro-F1 and macro-F1 values (66% and 65%, respectively) are reached by the CHI method on 3000 features, which are still less than the corresponding values of the proposed method (87% and 86%). Finally, the documents in 20-Newsgroups are almost uniformly distributed; therefore, the micro-F1 and macro-F1 performances of different schemes are noticed to be quite similar. In addition, the measure values increase as the feature number increases, which could be due to the similarity of some categories in the 20-Newsgroups dataset. Therefore, some terms are commonly present in more than one category, so when the number of selected features increases, it provides a better distinction between categories.

Figure 7 shows micro-F1 and macro-F1 classification performance on the WebKB dataset using NB classifier. Generally, the results in Figure 7 are similar to those in Figure 5 for standard weighting techniques, as the boundaries between categories are apparent. The proposed MTF-MI method outperforms other techniques in terms of micro-F1 and macro-F1, where the maximum micro-F1 value (86%) is achieved by MTF-MI on 3500 features. Moreover, similar to the results on Reuters-21578 and 20-Newsgroups datasets, the proposed MTF-MI has outperformed other methods with noticeable performance differences.

It can be concluded that the proposed method MTF-MI performs the highest on different corpora, which indicates that the proposed approach is effective in selecting the features and representing the data as well as their generality. Based on the experimental results on different datasets, the performance of the proposed method is more effective for the three datasets, which means that the maximum term frequency factor introduced to the classical MI plays a big role to reach high performance. Therefore, it can be concluded that the proposed MTF-MI method is more effective than the classical state-of-the-art method.

6. Conclusion

This paper introduces MTF-MI, a distributed feature selection approach designed upon the MapReduce programming model. The proposed approach, based on mutual information method, has been implemented using Apache Hadoop, and it has been applied over three different large datasets. The performance of resulting classification models generated by MTF-MI has been systematically evaluated using Naïve Bayes classifier, implemented in Hadoop framework, over a cluster of five computers. The experimental study has proved that MTF-MI efficiently improves the selection of the relevant features while discarding the selection of irrelevant ones. The proposed approach is the best in average of F-measure compared to four state-of-the-art methods, namely, CHI, TF-IDF, MI, and IG. However, this method becomes less performed for a given threshold of selected features. Although the results vary within the datasets, the general insights provided here help highlight the importance of the combination of the feature selection techniques with the distributed aspect that is added through Hadoop framework usage for the prediction tasks on large textual datasets.

As part of this work, we have also compared the proposed approach with a sequential version of MTF-MI implemented on a single machine using java. Our results showed that the sequential version is unable to handle large datasets due to memory requirements. Meanwhile, our version is fully scalable and yields better memory usage when dealing with very large datasets. Despite the multiple advantages of parallelism, it can be hazardous if not used appropriately. When large and complex datasets are used, overparallelism can cause the distribution to ignore certain meaningful relationships between features, which can negatively affect the accuracy of the results.

Data Availability

20-Newsgroups dataset is from https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. Reuters-21578 dataset is from https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection. WebKB dataset is from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz. The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.