Abstract

Text preprocessing is one of the key problems in pattern recognition and plays an important role in the process of text classification. Text preprocessing has two pivotal steps: feature selection and feature weighting. The preprocessing results can directly affect the classifiers’ accuracy and performance. Therefore, choosing the appropriate algorithm for feature selection and feature weighting to preprocess the document can greatly improve the performance of classifiers. According to the Gini Index theory, this paper proposes an Improved Gini Index algorithm. This algorithm constructs a new feature selection and feature weighting function. The experimental results show that this algorithm can improve the classifiers’ performance effectively. At the same time, this algorithm is applied to a sensitive information identification system and has achieved a good result. The algorithm’s precision and recall are higher than those of traditional ones. It can identify sensitive information on the Internet effectively.

1. Introduction

The information in the real world is always in disorder. Usually, we need to classify the disordered information for our cognition and learning. In the field of information processing, a text is the most basic form of expressing information, such as news, websites, and online chat messages. Therefore, text classification is becoming an important task. Text classification (TC) [1] is the problem of automatically assigning predefined categories to free text documents. The Vector Space Model (VSM) [2, 3] is widely used to express text information. That is, a text can be expressed as $D = \{(t_1, w_1), (t_2, w_2), \ldots, (t_n, w_n)\}$, where $t_1, t_2, \ldots, t_n$ are features of the text and $w_1, w_2, \ldots, w_n$ are the weights of the features. Even a moderate-sized text collection contains tens or hundreds of thousands of feature terms. This is prohibitively high for many machine learning algorithms, and only a few neural network algorithms can handle such a large number of input features. However, many of these features are redundant, which degrades the precision of TC algorithms. Hence, the central difficulty in TC research is how to reduce the dimensionality of the feature space.

Feature selection (FS) [4, 5] is to find the minimum feature subset that can represent the original text; that is, FS can be used to reduce the redundancy of features, improve the comprehensibility of models, and identify the hidden structures in a high-dimensional feature space. FS is the key step in the process of TC, since it prepares the input for the classification algorithms, and the quality of FS has a direct impact on the performance of classification. The common methods of FS are based on machine learning algorithms [6], such as Information Gain, Expected Cross Entropy, Mutual Information, Odds Ratio, and CHI. Each algorithm has its own advantages and disadvantages. On English data sets, Yang and Pedersen’s experiments showed that the Information Gain, Expected Cross Entropy, and CHI algorithms achieve better performance [7]. On Chinese data sets, Liqing et al.’s experiments showed that the CHI and Information Gain algorithms achieve the best performance and the Mutual Information algorithm is the worst [8]. Many researchers apply other theories or algorithms to FS; references [9, 10] have adopted the Gini Index for FS and achieved good performance. The Gini Index is a measurement that evaluates the impurity of a set and thus describes the importance of features in classification. It is widely used for splitting attributes in decision tree algorithms [11]. That is to say, the Gini Index can identify the importance of features.

On the other hand, feature weighting (FW) is another important aspect of text preprocessing. TF-IDF [12] is a popular FW algorithm. However, TF-IDF is not entirely suitable for text FW in classification. The basic idea of TF-IDF is that a feature term is more important if it has a higher frequency in a text, known as Term Frequency (TF), and a feature term is less important if it appears in many different text documents of a training set, known as Inverse Document Frequency (IDF). In TC, feature frequency is one of the most important aspects regardless of whether the feature appears in one or more texts. Many researchers have improved the TF-IDF algorithm by replacing the IDF part with FS algorithms, such as TF-IG and TF-CHI [13–15]. This paper mainly focuses on an Improved Gini Index algorithm and its application and proposes an FW algorithm that uses the Improved Gini Index instead of the IDF part of the TF-IDF algorithm, known as the TF-Gini algorithm. The experimental results show that TF-Gini is a promising algorithm. To test the performance of the Improved Gini Index and TF-Gini algorithms, we introduce kNN [16, 17], fkNN [18], and SVM [19, 20] as the benchmark TC algorithms.

As an application of the TF-Gini algorithm, this paper designs and implements a sensitive information monitoring system. Sensitive information on the Internet is a phenomenon that information experts are eager to investigate. Since the Internet is developing very fast, many new words and phrases emerge on it every day, and these new words and phrases pose challenges to sensitive information monitoring. At present, most network information monitoring and filtering algorithms are inflexible and inaccurate. Based on the previous research results, we use TF-Gini as the core algorithm to design and implement a system for sensitive information monitoring. This system can update sensitive words and phrases in time and has better performance in monitoring Internet information.

2. Text Classification

2.1. Text Classification Process

The text classification process is shown in Figure 1. The left part is the training process. The right part is the classification process. The function of each part can be described as follows.

2.1.1. The Training Process

The purpose of the training process is to build a classification model. New samples are assigned to the appropriate category by using this model.

(1) Words Segmentation. In English and other Western languages there is a space delimiter between words, whereas in Oriental languages there are no space delimiters between words. Chinese word segmentation is therefore the basis for all Chinese text processing. There are many Chinese word segmentation algorithms [21, 22]; most of them use machine learning. ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), designed by the Institute of Computing Technology, Chinese Academy of Sciences, is one of the most widely used Chinese lexical analyzers. In this paper, we use ICTCLAS for Chinese word segmentation.

(2) Stop Words or Stemming. A text document always contains hundreds or thousands of words. Many of these words appear with very high frequency but carry little information, such as the articles “the” and “a,” other function words, and adverbs. These words are called Stop Words [23]. Stop Words are language-specific functional words, that is, frequent words that carry no information. The first step of text processing is to remove these Stop Words.

Stemming techniques are used to find the root or stem of a word. Stemming converts words to their stems and incorporates a great deal of language-dependent linguistic knowledge. The hypothesis behind stemming is that words with the same stem or root mostly describe the same or closely related concepts in the document, so such words can be conflated by using their stems. For example, the words “user,” “users,” “used,” and “using” can all be stemmed to “USE.” In Chinese text classification, we only need Stop Words removal because there is no concept of stemming in Chinese.

(3) VSM (Vector Space Model). VSM is very popular in natural language processing, especially for computing the similarity between documents. VSM maps a document to a vector. For example, if we view each feature word as a dimension and the word frequency as its value, a document can be represented as an $n$-dimensional vector, that is, $D = (w_1, w_2, \ldots, w_n)$, where $w_i$ is the frequency (also called the weight) of the $i$th feature word $t_i$. Table 1 is an example of a VSM of three documents with ten feature words, where $D_1$, $D_2$, and $D_3$ are documents and $t_1, t_2, \ldots, t_{10}$ are feature words.

Therefore, we can represent $D_1$, $D_2$, and $D_3$ in the VSM as the corresponding row vectors of Table 1.
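To make the mapping concrete, here is a small Python sketch (an illustration of the idea rather than code from the paper; the toy tokens and the helper name build_vsm are ours) that builds term-frequency vectors for a tiny document collection:

```python
from collections import Counter

def build_vsm(documents):
    """Map each tokenized document to a term-frequency vector over a shared vocabulary."""
    vocabulary = sorted({token for doc in documents for token in doc})
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, vectors

docs = [["price", "stock", "market"],
        ["goal", "match", "stock"],
        ["match", "team", "goal", "goal"]]
vocab, vsm = build_vsm(docs)
print(vocab)    # the feature words t_1, ..., t_n
for row in vsm:
    print(row)  # one frequency vector per document
```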

(4) Feature Words Selection. After Stop Words removal and stemming, there are still tens of thousands of words in a training set, and it is impossible for a classifier to handle so many words. The purpose of feature word selection is to reduce the dimension of the feature space. It is a very important step for text classification: choosing appropriate representative words can improve classifiers’ performance. This paper mainly focuses on this stage.

(5) Classification and Building the Classification Model. After the above steps, all the texts in the training set have been mapped into the VSM. This step applies the classification algorithms to the VSM and builds a classification model that can assign an appropriate category to new input samples.

2.1.2. The Classification Process

The steps of word segmentation, Stop Words removal or stemming, and VSM representation are the same as in the training process. The classification process assigns an appropriate category to each new sample: when a new sample arrives, the classification model calculates its most likely class label and assigns it to the sample. Results evaluation is a very important step in the text classification process; it indicates whether the algorithm performs well or not.

2.2. Feature Selection

Although the Stop Words and stemming steps get rid of some useless words, the VSM still contains many insignificant words that may even have a negative effect on classification, and it is difficult for many classification algorithms to handle high-dimensional data sets. Hence, we need to reduce the dimensionality of the feature space to improve the classifiers’ performance. FS uses machine learning algorithms to choose the words that best represent the original text, which reduces the dimension of the feature space.

The definition of FS is as follows.

Select $m$ features from the original $n$ features ($m \le n$) such that the selected features represent the contents of the text more concisely and efficiently. The commonly used FS algorithms can be described as follows.

2.2.1. Information Gain (IG)

IG is widely used in the machine learning field as a criterion of the importance of a feature to a class. The value of IG equals the difference of the information entropy before and after the appearance of a feature in a class. The greater the IG, the greater the amount of information the feature contains and the more important it is in the TC. Therefore, we can filter the best feature subset according to the features’ IG. The computing formula can be described as follows:

\[ \mathrm{IG}(t) = -\sum_{i=1}^{M} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{M} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{M} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t}), \]

where $t$ is the feature word, $P(t)$ is the probability that a text containing $t$ appears in the training set, $P(\bar{t})$ is the probability that a text not containing $t$ appears in the training set, $P(C_i)$ is the probability of class $C_i$ in the training set, $P(C_i \mid t)$ is the probability of class $C_i$ for texts that contain $t$, $P(C_i \mid \bar{t})$ is the probability of class $C_i$ for texts that do not contain $t$, and $M$ is the total number of classes in the training set.

A shortcoming of the IG algorithm is that it also takes into account features that do not appear in a text. Although in some circumstances the nonappearing features contribute to the classification, their contribution is far less than that of the features that do appear in the training set. Moreover, if the training set is unbalanced and $P(t) \ll P(\bar{t})$, the value of IG is dominated by the term $P(\bar{t})\sum_{i=1}^{M} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$, and the performance of IG can be significantly reduced.
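As an illustration of how IG can be computed from document counts, the following hedged Python sketch estimates the probabilities by simple counting (the function name and the count-based estimates are our own assumptions, not the paper's implementation):

```python
import math

def information_gain(n_ci, n_t_ci, n_total, n_t):
    """IG of feature t.
    n_ci[i]   : number of documents in class C_i
    n_t_ci[i] : number of documents in class C_i that contain t
    n_total   : total number of documents
    n_t       : number of documents that contain t
    """
    def entropy_term(p):
        return p * math.log(p) if p > 0 else 0.0

    p_t = n_t / n_total
    p_not_t = 1.0 - p_t
    ig = 0.0
    for nc, ntc in zip(n_ci, n_t_ci):
        p_c = nc / n_total
        ig -= entropy_term(p_c)                      # -sum P(Ci) log P(Ci)
        p_c_t = ntc / n_t if n_t else 0.0
        ig += p_t * entropy_term(p_c_t)              # +P(t) sum P(Ci|t) log P(Ci|t)
        p_c_not_t = (nc - ntc) / (n_total - n_t) if n_total > n_t else 0.0
        ig += p_not_t * entropy_term(p_c_not_t)      # +P(~t) sum P(Ci|~t) log P(Ci|~t)
    return ig

print(information_gain(n_ci=[60, 40], n_t_ci=[30, 5], n_total=100, n_t=35))
```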

2.2.2. Expected Cross Entropy (ECE)

ECE is well known as KL distance, and it reflects the distance between the probability of the theme class and the probability of the theme class under the condition of a specific feature. The computing formula can be described as follows:

\[ \mathrm{ECE}(t) = P(t)\sum_{i=1}^{M} P(C_i \mid t)\log\frac{P(C_i \mid t)}{P(C_i)}, \]

where $t$ is the feature word, $P(t)$ is the probability that a text containing $t$ appears in the training set, $P(C_i)$ is the probability of class $C_i$ in the training set, $P(C_i \mid t)$ is the probability of class $C_i$ for texts that contain $t$, and $M$ is the total number of classes in the training set.

ECE has been widely used in feature selection for TC, and it also achieves good performance. Compared with the IG algorithm, ECE does not consider the case in which a feature does not appear, which reduces the interference of rare features that do not occur often and improves the performance of classification. However, it also has some limitations.

(1) If $P(C_i \mid t) > P(C_i)$, then $\log\frac{P(C_i \mid t)}{P(C_i)} > 0$. When $P(C_i \mid t)$ is larger and $P(C_i)$ is smaller, the logarithmic value is larger, which indicates that feature $t$ has a strong association with class $C_i$.

(2) If $P(C_i \mid t) < P(C_i)$, then $\log\frac{P(C_i \mid t)}{P(C_i)} < 0$. When $P(C_i \mid t)$ is smaller and $P(C_i)$ is larger, the logarithmic value is smaller, which indicates that feature $t$ has a weaker association with class $C_i$.

(3) When $P(C_i \mid t) = 0$, there is no association between feature $t$ and class $C_i$. In this case, the logarithmic expression is undefined. Introducing a small parameter $\varepsilon$, that is, writing $\log\frac{P(C_i \mid t) + \varepsilon}{P(C_i)}$, makes the logarithm well defined.

(4) If $P(C_i \mid t)$ and $P(C_i)$ are close to each other, then $\log\frac{P(C_i \mid t)}{P(C_i)}$ is close to zero. When $P(C_i \mid t)$ is large, this indicates that feature $t$ has a strong association with class $C_i$ and should be retained; however, because the logarithm is close to zero, feature $t$ will be removed. In this case, we add information entropy to the ECE algorithm. The formula of information entropy is as follows:

\[ H(t) = -\sum_{i=1}^{M} P(C_i \mid t)\log P(C_i \mid t). \]

In summary, combining the ECE formula with the information entropy, the adjusted ECE formula is as follows:

\[ \mathrm{ECE}'(t) = \frac{P(t)\sum_{i=1}^{M} P(C_i \mid t)\log\dfrac{P(C_i \mid t) + \varepsilon}{P(C_i)}}{H(t)}. \]

If a feature $t$ exists only in one class, the value of the information entropy is zero, that is, $H(t) = 0$. Hence, we should introduce a small parameter $\varepsilon$ in the denominator as a regulator, that is, use $H(t) + \varepsilon$.
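A minimal sketch of the basic ECE score (before the entropy adjustment discussed above); working directly from already-estimated probabilities and the default value of eps are our assumptions:

```python
import math

def expected_cross_entropy(p_t, p_c, p_c_given_t, eps=1e-4):
    """Basic ECE(t) = P(t) * sum_i P(Ci|t) * log((P(Ci|t) + eps) / P(Ci))."""
    score = 0.0
    for pc, pct in zip(p_c, p_c_given_t):
        score += pct * math.log((pct + eps) / pc)
    return p_t * score

# A feature concentrated in class 1 of a two-class problem.
print(expected_cross_entropy(p_t=0.3, p_c=[0.5, 0.5], p_c_given_t=[0.9, 0.1]))
```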

2.2.3. Mutual Information (MI)

In TC, MI expresses the association between feature words and classes. The MI between $t$ and $C_i$ is defined as

\[ \mathrm{MI}(t, C_i) = \log\frac{P(t \mid C_i)}{P(t)} = \log\frac{P(t, C_i)}{P(t)\,P(C_i)}, \]

where $t$ is the feature word, $P(t \mid C_i)$ is the probability of a text containing $t$ in class $C_i$, $P(t)$ is the probability that a text containing $t$ appears in the training set, and $P(C_i)$ is the probability of class $C_i$ in the training set.

It is necessary to calculate the MI between a feature and each category in the training set. The following two formulas can be used to combine these values:

\[ \mathrm{MI}_{\mathrm{avg}}(t) = \sum_{i=1}^{M} P(C_i)\,\mathrm{MI}(t, C_i), \qquad \mathrm{MI}_{\max}(t) = \max_{1 \le i \le M} \mathrm{MI}(t, C_i), \]

where $P(C_i)$ is the probability of class $C_i$ in the training set and $M$ is the total number of classes in the training set. In practice, either can be used; the maximum is a common choice.
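A short Python sketch of MI and its two aggregation strategies, assuming the probabilities have already been estimated (the helper names are ours):

```python
import math

def mutual_information(p_t_given_c, p_t):
    """MI(t, Ci) = log(P(t|Ci) / P(t)) for each class Ci."""
    return [math.log(ptc / p_t) if ptc > 0 else float("-inf")
            for ptc in p_t_given_c]

def mi_max(p_t_given_c, p_t):
    """Aggregate over classes by taking the maximum MI."""
    return max(mutual_information(p_t_given_c, p_t))

def mi_avg(p_t_given_c, p_t, p_c):
    """Aggregate over classes by the class-probability-weighted average."""
    scores = mutual_information(p_t_given_c, p_t)
    return sum(pc * mi for pc, mi in zip(p_c, scores))

print(mi_max([0.4, 0.05], p_t=0.2))
print(mi_avg([0.4, 0.05], p_t=0.2, p_c=[0.5, 0.5]))
```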

2.2.4. Odds Ratio (OR)

The computing formula can be described as follows:

\[ \mathrm{OR}(t) = \log\frac{P(t \mid \mathrm{pos})\,\bigl(1 - P(t \mid \mathrm{neg})\bigr)}{\bigl(1 - P(t \mid \mathrm{pos})\bigr)\,P(t \mid \mathrm{neg})}, \]

where $t$ is the feature word, $\mathrm{pos}$ represents the positive class and $\mathrm{neg}$ represents the negative class, $P(t \mid \mathrm{pos})$ is the probability of a text containing $t$ in the positive class, and $P(t \mid \mathrm{neg})$ is the conditional probability of a text containing $t$ in all the classes except the positive one.

The OR metric measures membership and nonmembership in a specific class with its numerator and denominator, respectively. Therefore, the numerator must be maximized and the denominator must be minimized to get the highest score according to the formula. The formula is a one-sided metric because the logarithm produces negative scores when the value of the fraction is between 0 and 1; in this case, features with negative values point to the negative class. Thus, if we only want to identify the positive class and do not care about the negative class, Odds Ratio has a great advantage. Odds Ratio is suitable for binary classifiers.
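A hedged sketch of the OR score for a binary problem; the smoothing constant eps is our own addition to keep the ratio finite at extreme probabilities:

```python
import math

def odds_ratio(p_t_pos, p_t_neg, eps=1e-6):
    """OR(t) = log( P(t|pos)(1 - P(t|neg)) / ((1 - P(t|pos)) P(t|neg)) )."""
    numerator = p_t_pos * (1.0 - p_t_neg) + eps
    denominator = (1.0 - p_t_pos) * p_t_neg + eps
    return math.log(numerator / denominator)

print(odds_ratio(p_t_pos=0.6, p_t_neg=0.1))   # > 0: t points to the positive class
print(odds_ratio(p_t_pos=0.05, p_t_neg=0.4))  # < 0: t points to the negative class
```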

2.2.5. $\chi^2$ Statistic (CHI)

The $\chi^2$ statistic measures the relevance between the feature $t$ and the class $C_i$. The higher the score, the stronger the relevance between the feature $t$ and the class $C_i$; that is, the feature has a greater contribution to the class. The computing formula can be described as follows:

\[ \chi^2(t, C_i) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}, \]

where $t$ is the feature word, $A$ is the number of texts containing feature $t$ and belonging to class $C_i$, $B$ is the number of texts containing feature $t$ but not belonging to class $C_i$, $C$ is the number of texts belonging to class $C_i$ but not containing the feature $t$, $D$ is the number of texts neither containing the feature $t$ nor belonging to class $C_i$, and $N$ is the total number of text documents in the training set.

When the feature $t$ and the class $C_i$ are independent, that is, $AD - CB = 0$, the feature contains no identification information for the class. We calculate the $\chi^2$ value for each category and then take the maximum over all categories to obtain the $\chi^2$ value of the feature for the entire training set:

\[ \chi^2(t) = \max_{1 \le i \le M} \chi^2(t, C_i). \]
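A small Python sketch of the $\chi^2$ score from the 2×2 contingency counts, aggregated over classes by the maximum as described above (the helper names are ours, not the paper's code):

```python
def chi_square(a, b, c, d):
    """chi^2(t, Ci) from the 2x2 contingency counts:
    a: texts in Ci containing t        b: texts outside Ci containing t
    c: texts in Ci without t           d: texts outside Ci without t"""
    n = a + b + c + d
    numerator = n * (a * d - c * b) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return numerator / denominator if denominator else 0.0

def chi_square_max(tables):
    """Score of a feature over all classes: the maximum chi^2(t, Ci)."""
    return max(chi_square(*table) for table in tables)

# One (a, b, c, d) table per class for the same feature t.
print(chi_square_max([(40, 10, 20, 130), (5, 45, 60, 90)]))
```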

3. The Gini Feature Selection Algorithm

3.1. The Improved Gini Index Algorithm
3.1.1. Gini Index (GI)

The GI is an impurity-based attribute splitting method. It is widely used in the CART algorithm [24], the SLIQ algorithm [25], and the SPRINT algorithm [26] to choose the splitting attribute, and it obtains very good classification accuracy. The GI algorithm can be described as follows.

Suppose that $S$ is a set of $s$ samples and that these samples belong to $m$ different classes $C_i$ ($i = 1, 2, \ldots, m$). According to the class labels, we can divide $S$ into $m$ subsets $S_i$ ($i = 1, 2, \ldots, m$). Suppose that $S_i$ is the sample set that belongs to class $C_i$ and that $s_i$ is the number of samples in $S_i$; then the GI of the set $S$ is

\[ \mathrm{Gini}(S) = 1 - \sum_{i=1}^{m} P_i^2, \]

where $P_i$ is the probability of class $C_i$ in $S$, calculated by $P_i = s_i / s$.

When $\mathrm{Gini}(S)$ reaches its minimum value 0, all the members in the set belong to the same class; that is, the maximum useful information can be obtained. When all of the samples in the set are distributed equally over the classes, $\mathrm{Gini}(S)$ reaches its maximum; that is, the minimum useful information can be obtained.

For an attribute $A$ with $v$ distinct values, $S$ is partitioned into $v$ subsets $S_1, S_2, \ldots, S_v$. The GI of $S$ with respect to the attribute $A$ is defined as

\[ \mathrm{Gini}_A(S) = \sum_{j=1}^{v} \frac{s_j}{s}\,\mathrm{Gini}(S_j), \]

where $s_j$ is the number of samples in subset $S_j$.

The main idea of the GI algorithm is that the attribute with the minimum value of $\mathrm{Gini}_A(S)$ is the best attribute and is chosen as the splitting attribute.
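To illustrate the splitting criterion, a brief Python sketch (our own illustration, not the paper's code) computes the node Gini and the size-weighted Gini of a candidate split:

```python
def gini(class_counts):
    """Gini(S) = 1 - sum_i P_i^2 for a node with the given per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_of_split(subsets):
    """Gini_A(S): size-weighted Gini over the subsets induced by attribute A."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * gini(counts) for counts in subsets)

print(gini([5, 5]))                      # maximally impure two-class node -> 0.5
print(gini([10, 0]))                     # pure node -> 0.0
print(gini_of_split([[8, 2], [1, 9]]))   # candidate split; smaller is better
```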

3.1.2. The Improved Gini Index (IGI) Algorithm

The original form of the GI algorithm was used to measure the impurity of attributes towards classification: the smaller the impurity, the better the attribute. However, many studies on GI theory have instead employed a measure of purity: the larger the value of the purity, the better the attribute. Purity is more suitable for text classification. The formula is as follows:

\[ \mathrm{Gini}(S) = \sum_{i=1}^{m} P_i^2. \]

This formula measures the purity of attributes towards categorization: the larger the value of the purity, the better the attribute. In TC we usually emphasize the high-frequency words, because the high-frequency words contribute more to judging the class of a text. But when the distribution of the training set is highly unbalanced, the lower-frequency feature words still make some contribution to judging the class of a text, although this contribution is far less significant than that of the high-frequency feature words. Therefore, we define the IGI algorithm as

\[ \mathrm{Gini}(t) = \sum_{i=1}^{M} P(t \mid C_i)^2\, P(C_i \mid t)^2, \]

where $t$ is the feature word, $M$ is the total number of classes in the training set, $P(t \mid C_i)$ is the probability of a text containing $t$ in class $C_i$, and $P(C_i \mid t)$ is the posterior probability of class $C_i$ for a text containing $t$.

This formula overcomes the shortcoming of the original GI by considering the feature’s conditional probability and combining it with the posterior probability. Hence, the IGI algorithm can reduce the adverse effect of an unbalanced training set.
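A minimal sketch of the IGI score, estimating P(t|Ci) and P(Ci|t) by simple document counting (the counting scheme and the function name are our assumptions):

```python
def improved_gini(n_ci, n_t_ci):
    """IGI(t) = sum_i P(t|Ci)^2 * P(Ci|t)^2.
    n_ci[i]   : number of documents in class C_i
    n_t_ci[i] : number of documents in class C_i containing t
    """
    n_t = sum(n_t_ci)  # documents containing t over the whole training set
    score = 0.0
    for nc, ntc in zip(n_ci, n_t_ci):
        p_t_given_c = ntc / nc if nc else 0.0
        p_c_given_t = ntc / n_t if n_t else 0.0
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score

# A feature concentrated in one class scores higher than an evenly spread one.
print(improved_gini(n_ci=[100, 100], n_t_ci=[40, 2]))
print(improved_gini(n_ci=[100, 100], n_t_ci=[20, 22]))
```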

3.2. TF-Gini Algorithm

The purpose of FS is to choose the most suitable representative features to represent the original text. In fact, the distinguishability of features differs: some features can effectively distinguish a text from others, but other features cannot. Therefore, we use weights to quantify the features’ distinguishability; features with higher distinguishability receive higher weights.

The TF-IDF algorithm is a classical algorithm for calculating feature weights. For a word $t$ in text $d$, the weight of $t$ in $d$ is as follows:

\[ w(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{n_t}, \]

where $\mathrm{tf}(t, d)$ is the frequency of $t$ in text $d$, $N$ is the number of text documents in the training set, and $n_t$ is the number of texts containing $t$ in the training set.

The TF-IDF algorithm is not suitable for TC, mainly because of the shortcoming of the IDF part. TF-IDF considers that a word with a higher frequency across different texts of the training set is less important, and this consideration is not appropriate for TC. Therefore, we use a purity Gini, $\mathrm{Gini}_\lambda(t)$, to replace the IDF part of the TF-IDF formula, where $t$ is the feature word, $\lambda$ is a nonzero exponent, $P(t \mid C_i)$ is the probability of a text containing $t$ in class $C_i$, and $M$ is the total number of classes in the training set. For a particular value of $\lambda$, $\mathrm{Gini}_\lambda(t)$ reduces to the original purity Gini formula of Section 3.1.2.

However, in TC, the experimental results indicate that the classification performance is best when $\lambda = -1/2$.

Therefore, the new feature weighting algorithm, known as TF-Gini, can be represented as

\[ w(t, d) = \mathrm{tf}(t, d) \times \mathrm{Gini}_\lambda(t). \]
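A sketch of the TF-Gini weighting: each term frequency is multiplied by a precomputed Gini score for the term. For concreteness the Gini scores below are placeholder numbers; in the paper they would come from the tuned purity Gini (or the IGI score of Section 3.1.2), and the helper name is ours:

```python
from collections import Counter

def tf_gini_vector(doc_tokens, vocabulary, gini_scores):
    """Weight each vocabulary term of a document by tf(t, d) * Gini(t)."""
    tf = Counter(doc_tokens)
    return [tf.get(term, 0) * gini_scores.get(term, 0.0) for term in vocabulary]

vocab = ["stock", "goal", "market"]
gini_scores = {"stock": 0.42, "goal": 0.35, "market": 0.18}  # placeholder feature scores
print(tf_gini_vector(["stock", "market", "stock", "goal"], vocab, gini_scores))
```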

3.3. Experimental and Analysis
3.3.1. Data Sets Preparation

In this paper, we choose both an English and a Chinese data set to verify the performance of the FS and FW algorithms. The English data set is Reuters-21578, from which we choose the ten largest categories. The Chinese data set is the Fudan University Chinese text data set (19,637 texts); we use 3,522 texts for training and 2,148 texts for testing.

3.3.2. Classifiers Selection

This paper mainly focuses on the text preprocessing stage, and the IGI and TF-Gini algorithms are its main contributions. To validate the two algorithms’ performance, we introduce some commonly used text classifiers, namely, the kNN, fkNN, and SVM classifiers. All the experiments are based on these classifiers.

(1) The kNN Classifier. The $k$ nearest neighbors (kNN) algorithm is a very popular, simple, and nonparametric text classification algorithm. The idea of the kNN algorithm is to choose for a new input sample the category that appears most often among its $k$ nearest neighbors. Its decision function can be described as follows:

\[ \mu_j(x) = \sum_{i=1}^{k} y(d_i, C_j)\,\mathrm{sim}(x, d_i), \]

where $x$ is a new input sample with unknown category, $d_i$ is the $i$th of its $k$ nearest texts in the training set (whose category is known), $y(d_i, C_j)$ is equal to 1 if sample $d_i$ belongs to class $C_j$ and 0 otherwise, and $\mathrm{sim}(x, d_i)$ indicates the similarity between the new input sample and the training text. The decision rule is

\[ c(x) = \arg\max_{j} \mu_j(x). \]
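A compact Python sketch of the similarity-weighted kNN decision rule above, using cosine similarity (the toy vectors and labels are invented):

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(x, training_set, k=3):
    """training_set: list of (vector, label); vote weighted by similarity."""
    neighbours = sorted(training_set, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vec, label in neighbours:
        scores[label] += cosine(x, vec)      # mu_j(x) accumulation
    return max(scores, key=scores.get)       # argmax decision rule

train = [([1, 0, 2], "finance"), ([0, 3, 1], "sports"), ([2, 0, 1], "finance")]
print(knn_classify([1, 0, 1], train, k=2))
```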

(2) The fkNN Classifier. The kNN classifier is a lazy classification algorithm without a training phase, and its performance is not ideal, especially when the distribution of the training set is unbalanced. Therefore, we improve the algorithm by using fuzzy logic inference. Its membership function can be described as follows:

\[ \mu_j(x) = \frac{\sum_{i=1}^{k} \mu_j(d_i)\,\|x - d_i\|^{-2/(b-1)}}{\sum_{i=1}^{k} \|x - d_i\|^{-2/(b-1)}}, \]

where $d_i$ is the $i$th of the $k$ nearest texts in the training set and $\mu_j(d_i)$ is the membership value of sample $d_i$ in class $C_j$, which is equal to 1 if $d_i$ belongs to $C_j$ and 0 otherwise. From the formula, it can be seen that fkNN is based on the kNN algorithm and endows the kNN formula with a distance weight. The parameter $b$ ($b > 1$) adjusts the degree of the weight. The fkNN decision rule is as follows:

\[ c(x) = \arg\max_{j} \mu_j(x). \]

(3) The SVM Classifier. The support vector machine (SVM) is a powerful classification algorithm based on statistical learning theory. SVM is highly accurate owing to its ability to model complex nonlinear decision boundaries. It uses a nonlinear mapping to transform the original data into a higher-dimensional space and, within this new space, searches for the optimal linear separating hyperplane, that is, the decision boundary. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. Originally, SVM solves the problem of two-class classification; the multiclass problem can be addressed using multiple binary classifications.

Suppose that in the feature space a set of training vectors $x_i$ ($i = 1, 2, \ldots, n$) belong to two different classes with labels $y_i \in \{-1, +1\}$. We wish to separate this training set with a hyperplane:

\[ \mathbf{w} \cdot \mathbf{x} + b = 0. \]

Obviously, there are an infinite number of hyperplanes that are able to partition the data set into two sets. However, according to SVM theory, there is only one optimal hyperplane, which lies halfway within the maximal margin, defined as the sum of the distances from the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.

The central problem of SVM is to find the maximum-margin separating hyperplane, that is, the one with the maximum distance to the nearest training samples. In Figure 2, $H$ is the optimal hyperplane, and $H_1$ and $H_2$ are the hyperplanes parallel to $H$ that pass through the nearest samples of the two classes. The task of SVM is to maximize the margin between $H_1$ and $H_2$. The margin can be written as follows:

\[ \mathrm{margin} = \frac{2}{\|\mathbf{w}\|}. \]

Therefore, from Figure 2, we can see that solving the SVM amounts to maximizing the margin between $H_1$ and $H_2$:

\[ \max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \; i = 1, 2, \ldots, n. \]

This is the same as minimizing $\|\mathbf{w}\|^2 / 2$. Therefore, it becomes the following optimization problem:

\[ \min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \; i = 1, 2, \ldots, n. \]

The points that satisfy $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) = 1$, namely, the points lying on $H_1$ and $H_2$ in Figure 2, are called support vectors. In practice, some samples cannot be classified correctly by any hyperplane. Hence, we introduce slack variables $\xi_i \ge 0$, and the above formula becomes

\[ \min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \]

where $C$ is a positive constant that balances the empirical risk and the confidence margin and $n$ is the total number of training samples.
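In practice, the optimization above is usually handed to a library rather than solved by hand. The following sketch assumes scikit-learn is available and trains a linear soft-margin SVM on toy vectors (illustrative only, not the paper's data or code):

```python
from sklearn.svm import SVC

# Toy 2-dimensional feature vectors for two classes labeled +1 / -1.
X = [[2.0, 2.0], [1.5, 2.5], [2.5, 1.5],
     [-2.0, -2.0], [-1.5, -2.5], [-2.5, -1.5]]
y = [1, 1, 1, -1, -1, -1]

# C is the positive constant that trades margin width against slack penalties.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)                 # the hyperplane parameters w and b
print(clf.predict([[1.0, 1.0], [-1.0, -1.0]]))   # classify two new points
```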

3.3.3. Performance Evaluation

Confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. Table 2 shows the confusion matrix.

True positive (TP) is the number of correct predictions that an instance is positive. False positive (FP) is the number of incorrect predictions that an instance is positive. False negative (FN) is the number of incorrect predictions that an instance is negative. True negative (TN) is the number of correct predictions that an instance is negative.

Precision is the proportion of the predicted positive cases that are correct, as calculated using the equation

\[ \mathrm{Precision} = \frac{TP}{TP + FP}. \]

Recall is the proportion of the actual positive cases that are correctly predicted, as calculated using the equation

\[ \mathrm{Recall} = \frac{TP}{TP + FN}. \]

Ideally, both precision and recall should be high; however, the two are often in tension. For example, if only one instance is predicted as positive and the prediction is correct, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is large. On the other hand, if every instance is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low, because FP is large.

$F_1$ [28] is the harmonic mean of precision and recall: if $F_1$ is high, the algorithm and experimental method are performing well. $F_1$ measures the performance of a classifier by considering both of the above measures and was proposed by van Rijsbergen [29] in 1979. It can be described as follows:

\[ F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \]

Precision and recall only evaluate the performance of a classification algorithm on a single category. It is often necessary to evaluate classification performance over all the categories. When dealing with multiple categories there are two possible ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each of them. The microaverage weights all documents equally, thus favoring the performance on common classes. The formulas of the macroaverage and microaverage are as follows, where $M$ is the number of classes in the training set:

\[ \mathrm{Precision}_{\mathrm{macro}} = \frac{1}{M}\sum_{i=1}^{M} \mathrm{Precision}_i, \qquad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{M}\sum_{i=1}^{M} \mathrm{Recall}_i, \]

\[ \mathrm{Precision}_{\mathrm{micro}} = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FP_i)}, \qquad \mathrm{Recall}_{\mathrm{micro}} = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FN_i)}. \]
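A short Python sketch of the evaluation measures, computing per-class precision, recall, and F1 and then the macro- and microaverages from per-class (TP, FP, FN) counts (the data layout is our assumption):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_micro(per_class):
    """per_class: one (tp, fp, fn) tuple per category."""
    pr = [precision_recall_f1(*c)[:2] for c in per_class]
    macro_p = sum(p for p, _ in pr) / len(pr)
    macro_r = sum(r for _, r in pr) / len(pr)
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    micro_p, micro_r, _ = precision_recall_f1(tp, fp, fn)
    return macro_p, macro_r, micro_p, micro_r

print(precision_recall_f1(tp=80, fp=20, fn=10))   # one class
print(macro_micro([(80, 20, 10), (5, 1, 15)]))    # macro vs micro over two classes
```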

3.3.4. Experimental Results and Analysis

Figures 3 and 4 are the results on the English data set. The performance of classical GI is the lowest, and the IGI is the highest. Therefore, the IGI algorithm is helpful to improve the classifiers’ performance.

From Figures 5 and 6, it can be seen that, on the Chinese data set, the classical GI still has poor performance. The best performance belongs to ECE, and the second best is IG. The performance of IGI is not the best here. Since Chinese differs from English, the results of the Chinese word segmentation and Stop Words processing influence the classification accuracy. Meanwhile, these experiments do not yet consider the feature words’ weights; the TF-Gini algorithm below provides a good solution to this problem.

Although the IGI algorithm does not achieve the best performance on the Chinese data set, its performance is very close to the highest. Hence, the IGI feature selection algorithm is suitable and acceptable for classification and promotes the classifiers’ performance effectively.

Figures 7, 8, 9, and 10 are the results for the English and Chinese data sets. To determine which algorithm performs best, we compare different TF-based weighting algorithms, such as TF-IDF and TF-Gini.

From Figures 7 and 8, we can see that the TF-Gini weighting algorithm’s performance is good and acceptable on the English data set; it is very close to the highest. From Figures 9 and 10, we can see that on the Chinese data set the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and achieves the best performance. These results show that TF-Gini is a very promising feature weighting algorithm.

4. Sensitive Information Identification System Based on TF-Gini Algorithm

4.1. Introduction

With the rapid development of network technologies, the Internet has become a huge repository of information. At the same time, a lot of harmful information is poured onto the Internet, such as viruses, violence, and gambling. All of this is called sensitive information, and it produces extremely adverse impacts and serious consequences.

The commonly used methods of Internet information filtering and monitoring are based on string matching. However, natural texts contain a large number of synonyms and ambiguous words, and many new words and expressions appear on the Internet every day. Therefore, traditional sensitive information filtering and monitoring are not effective. In this paper, we propose a new sensitive information filtering system based on machine learning, which uses the TF-Gini text feature selection algorithm. This system can discover sensitive information in a fast, accurate, and intelligent way.

4.2. System Architecture

The sensitive information system includes the text preprocessing, feature selection, text representation, text classification, and result evaluation modules. Figure 11 shows the system architecture.

4.2.1. The Chinese Word Segmentation

The function of this module is the segmentation of Chinese words and the removal of Stop Words. In this module, we use IKanalyzer to complete Chinese word segmentation. IKanalyzer is a lightweight, open source Chinese word segmentation tool.

(1) IKanalyzer uses a multiprocessor analysis module and supports the processing of English letters, Chinese vocabulary, and individual words.

(2) It optimizes dictionary storage and supports user-defined dictionaries.

4.2.2. The Feature Selection

At the feature selection stage, the TF-Gini algorithm has good performance on Chinese data sets. Feature word selection is a very important preparation step for classification. In this step, we use the TF-Gini algorithm as the core feature selection algorithm.

4.2.3. The Text Classification

We use the Naïve Bayes algorithm [30] as the main classifier. In order to verify the performance of the Naïve Bayes classifier in this system, we use the kNN algorithm and the improved Naïve Bayes [31] algorithm for comparison.

The kNN algorithm is a typical text classification algorithm and requires no training phase. In order to calculate the similarity between the test sample $d$ and a training sample $d_i$, we adopt the cosine similarity, which can be described as follows:

\[ \mathrm{sim}(d, d_i) = \frac{\sum_{k=1}^{n} w_k(d)\,w_k(d_i)}{\sqrt{\sum_{k=1}^{n} w_k(d)^2}\,\sqrt{\sum_{k=1}^{n} w_k(d_i)^2}}, \]

where $w_k(d)$ is the $k$th feature word’s weight in $d$, $w_k(d_i)$ is the $k$th feature word’s weight in sample $d_i$, and $n$ is the dimension of the feature space.

The Naïve Bayes classifier is based on Bayes’ theorem. Assume $M$ classes $C_1, C_2, \ldots, C_M$ and a given unknown text $d$ with no class label. The classification algorithm predicts that $d$ belongs to the class with the highest posterior probability given $d$. It can be described as follows:

\[ d \in C_i \quad \text{if } P(C_i \mid d) > P(C_j \mid d) \text{ for all } j \ne i, \qquad \text{where } P(C_i \mid d) = \frac{P(d \mid C_i)\,P(C_i)}{P(d)}. \]

Naïve Bayes classification is based on a simple assumption: given the sample’s class label, the attributes are conditionally independent. Since $P(d)$ is constant over all classes and $P(C_i)$ can be estimated from the training set, the calculation of $P(d \mid C_i)$ can be simplified as follows:

\[ P(d \mid C_i) = \prod_{k=1}^{n} P(w_k \mid C_i), \]

where $w_k$ is the $k$th feature word of text $d$, $P(w_k \mid C_i)$ is its conditional probability in class $C_i$, and $n$ is the number of features in $d$.
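A minimal multinomial Naïve Bayes sketch with Laplace smoothing (our own simplified illustration, not the system's implementation; the toy tokens are invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors and smoothed word likelihoods."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, likelihoods, vocab

def classify_nb(tokens, priors, likelihoods, vocab):
    """argmax_c log P(c) + sum_k log P(w_k | c), ignoring unseen words."""
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            if w in vocab:
                score += math.log(likelihoods[c][w])
        if score > best_score:
            best, best_score = c, score
    return best

train = [(["bomb", "attack"], "sensitive"), (["weather", "sunny"], "normal"),
         (["attack", "plan"], "sensitive"), (["sunny", "picnic"], "normal")]
priors, likes, vocab = train_nb(train)
print(classify_nb(["attack", "sunny", "plan"], priors, likes, vocab))
```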

4.3. Experimental Results and Analysis

In order to verify the performance of the sensitive information system, we carry out two experiments, using tenfold cross validation and threefold cross validation on different data sets.

Experiment 1. The Chinese data set has 1,000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1,000 texts are randomly divided into 10 portions. One portion is selected as the test set and the other 9 portions are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Experiment 2. The Chinese data set has 1,500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts. One part is used as the test set and the other two as the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.
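As a sketch of the cross-validation protocol used in both experiments, the following Python snippet splits sample indices into k folds (the helper name and the fixed seed are ours):

```python
import random

def k_fold_indices(n_samples, k, seed=42):
    """Shuffle sample indices and split them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n_samples=1000, k=10)
for i, test_idx in enumerate(folds):
    train_idx = [j for fold in folds if fold is not test_idx for j in fold]
    # train the classifier on train_idx, evaluate on test_idx,
    # then average the k evaluation results
    print(f"fold {i}: {len(train_idx)} training texts, {len(test_idx)} test texts")
```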

From the results of Experiments 1 and 2, we can see that all three classifiers perform well, with the Naïve Bayes and Improved Naïve Bayes algorithms performing best. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers’ accuracy. In this paper, we propose a feature selection algorithm, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of features differs between categories. We assign weights to different feature words: the more important the feature word, the larger its weight. Following the TF-IDF structure, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct the TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini’s performance is the best on the Chinese data sets and is very close to the best on the English data sets. Hence, the TF-Gini algorithm can improve a classifier’s performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose it as the text preprocessing algorithm of the system, and we choose the Naïve Bayes classifier as the core classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partly supported by the engineering planning project Research on Technology of Big Data-Based High Efficient Classification of News funded by Communication University of China (3132015XNG1504), partly supported by The Comprehensive Reform Project of Computer Science and Technology funded by Ministry of Education of China (ZL140103 and ZL1503), and partly supported by The Excellent Young Teachers Training Project funded by Communication University of China (YXJS201508 and XNG1526).