Abstract

Protein interaction article classification is a text classification task in the biological domain that determines which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used to reduce the dimensionality of the features and speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measures of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. To address this, we first design a similarity measure over context information that takes word cooccurrences and phrase chunks around the features into account. We then introduce this context similarity into the importance measure of the features, in place of document and term frequency, and thereby propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.

1. Introduction

An overwhelming number of biological articles are published daily online as a result of growing interest in biological research, especially relating to the study of protein-protein interactions (PPIs). It is essential to classify which articles describe PPIs, that is, to filter out the irrelevant articles from the whole collection of the biological literature. This allows a more efficient extraction of PPIs from the large amount of biological literature. Automated text classification is a key technology for rapidly finding relevant articles. Text classification has been successfully applied to various domains such as text sentiment classification [1], spam e-mail filtering [2, 3], author identification [4], and web page classification [5]. Protein interaction article classification (IAC) is a text classification task of practical significance in the biological domain.

In the classic text classification framework, a feature extraction mechanism extracts features from the raw articles, namely all the distinct terms (words); this is also known as the bag-of-words (BOW) representation of text documents. Each article is thus represented by a multidimensional feature vector where each dimension corresponds to a term (feature) within the literature collection. Even a small literature collection can contain tens of thousands of features [6, 7]. The high dimensionality of the feature space not only increases computational time but also degrades classification performance. Hence, automated feature selection plays an essential role in making text classification more efficient and accurate by selecting a subset of the most important features [8, 9]. Feature selection is an active research area in many fields such as data mining, machine learning, and rough sets [10-13].

The process of feature selection typically involves metrics designed to measure the importance level of features, and the most important features are selected to help in efficient utilization of resources for large scale problems [14]. The existing feature selection methods are mostly based on statistical information in documents, namely term frequency and document frequency [7, 14-18]. Term frequency is the number of times a particular term appears in a document, while document frequency is the number of documents containing that term within the literature collection. One potential drawback of most of these frequency-based feature selection methods is that they treat each feature separately [19]. In other words, these approaches are context independent: they do not utilize the context information around the terms when judging their importance, such as word order, word cooccurrence, multiword chunks, and semantic relationships. However, this information is important for classifying which articles are PPI relevant or nonrelevant. For example, protein names appear in both PPI relevant and nonrelevant documents, so they can have a high document frequency or term frequency; yet they are obviously not distinctive terms for the purpose of classification. Hence, it is difficult to measure the importance of all the terms solely according to document frequency or term frequency. Upon closer inspection we have noticed that, in PPI relevant documents, the fact that proteins interact with each other is described in the context surrounding those proteins. Meanwhile, in nonrelevant documents, the fact that there are no interactions between the particular proteins is likewise depicted within the context of the documents. This observation suggests that the context of features in biological articles can be utilized to measure feature importance and to improve the feature selection process. Hence we propose context similarity-based feature selection methods.

This paper is organized as follows: Section 2 provides an overview of the existing frequency-based feature selection methods for text classification, followed by the definition of the proposed context similarity-based feature selection methods. Then, in order to examine the two kinds of methods carefully, experimental results and discussion are presented in Section 3 to determine which is more useful for the IAC task. Section 4 concludes the paper.

2. Materials and Methods

2.1. Existing Feature Selection Methods for Text Classification

Feature selection is a process which selects a subset of the most important features. Such selection can help in building effective and efficient models for text classification. Normally, feature selection techniques can be divided into three categories: filters, wrappers, and embedded methods [19]. Filters measure feature importance using scoring metrics that are independent of any learning model or classifier and select the top-k features attaining the highest scores. Univariate filter techniques are computationally fast; however, they do not take feature dependencies into consideration, which was discussed as the motivation of this paper in Section 1. Multivariate filter techniques incorporate feature dependencies to some degree, but they are slower and less scalable than univariate techniques. Wrappers evaluate features using a search algorithm together with a specific learning model or classifier; they consider feature dependencies and allow interaction between the feature subset search and the model, but they are computationally expensive compared with filters. Embedded methods integrate feature selection into the model learning phase and are therefore coupled with the model or classifier even more tightly than wrappers; nevertheless, they are also computationally more intensive than filters.

Considering the high dimensionality of the feature space in text classification tasks, the most frequently used approach for feature selection is the univariate filter method [7]. Among such methods, the four document frequency-based metrics and two term frequency-based metrics discussed in this paper are described below. Throughout, $t$ denotes a term (feature) and $c_i$ a category; $P(t \mid c_i)$ is the percentage of documents belonging to category $c_i$ in which the term $t$ occurs, and $P(t \mid \bar{c}_i)$ is the percentage of documents not belonging to $c_i$ in which $t$ occurs. $M$ is the number of categories, which is 2 for the IAC task. Writing $df_{t,c_i}$ for the number of documents of $c_i$ containing $t$ and $N_{c_i}$ for the total number of documents in $c_i$, we have $P(t \mid c_i) = df_{t,c_i} / N_{c_i}$. A small implementation sketch of these metrics is given at the end of this subsection.

(1) Document Frequency (DF). Document frequency (DF) is a simple and effective feature selection method based on the assumption that infrequent terms are unreliable for text classification and may degrade performance [7]. Hence, the terms with the largest document frequency are retained [20]. The DF metric of a term $t$ can be computed as
$$\mathrm{DF}(t, c_i) = df_{t,c_i}, \qquad \mathrm{DF}(t) = \sum_{i=1}^{M} \mathrm{DF}(t, c_i),$$
where $\mathrm{DF}(t, c_i)$ is the DF measure of the term $t$ in a category $c_i$ and $\mathrm{DF}(t)$ is its sum across all the categories.

(2) Gini Index (GI). The Gini Index (GI) was originally used to find the best attributes in decision trees. Shang et al. [15] proposed an improved version of the GI method so that it can be applied directly to text feature selection. The per-category score $\mathrm{GI}(t, c_i)$ measures the purity of the feature $t$ towards a category $c_i$. Its sum across categories, $\mathrm{GI}(t)$, is given as
$$\mathrm{GI}(t) = \sum_{i=1}^{M} P(t \mid c_i)^2 \, P(c_i \mid t)^2,$$
where $P(c_i \mid t)$ is the conditional probability that a document belongs to category $c_i$ given the presence of the feature $t$.

(3) Class Discriminating Measure (CDM). The class discriminating measure (CDM) is a derivative of the odds ratio introduced by Chen et al. [16]. The results in their paper indicate that CDM is a better feature selection approach than information gain (IG). The CDM calculates the effectiveness of a term $t$ as follows:
$$\mathrm{CDM}(t, c_i) = \left| \log \frac{P(t \mid c_i)}{P(t \mid \bar{c}_i)} \right|, \qquad \mathrm{CDM}(t) = \sum_{i=1}^{M} \mathrm{CDM}(t, c_i),$$
where $\mathrm{CDM}(t, c_i)$ is the CDM measure of the term $t$ in a category $c_i$ and $\mathrm{CDM}(t)$ is its sum across all the categories.

(4) Accuracy Balanced (Acc2). Accuracy balanced (Acc2) is a two-sided metric (it selects both negative and positive features) based on the difference between the distributions of a term in documents belonging to a category and in documents not belonging to that category. In Forman [14], Acc2 is studied and shown to have performance comparable to the IG and chi-square statistical metrics. The Acc2 of a term $t$ can be computed as
$$\mathrm{Acc2}(t, c_i) = \left| P(t \mid c_i) - P(t \mid \bar{c}_i) \right|, \qquad \mathrm{Acc2}(t) = \sum_{i=1}^{M} \mathrm{Acc2}(t, c_i),$$
where $\mathrm{Acc2}(t, c_i)$ is the Acc2 measure of the term $t$ in a category $c_i$ and $\mathrm{Acc2}(t)$ is its sum across all the categories.

(5) Term Frequency Inverse Document Frequency (TFIDF). Term frequency-inverse document frequency (TFIDF) is a numerical statistic intended to reflect how important a term is to a document in a collection or corpus. One of the simplest filter metrics is computed by summing the TFIDF values. Wei et al. [21] introduced category information into TFIDF, which can be reformulated using the term frequency $tf(t, c_i)$, the number of occurrences of a term $t$ in documents from a category $c_i$:
$$\mathrm{TFIDF}(t) = \sum_{i=1}^{M} tf(t, c_i) \times \log \frac{N}{\mathrm{DF}(t)},$$
where $N$ is the total number of documents in the collection and $\mathrm{DF}(t)$ is the number of documents containing $t$.

(6) Normalized Term Frequency-Based Gini Index ($\mathrm{GI}_{ntf}$). The normalized term frequency-based Gini Index ($\mathrm{GI}_{ntf}$) was proposed by Azam and Yao [17], who revised the Gini Index metric by replacing the document frequency with the term frequency. Their experimental results revealed that the term frequency-based metric is useful in feature selection. We reformulate $\mathrm{GI}_{ntf}$ in terms of $ntf(t, c_i)$ and $ntf(t, \bar{c}_i)$, the normalized term frequencies of $t$ in documents belonging and not belonging to a category $c_i$, respectively. The normalized values of term frequency are used in the metric so that term frequencies are not influenced by the varying lengths of documents.
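To make the notation above concrete, the following Python sketch computes the per-category statistics $df_{t,c_i}$ and $P(t \mid c_i)$ from a small labeled collection and scores each term with DF, GI, CDM, and Acc2. It is a minimal illustration under the definitions of this subsection rather than the original implementation; the toy tokenized documents and the smoothing constant used to avoid log(0) are assumptions.

```python
from collections import defaultdict
from math import log

def category_statistics(docs, labels):
    """Count df_{t,c}: the number of documents of category c that contain term t."""
    df = defaultdict(lambda: defaultdict(int))
    n_docs = defaultdict(int)
    for tokens, c in zip(docs, labels):
        n_docs[c] += 1
        for t in set(tokens):          # document frequency: each document counted once per term
            df[c][t] += 1
    return df, n_docs

def frequency_based_scores(docs, labels, eps=1e-6):
    """Score every term with the DF, GI, CDM, and Acc2 metrics (summed over categories)."""
    df, n_docs = category_statistics(docs, labels)
    categories = list(n_docs)
    vocab = {t for c in categories for t in df[c]}
    scores = {}
    for t in vocab:
        s = {"DF": 0.0, "GI": 0.0, "CDM": 0.0, "Acc2": 0.0}
        df_t = sum(df[c][t] for c in categories)               # documents containing t overall
        for c in categories:
            p_tc = df[c][t] / n_docs[c]                         # P(t | c)
            n_not = sum(n_docs[o] for o in categories if o != c)
            df_not = sum(df[o][t] for o in categories if o != c)
            p_tnc = df_not / n_not                              # P(t | not c)
            p_ct = df[c][t] / df_t                              # P(c | t)
            s["DF"] += df[c][t]
            s["GI"] += (p_tc ** 2) * (p_ct ** 2)
            s["CDM"] += abs(log((p_tc + eps) / (p_tnc + eps)))  # eps avoids log(0)
            s["Acc2"] += abs(p_tc - p_tnc)
        scores[t] = s
    return scores

# Toy usage: two PPI relevant (label 1) and two nonrelevant (label 0) tokenized abstracts.
docs = [["proteina", "interacts", "with", "proteinb"],
        ["binding", "of", "proteinc", "interacts"],
        ["cell", "culture", "protocol"],
        ["cell", "growth", "assay"]]
labels = [1, 1, 0, 0]
print(frequency_based_scores(docs, labels)["interacts"])
```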

2.2. Context Similarity-Based Feature Selection Methods

According to the bag-of-words document representation, each raw document in the article collection is transformed into a high-dimensional vector before the process of text classification. In order to address the issue of high dimensionality, feature filter methods such as the DF, GI, CDM, and Acc2 introduced above are utilized to select the most important features based on document frequency. One potential problem of these frequency-based methods is that they ignore the context relationships between features. As discussed in Section 1, context information is essential for the IAC task. When judging the importance levels of features, it may be advantageous to explicitly compare the similarity shared among their contexts in PPI relevant articles or in nonrelevant articles. Hence, when building the feature selection metrics, we take the significance of the context information of each feature into account through the context similarity.

Context Similarity Measure. The context similarity measure $CS(t, c_i)$ is designed to explicitly express the similarity shared by the contexts of a term $t$ within a certain category $c_i$. The measure is based on the word cooccurrences and phrase chunks of a pair of context strings $s_j^w(t)$ and $s_k^w(t)$ containing the term $t$ within a category $c_i$. Here $s_j^w(t)$ denotes the context string of $t$ in a document $d_j$, where the window size $w$ means that $w$ terms before and $w$ terms after the term $t$ are taken into account; $s_k^w(t)$ is the context string of $t$ in another document $d_k$ with the same window size. Using these context strings, multiword phrase chunks containing $t$ and its word cooccurrences can be considered when measuring the importance of $t$.

First, $sim(s_j(t), s_k(t))$ is defined to measure the similarity between a pair of context strings as follows:
$$sim\bigl(s_j(t), s_k(t)\bigr) = \sum_{w=1}^{W} \mathrm{dist}\bigl(s_j^w(t), s_k^w(t)\bigr).$$

The sum over the context strings from window size $w = 1$ to the maximum window size $W$ is used to incorporate word cooccurrence and phrase similarity comprehensively. $W$ controls the scope of the local information of the term $t$ involved in the measurement, and trials on the training data show that $W = 3$ is the optimal value. In this paper, the Jaro-Winkler distance [22] is employed as the distance function $\mathrm{dist}$ between two context strings, because it was designed for, and is best suited to, short strings. The Jaro-Winkler distance is a measure of similarity between two strings and is a variant of the Jaro distance metric [23, 24]. The higher the Jaro-Winkler distance between two strings is, the more similar the strings are. The score is normalized such that 0 equates to no similarity and 1 to an exact match.

Then, $CS(t, c_i)$ is defined to measure the similarity of context across the documents of category $c_i$ that contain the term $t$, by aggregating $sim(\cdot, \cdot)$ over the pairs of context strings of $t$ drawn from those documents.
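The following sketch illustrates the context-string machinery: context strings of a term are extracted for window sizes $w = 1, \ldots, W$, and a pair of context strings is compared by summing Jaro-Winkler similarities over the window sizes. The Jaro-Winkler implementation is assumed to come from the third-party jellyfish package, and the string-assembly details are illustrative rather than the authors' original code.

```python
import jellyfish  # third-party package; jaro_winkler_similarity is assumed available

def context_strings(tokens, term, max_window):
    """For each occurrence of `term`, build its context strings for window
    sizes w = 1..max_window (w tokens before and w tokens after the term)."""
    contexts = []
    for pos, tok in enumerate(tokens):
        if tok != term:
            continue
        per_window = []
        for w in range(1, max_window + 1):
            left = tokens[max(0, pos - w):pos]
            right = tokens[pos + 1:pos + 1 + w]
            per_window.append(" ".join(left + [term] + right))
        contexts.append(per_window)
    return contexts

def sim(ctx_j, ctx_k):
    """Similarity of a pair of context strings: Jaro-Winkler similarity summed
    over the window sizes w = 1..W (a sketch of the sim measure in the text)."""
    return sum(jellyfish.jaro_winkler_similarity(a, b) for a, b in zip(ctx_j, ctx_k))

# Toy usage with W = 3, the window size reported as optimal in the experiments.
doc_j = "we show that proteina interacts strongly with proteinb in vivo".split()
doc_k = "here proteinc interacts weakly with proteind under stress".split()
ctx_j = context_strings(doc_j, "interacts", max_window=3)[0]
ctx_k = context_strings(doc_k, "interacts", max_window=3)[0]
print(sim(ctx_j, ctx_k))
```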

Context Similarity-Based Feature Selection Methods. In order to elaborate the context similarity-based feature selection metrics, the class discriminating measure (CDM) is taken as an example, since it has proven very useful in reducing the feature set in some application domains. The CDM metric has been defined in Section 2.1 based on $P(t \mid c_i)$ and $P(t \mid \bar{c}_i)$. Here $P(t \mid c_i)$, the percentage of documents belonging to the category $c_i$ that contain the term $t$, can also be represented as $df_{t,c_i} / N_{c_i}$, where $df_{t,c_i}$ is the document frequency of $t$ in the category $c_i$ and $N_{c_i}$ is the total number of articles in $c_i$. Similarly, $P(t \mid \bar{c}_i)$, the percentage of documents not belonging to $c_i$ that contain the term $t$, can be represented as $df_{t,\bar{c}_i} / N_{\bar{c}_i}$, where $df_{t,\bar{c}_i}$ is the document frequency of $t$ outside the category $c_i$ and $N_{\bar{c}_i}$ is the total number of articles not in $c_i$. Hence, the CDM metric can be written as
$$\mathrm{CDM}(t, c_i) = \left| \log \frac{df_{t,c_i} / N_{c_i}}{df_{t,\bar{c}_i} / N_{\bar{c}_i}} \right|.$$

In order to make use of the context information of terms rather than just the document frequency, we substitute the context similarity measure $CS(t, c_i)$ for the document frequency $df_{t,c_i}$. The resulting metric is referred to as $\mathrm{CDM}_{CS}$, the class discriminating measure based on context similarity. The greater the context similarity of a term within a certain text category is, the more important the term is for text classification. The definition of $\mathrm{CDM}_{CS}$ is
$$\mathrm{CDM}_{CS}(t, c_i) = \left| \log \frac{CS(t, c_i) / N_{c_i}}{CS(t, \bar{c}_i) / N_{\bar{c}_i}} \right|, \qquad \mathrm{CDM}_{CS}(t) = \sum_{i=1}^{M} \mathrm{CDM}_{CS}(t, c_i).$$

The other three document frequency-based metrics defined in Section 2.1 can be reformed in the same way based on the context similarity, yielding $\mathrm{DF}_{CS}$, $\mathrm{GI}_{CS}$, and $\mathrm{Acc2}_{CS}$: in each formula the document frequency $df_{t,c_i}$ is replaced by $CS(t, c_i)$, and $df_t$, the number of documents containing the term $t$ across all the text categories, is replaced analogously.
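A sketch of how the reformed metrics can be assembled is given below. It assumes that $CS(t, c_i)$ is obtained by averaging the pairwise similarities of the term's context strings within the category (the paper's exact aggregation may differ) and that the document frequencies in the CDM formula are simply replaced by the corresponding CS values, as described above.

```python
from itertools import combinations
from math import log

def context_similarity(term_contexts, pair_sim):
    """CS(t, c): aggregate the pairwise similarities of the context strings of a
    term collected from the documents of one category (mean aggregation is an
    assumption of this sketch)."""
    pairs = list(combinations(term_contexts, 2))
    if not pairs:
        return 0.0
    return sum(pair_sim(a, b) for a, b in pairs) / len(pairs)

def cdm_cs(cs_in, n_in, cs_out, n_out, eps=1e-6):
    """CDM_CS for one term and one category: the CDM formula with the document
    frequencies replaced by the context similarity values."""
    return abs(log((cs_in / n_in + eps) / (cs_out / n_out + eps)))

# Usage outline, reusing context_strings() and sim() from the previous sketch:
#   cs_rel = context_similarity(contexts_of_term_in_relevant_docs, sim)
#   cs_non = context_similarity(contexts_of_term_in_nonrelevant_docs, sim)
#   score = cdm_cs(cs_rel, n_relevant_docs, cs_non, n_nonrelevant_docs)
```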

3. Results and Discussion

3.1. Experimental Settings

Classification Model. Support vector machines (SVMs), pioneered by Vapnik [25], are suitable for complex classification problems. Their power comes from the combination of the kernel trick and maximum-margin hyperplane separation. SVMs are one of the most successful approaches to classification in text mining [26, 27]. Hence, in this paper, we employ an SVM with a polynomial kernel as the classification model, which is trained and tested using the LIBSVM toolbox [28]. A 10-fold cross-validation is adopted to tune the parameters.
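As a concrete illustration of this setup, the sketch below trains a polynomial-kernel SVM on a bag-of-words representation restricted to a selected feature subset, with 10-fold cross-validation for parameter tuning. The original experiments used the LIBSVM toolbox directly; here scikit-learn's SVC (which wraps libsvm) is used instead, and the kernel degree and the C grid are assumed settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_iac_classifier(train_texts, train_labels, selected_terms):
    """Train a polynomial-kernel SVM on a bag-of-words representation restricted
    to the selected feature subset, tuning C by 10-fold cross-validation."""
    pipeline = make_pipeline(
        CountVectorizer(vocabulary=sorted(selected_terms), lowercase=True),
        SVC(kernel="poly", degree=2),   # polynomial kernel; the degree is an assumed setting
    )
    search = GridSearchCV(pipeline, {"svc__C": [0.1, 1, 10]}, cv=10, scoring="f1")
    search.fit(train_texts, train_labels)
    return search.best_estimator_
```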

Data Sets. An in-depth investigation is carried out to compare the performance of the four proposed context similarity-based methods with that of the six existing frequency-based feature selection methods. Two data sets, referred to below as the IAS data set and the ACT data set, are used in our experiments to evaluate the performance; both are extracted from the BioCreAtIvE (Critical Assessment of Information Extraction in Biology) challenges. The challenges were set up to evaluate the state of the art of text mining and information extraction in the biological domain.

In the data preprocessing step, all words are converted to lower case, punctuation marks and stop words are removed, and no stemming is used. The two data sets are as follows.

(1) IAS Data Set. We obtain this data set from the Protein Interaction Article Subtask (IAS) of the BioCreAtIvE II challenge [29]. It is composed of the abstracts of 6,172 articles in total, taken from a set of MEDLINE articles annotated as interaction articles or not according to the guidelines used by the MINT and IntAct databases. There are 5,495 abstracts used as training data and 677 as test data, containing 3,536 and 338 interaction articles (i.e., positive examples), respectively.

(2) ACT Data Set. We obtain this data set from the PPI Article Classification Task (ACT) of the BioCreAtIvE III challenge [30]. The training set (TR) consists of a balanced collection of 2,280 articles classified through manual inspection, divided into PPI relevant and nonrelevant articles. The annotation guidelines for this task were refined iteratively based on feedback from both annotation databases and specially trained domain experts. The development (DE) and test (TE) sets take into account PPI relevant journals based on the current content of collaborating PPI databases. Random samples of abstracts from these journals were taken to generate a development set of 4,000 abstracts (628 PPI relevant and 3,318 nonrelevant) in total and a test set of 6,000 abstracts (918 PPI relevant and 5,090 nonrelevant). These two disjoint sets were drawn from the same sample collection.

Performance Measures. Since the application is restricted to IAC, which is a binary classification task, we measure the performance in terms of the F1 measure [20]. The F1 measure is determined by a combination of precision and recall. Precision is the percentage of documents classified as positive that are truly positive, while recall is the percentage of truly positive documents that are classified as positive. The precision, recall, and F1 are obtained as
$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$
where $TP$ is the number of positive documents that are correctly classified as positive, $FP$ is the number of negative documents that are misclassified as positive, $TN$ is the number of negative documents that are correctly classified as negative, and $FN$ is the number of positive documents that are misclassified as negative.
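For completeness, a direct implementation of these measures is given below; the toy labels in the usage line are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from binary labels (1 = PPI relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy check: returns (0.666..., 0.666..., 0.666...).
print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```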

3.2. Experimental Results on the IAS Data Set

First, we test all the feature selection methods with the SVM classifier on the IAS data set, where 29,979 features in total are extracted using the bag-of-words document representation. The proposed context similarity-based methods, $\mathrm{DF}_{CS}$, $\mathrm{GI}_{CS}$, $\mathrm{CDM}_{CS}$, and $\mathrm{Acc2}_{CS}$, are compared with the frequency-based methods, DF, GI, CDM, Acc2, TFIDF, and $\mathrm{GI}_{ntf}$, when the number of selected features is the top 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. Figure 1 shows the trend curves of all the feature selection methods; the optimal value of the context window size $W$ is 3, tuned through 10-fold cross-validation.

Figure 1 indicates that all these feature selection methods show a similar trend on the IAS data set and that the proposed methods are more effective. The context similarity-based methods and the term frequency-based methods achieve their best performance when around 4% of the top important features are selected, while the document frequency-based methods obtain their best performance when around 7-8% of the features are used. Moreover, the proposed methods outperform the other methods in selecting the top important features to achieve the best F1 measure. Among the context similarity-based feature selection methods, the best one acquires the highest F1 measure of 77.07 when the top 1300 features (4.3% of the total number of features) are selected, which improves on the F1 measure obtained when all the features are used (73.55) by 3.52.

Further, in order to study the performance of all these feature selection methods in more detail, a small feature set within the top 2000 is used. The corresponding F1 measure results when the top 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, and 1900 features are selected are shown in Table 1, where the best result for each feature set size is shown in bold. It can be seen from Table 1 that the context similarity-based methods outperform the methods based on document frequency or term frequency. The last column of Table 1 presents the best performance that each feature selection method can achieve, with the number of selected features at which the best performance is achieved given in parentheses. Compared with the four document frequency-based methods, the TFIDF and the $\mathrm{GI}_{ntf}$ perform better, which shows that term frequency is a relatively more important factor than document frequency. Moreover, all the context similarity-based methods achieve better performance with fewer selected features, and one of them performs the best overall on the IAS data set. Hence, the proposed methods can extract more effective information from the context similarity measure of term cooccurrences and chunks than from the document frequency or term frequency alone. This context information is helpful when measuring the importance of features to boost the performance.

3.3. Experimental Results on the ACT Data Set

Next, we test the proposed feature selection methods on the ACT data set, where 23,084 features in total are extracted using the bag-of-words representation, when the number of selected features is the top 0.5%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. Figure 2 shows the trend curves of the F1 measure versus different numbers of selected features. From Figure 2 we can see that when around 7% of the top important features are used, the proposed methods and the term frequency-based methods achieve their best performance, while the document frequency-based methods need to utilize more than 15% of the top features to achieve their best performance, which is less effective.

Then, for a more detailed study on a small feature set, Table 2 shows the F1 measure results when the number of selected features is 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, and 1900. The best result for each feature set size is shown in bold. It can be seen that on the ACT data set the performance of the context similarity-based methods is also better than that of their corresponding frequency-based methods. When the size of the feature set is 1700 (7.4% of the total number of features), the best context similarity-based method acquires the highest F1 measure of 59.97, which improves on the F1 measure obtained when all the features are used (57.12) by 2.85. Hence the context information of terms is helpful for feature selection in IAC applications.

We notice that there is a significant drop in performance from the IAS data set to the ACT data set, which stems from the fact that the training article collection is extracted from different online article sources than the test data sets and that the test data sets have a high class skew [30].

3.4. Analysis and Discussion

Comparison of the Selected Features. Besides the F1 measure results, we also analyze the effectiveness of the feature selection methods by studying the profiles of the selected features. The sorted lists of the top-10 features picked by each method on the IAS and ACT data sets are given in Tables 3 and 4, respectively. The features that are selected by all the methods are indicated in bold. These common features make the same contribution to the classification performance, such as "interact" in Table 3 and "interaction" in Table 4. Hence we compare the special features selected by the different methods. We note that there are two categories of special selected features, corresponding to the two different feature selection principles. The first category comprises features selected based on statistical frequency. These features obtain higher scores because more documents contain them or because they occur more frequently; however, the term cooccurrences and chunks within the documents are ignored. For example, the terms "protein" and "cell" are selected by all the frequency-based methods but by none of the context similarity-based methods on both the IAS and ACT data sets. Considering "protein," it is just used to describe different protein names and can appear anywhere in biological articles, resulting in a high document frequency or term frequency; yet it is not a distinctive feature for classifying PPI relevant or nonrelevant articles. If such irrelevant features are assigned higher scores by a feature selection method, the performance obtained with those features is degraded. On the contrary, these features are assigned lower values by our proposed methods, because the dissimilarity of their contexts between the PPI relevant and nonrelevant articles depresses their scores. The second category comprises features shared by the context similarity-based methods, such as the terms "activate" in Table 3 and "activity" in Table 4. Their evaluation scores are raised by the similarity of their contexts within the PPI relevant articles, which is important for the classification purpose.

In order to further study the proposed methods on the common and special selected features, the top 1000 features are selected on both data sets, respectively. We perform experiments on pairs consisting of one context similarity-based method and one frequency-based feature selection method. First, the features selected by both methods of a pair (the common features) are fed into the SVM classifier. Then the performance based on these common features is compared with the performance achieved based on all the top 1000 features selected by the context similarity-based method and by the frequency-based method, respectively. Our purpose is to reveal which kind of feature selection method increases the performance more with its special selected features. The results on the IAS and ACT data sets are listed in Tables 5 and 6, respectively. It can be seen that the increments of the context similarity-based methods are higher than those of the frequency-based methods, so the special features selected by the context similarity-based methods bring more distinctive information to the classifier on both data sets.
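The comparison protocol described above can be sketched as follows; evaluate_f1 is a hypothetical helper that trains and tests the classifier on a given feature set and returns its F1 measure.

```python
def common_vs_special_gain(cs_top1000, freq_top1000, evaluate_f1):
    """Compare one context similarity-based and one frequency-based method on the
    features they share versus their full top-1000 lists."""
    common = set(cs_top1000) & set(freq_top1000)
    f1_common = evaluate_f1(common)
    gain_cs = evaluate_f1(set(cs_top1000)) - f1_common      # increment from CS-specific features
    gain_freq = evaluate_f1(set(freq_top1000)) - f1_common  # increment from frequency-specific features
    return f1_common, gain_cs, gain_freq
```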

Dimension Reduction Rate. In addition to the F1 measure, the dimension reduction rate is another important aspect of feature selection. Therefore, dimension reduction is also studied in the experiments. To evaluate the dimension reduction rate together with the F1 measure, a scoring scheme from Gunal and Edizkan [31] is adopted, which combines, over a series of trials, the F1 measure of each trial with the dimension reduction achieved at that trial. Here $T$ is the number of trials, $F_{\max}$ is the maximum feature size, $F_i$ is the feature size at the $i$th trial, and $F1_i$ is the F1 measure of the $i$th trial. The feature sizes form the sequence 100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, and 1900, so $T$ is 10. The results of the dimension reduction analysis using this scoring scheme are presented in Table 7. It is apparent from the table that the context similarity-based feature selection methods provide better performance than the frequency-based methods.
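A sketch of such a scoring scheme is given below. It assumes the scheme averages, over the trials, the F1 measure weighted by the dimension reduction rate $(1 - F_i / F_{\max})$; the exact formula of Gunal and Edizkan is not reproduced here, and $F_{\max}$ is taken as the largest trial size.

```python
def dimension_reduction_score(feature_sizes, f1_values, f_max):
    """Combine each trial's F1 measure with its dimension reduction rate
    (1 - F_i / F_max) and average over the trials (an assumed formulation)."""
    assert len(feature_sizes) == len(f1_values)
    trials = len(feature_sizes)
    return sum(f1 * (1.0 - size / f_max)
               for size, f1 in zip(feature_sizes, f1_values)) / trials

# Toy usage with the trial feature sizes used in the experiments; the F1 values
# here are illustrative placeholders, not results from the paper.
sizes = [100, 300, 500, 700, 900, 1100, 1300, 1500, 1700, 1900]
f1s = [70.1, 72.3, 74.0, 74.8, 75.2, 75.6, 76.0, 75.8, 75.5, 75.1]
print(dimension_reduction_score(sizes, f1s, f_max=1900))
```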

4. Conclusions

In this paper, novel context similarity-based feature selection methods were introduced for text classification in the biological domain, to classify protein interaction articles. They assign importance scores to features based on the similarity of their context information within certain text categories. Using two different data sets, the performance of the proposed methods was investigated and compared against four document frequency-based and two term frequency-based methods. The effectiveness of the proposed methods was demonstrated and analyzed with respect to the F1 measure, the profile of the selected features, and the dimension reduction rate for the IAC task. Since IAC is a binary text classification task in the biological domain, we are also interested in the performance of the proposed methods when they are extended to multiclass problems. Hence, adapting the context similarity-based selection methods to multiclass classification problems remains an interesting future task.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors want to thank the anonymous reviewers for their helpful comments and suggestions. This work is supported by the National Natural Science Foundation of China (nos. 61202135 and 61402231), the Natural Science Foundation of Jiangsu Province (nos. BK2012472 and BK2011692), the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (no. 12KJD520005), and the Qing Lan Project.