Abstract

Support vector machines (SVMs) are a powerful technique for classification. However, SVMs are not well suited to the classification of large datasets or text corpora, because their training complexity is highly dependent on the input size. Recent developments in the literature on SVMs and other kernel methods emphasize the need to consider multiple kernels, or parameterizations of kernels, because they provide greater flexibility. This paper presents a multikernel SVM that manages highly dimensional data, provides an automatic parameterization with low computational cost, and improves results with respect to SVMs parameterized by a brute-force search. The model consists of spreading the dataset into cohesive term slices (clusters) to construct a defined structure (multikernel). The new approach is tested on different text corpora. Experimental results show that the new classifier has good accuracy compared with the classic SVM, while its training is significantly faster than that of several other SVM classifiers.

1. Introduction

The amount of information stored in public resources continues to grow. For example, the Medline bibliographic database, the most important source in the biomedical domain, has stored documents since 1950 and contains more than 22 million citations. Thus, in order to manage this volume of documents, the use of sophisticated computer tools must be considered.

In recent years, researchers have shown special interest in applying text mining techniques, such as pattern recognition, automatic categorization, or classification, to the field of biomedicine. In order to obtain good results, a unified data structure to represent documents must be established.

A well-known data structure supported by the scientific community is the sparse matrix [1], which is commonly used by classifiers as input data. In it, each document is decomposed into a vector of its most relevant terms (words).

Unfortunately, although an efficient data structure solves problems related to performance, other inconveniences related to the size of the corpora negatively affect classifiers and their accuracy. Data imbalance problems exist in a broad range of experimental data and have captured the attention of researchers [2, 3]. Data imbalance occurs when the majority class in a document corpus is represented by a large portion of documents, while the minority class has only a small percentage [4]. When a text classifier encounters an imbalanced document corpus, the performance of machine learning algorithms often decreases [5–8].

Another important situation in a classification process, which can render the problem unmanageable, is related to the dimensionality of the sparse matrix. The matrix dimension is directly connected to the number of attributes (terms) of the documents included in it, affecting the performance of the classifier and attaching a high computational cost. At this point, algorithms to select relevant terms from the whole data structure must be considered. As a result, an optimized sparse matrix is generated.

Regarding classifiers, the support vector machine (SVM) [9–12] is one of the most well-known classification techniques used within the scientific community. It obtains good results in a variety of classification problems, although it is difficult to determine its parameterization with imbalanced data. An SVM classifier uses a kernel function to transform the data and change the workspace, separating relevant from nonrelevant documents. Taking into account that some kernels have additional parameters that must be selected, the parameterization of an SVM has a high cost.

As with other classifiers, SVMs are not suitable for classifying large datasets due to their high training complexity. Support vectors are internally computed to represent the dataset; this helps to find a hyperplane that separates the contents of each class. The complexity of an SVM is given by the number of support vectors needed to obtain the hyperplane. Data dimensionality negatively affects the kernel coverage, such that a single kernel may not be enough to obtain an optimal division between classes.

One solution is to divide the dataset into small portions, attaching a specific kernel to each slice, decreasing the training complexity, and improving classification results. This idea is known as a SVM based on a multikernel transformation.

Multikernel algorithms combine predefined kernels in order to obtain accurate hyperplanes. These kernels and their parametrization are usually determined by different learning methods. However, there is no efficient learning method that covers all classification scenarios, because the choice is highly dependent on the field of study. Gönen and Alpaydın [14] categorize existing multikernel algorithms according to their learning methods and properties.
(i) Fixed rules are functions which combine multiple single kernels, grouping them as sums or products and working over the data slice by slice [15, 16]. Kernels are usually unweighted and do not need any training before being applied. However, other approximations include several coefficients to weigh each term in order to penalize some multikernel parts. Even so, coefficient values are adjusted based on empirical results or brute-force algorithms.
(ii) Heuristic approaches combine the idea behind fixed rules with weighing each multikernel term under the best coefficient values [17, 18]. These values are usually determined by unsupervised algorithms such as ID3 trees, hierarchical clustering, or self-organizing maps, among others, which may be applied separately (term by term) or over all of them at once. In almost all cases, the search space is extremely wide (original or feature), turning the scenario into an NP-complete problem. Thus, the computational cost and the system performance must be taken into account.
(iii) Optimization approaches consist in providing optimal values for the kernel function parameters. Usually based on external models, this optimization can be integrated as part of a kernel-based learner or reformulated as a different mathematical model that obtains the parameter values and then parameterizes the learner [19, 20].
(iv) In Bayesian approaches, kernels are combined and interpreted as probabilistic variables. These parameters (in kernels) are used to perform inference for learning them and the base learner parameters. Bayesian functions measure the quality of the resulting kernel function constructed from candidate kernels using a Bayesian formulation. In general, the likelihood or the posterior is used as the target function to find the maximum likelihood estimator and then obtain the model parameter values [21, 22].
(v) Boosting approaches, inspired by ensemble algorithms, combine weak learning models to produce a new, more complex strong one [23]. A set of pairwise SVM-kernels may be configured and trained separately to obtain a final voting result in the testing stage. There are different ways in which the combination can be done, including the previous approaches. The models may be predefined, or new kernels may be added until the performance stops improving [23, 24].

In this paper, we present a multikernel SVM that manages highly dimensional data, providing an automatic parameterization with low computational cost and improving results with respect to SVMs parameterized by a brute-force search.

The remainder of the paper proceeds as follows. The general text classification model is described in Section 2. The proposed model is presented in Section 3, highlighting and explaining the differences with respect to the previous section. The analysis of the experimental tests and comparative results with other authors is shown in Section 4. Finally, the most relevant conclusions are collected in Section 5.

2. Text Classification

Text classification is focused on assigning a class to each document of a corpus. Thus, a class encloses those documents which are representative from a specific topic. The class assignment can be performed manually or automatically.

In general, the text classification process includes a set of steps, as shown in Figure 1. These steps are detailed in the next subsections.

2.1. Document Processing

During the first step, each document in the corpus is processed to extract its most representative keywords (terms). As each term has a different relevance when it is used to describe the document content, a numerical weight is assigned. This weight quantifies the importance of the term for describing the document's semantics. Moreover, a data normalizing process is used to transform term weights into a new unified value range, with TF-IDF (term frequency-inverse document frequency) being the most used normalization scheme [1].

As a result, each document is represented by an $m$-dimensional vector (instance), where $m$ is the total number of terms in the corpus, together with an associated class (relevant or nonrelevant); see (1) (term vectors for a document corpus). The similarity between two documents is computed based on the distance between their representative vectors:
$$d_i = (w_{i1}, w_{i2}, \ldots, w_{im}), \quad i = 1, \ldots, n. \tag{1}$$
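As an illustration, the sketch below builds such TF-IDF-weighted document vectors with scikit-learn's TfidfVectorizer rather than the authors' pipeline; the two example documents and their relevance labels are hypothetical placeholders, not part of the corpora used later.

# Minimal sketch of the vector-space representation with TF-IDF weighting.
# The documents and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "gene expression in tumor cells",
    "allele frequency in a population study",
]
labels = [1, 0]  # 1 = relevant, 0 = nonrelevant

vectorizer = TfidfVectorizer()           # builds the sparse document-term matrix
X = vectorizer.fit_transform(documents)  # shape: (n_documents, n_terms)

print(X.shape)                           # each row is the m-dimensional vector of a document
print(vectorizer.get_feature_names_out())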

2.1.1. Stemming and Stopwords

In many cases, irrelevant terms are included in the sparse matrix, thus degrading the classification results. In order to partially remove this noise, stemming techniques and stopword removal are used.

Stemming techniques [25] morphologically identify terms and their variants (nouns, adjectives, adverbs, etc.) and reduce the data dimensionality through a step called conflation. That is, the stem of each term is extracted and a matching process is applied to fuse or combine the terms, avoiding variants in the final representation.

Stopword lists [26] are wordlists composed of irrelevant terms such as articles, determiners, or interrogative particles. These terms are usually excluded during the document matrix generation.

In this way, combining stopword filtering and stemming techniques helps to avoid nonuseful terms and significantly improves information retrieval systems and their results.

2.2. Manage Data Dimensionality

In general, the use of stemming and stopword removal is not enough to obtain a good document classification for huge datasets. Thus, in a postprocessing step, algorithms and techniques focus on reducing, compacting, or transforming the matrix content. Normally, two approaches are considered.
(i) Instance filtering is focused on balancing the number of instances (documents) in each class (topic), taking into account their difference factor. In some cases, imbalance problems may negatively affect the classification process, causing overfitted models.
(ii) Attribute (term) selection algorithms transform and, in some cases, remove current terms in the document matrix in order to reduce its size and computational cost.

2.2.1. Instance Filtering

The data imbalance problem appears when a majority class, usually the negative class, contains many more instances than the other class [2–4]. When a text classifier encounters an imbalanced document corpus, the machine learning performance often decreases [5–8].

Instance filtering represents a powerful tool against overfitting with regard to a specific class (the majority class in almost all cases). Two well-known techniques, oversampling and subsampling, may be applied to texts to redistribute each class and correct the imbalance [27, 28].

The subsampling technique removes instances from the majority class, taking into account a difference factor with respect to the minority class. A random algorithm is usually used to select which instance is removed until the redistribution factor is reached. Similarly, the oversampling technique adds new or replicated instances to the minority class until the difference factor with the majority class is reached. As with subsampling, a random algorithm is used to select which instances are the basis of the replication process.

Finally, both techniques can be applied simultaneously, increasing instances in the minority class and decreasing them in the majority class. This process is known as resampling [29]. The Weka library [30], used in this study, provides algorithms which implement these techniques.
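A minimal sketch of the random subsampling idea is given below; it is written in Python for illustration only and does not correspond to the Weka implementation used in this study.

# Illustrative sketch of random subsampling: instances of the majority class
# are randomly discarded until a 1:1 distribution with the minority class is reached.
import numpy as np

def subsample(X, y, majority_label, seed=0):
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # randomly keep as many majority instances as there are minority instances
    kept_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]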

2.2.2. Attribute Selection

Data sizes can be optimized by trying to find the most relevant attributes (terms) in a dataset. Attribute selection algorithms are focused on the relevance of a term in a document, class, or both, removing, merging, and/or transforming those terms that are less important and generating a new dataset. Therefore, an attribute selection task pursues the following goals: (i) to reach better classification results, (ii) to generate more efficient models, and (iii) to reduce the data dimensionality and therefore computational costs.

The Weka library provides algorithms for the attribute selection. Some of them were previously analyzed by the authors [31]. In this work, we apply the principal component analysis (PCA) algorithm [32]. PCA looks for linear combinations between attributes to remove their individual dependency (noise) and to reduce the original data.

2.3. Train and Prediction

Once the document matrix is built and optimized, it can be used as input to a classifier. Training and prediction involve two complex steps: (i) choosing a classifier, in which a model must be selected, trained, and tested, and (ii) parameter tuning, involving algorithms and techniques to fit the classifier parameters and obtain better results.

2.3.1. Choosing a Classifier

In the last step of the process, a reasoning model is selected to classify those documents contained in the dataset as relevant and nonrelevant.

To perform this task, several algorithms supported by the scientific community were analyzed: k-nearest neighbor [4, 7], naive Bayes [7, 33], and SVM [6, 34]. Finally, we chose the SVM classifier because it obtains the best results with regard to text classification [7, 35, 36].

2.3.2. Understanding the Support Vector Machines

SVMs were developed from the theory of statistical learning and structural risk minimization [12, 37]. In almost all cases, linear or nonlinear, a new decision surface is calculated by mapping the input space through a function in which samples become separable. Thus, the idea behind SVMs consists of discovering a hyperplane that discriminates positive from negative samples (relevant and nonrelevant documents).

To understand how it works, consider a separable training set $T = \{(x_i, y_i)\}_{i=1}^{n}$ in the input space (with the mapping $\phi$ equal to the identity function in the linear case), with $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, +1\}$, and a linear decision function $f(x) = \langle w, x\rangle + b$, enclosed by support vectors defining the maximum margin between positive and negative samples, where $b$ is the bias (hyperplane offset) determined by the Karush-Kuhn-Tucker conditions.

In order to obtain an optimal hyperplane, a quadratic programming optimization must be considered:
$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \tag{2}$$
where $\xi_i$ is a slack variable (computed during the optimization) which serves to control training errors and keep the constraints satisfied, $C$ is the trade-off parameter controlling the compromise between margin maximization and violated restrictions (soft margin), and $w$ and $b$ are the equation coefficients. The class for a document $x$ is determined by the sign of (3), where $N_{SV}$ is the number of support vectors previously calculated in (2):
$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{N_{SV}} \alpha_i y_i \langle x_i, x\rangle + b\right). \tag{3}$$

For nonlinear cases (where the mapping $\phi$ is not trivial), input samples are mapped to a feature space, dimensionally higher than the original one, where a linear separation may be feasible (see Figure 2). The mapping process is achieved by applying a nonlinear kernel function over each pair of vectors.

As a result, a linear solution is discovered by getting the optimal hyperplane and solving the mentioned nonlinear case [38].

However, given the possibility of an infinite-dimensional kernel space, the nonlinear mapping function $\phi$ cannot be formulated explicitly. A solution consists of expressing the matrix operations in the kernel space as dot products in the input space, the so-called kernel trick [37]. Therefore, (3) is reformulated as follows to include the kernel mapping:
$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{N_{SV}} \alpha_i y_i K(x_i, x) + b\right). \tag{4}$$
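The following sketch illustrates the kernelized decision function of (4); the support vectors, labels, coefficients, and the RBF kernel used here are hypothetical values chosen only for illustration.

# Sketch of the kernelized decision function (4): the class of a sample x is the
# sign of a weighted sum of kernel evaluations against the support vectors.
import numpy as np

def rbf(u, v, sigma=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

def decision(x, support_vectors, y_sv, alpha, b, kernel=rbf):
    s = sum(a * y * kernel(sv, x) for sv, y, a in zip(support_vectors, y_sv, alpha))
    return np.sign(s + b)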

2.3.3. Parameter Tuning

The classification process based on SVM is usually supported by several kernels [9]: linear (5), radial basis function (RBF) (6), or sigmoid (7):
$$K(x_i, x_j) = \langle x_i, x_j\rangle, \tag{5}$$
$$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \tag{6}$$
$$K(x_i, x_j) = \tanh\bigl(\gamma\langle x_i, x_j\rangle + r\bigr). \tag{7}$$

Some kernel functions, such as the RBF or sigmoid kernels, provide extra parameters to improve their transformation surface, making it more suitable for the dataset morphology. Unfortunately, obtaining an optimal configuration for these parameters usually results in an NP-complete problem, requiring additional computational cost. Well-known solutions, restricted in almost all cases to a range of values, are brute-force search, heuristic methods, or genetic algorithms [35, 36].
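As an illustration of such a brute-force search, the sketch below runs a grid search over assumed C and gamma ranges with scikit-learn; the ranges and the choice of library are assumptions, not the exact setup used in this paper.

# Hedged sketch of a brute-force (grid) search over SVM parameters, in the
# spirit of the LibSVM practical guide. The grids are illustrative assumptions.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [2 ** k for k in range(-5, 6, 2)],       # soft-margin trade-off
    "gamma": [2 ** k for k in range(-15, 4, 2)],  # RBF width (gamma = 1 / (2 * sigma^2))
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5)
# search.fit(X_train, y_train) would exhaustively evaluate every combination
# (X_train and y_train are hypothetical placeholders).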

2.3.4. Kernel Behavior

In order to understand which ranges of values are the most suitable in each case, the kernel behavior must be analyzed. As an example, the behavior of the RBF kernel is considered here. Figure 3 describes how the similarity between three vectors is measured using an RBF kernel, where two of them are much more similar to each other than to the third. Sigma values and the search space were obtained based on practical guides and empirical tests [9, 36, 39, 40].

In practice, distances between samples are estimated using the same $\sigma$ value for all cases, so it must be carefully selected.
(i) Kernel values close to 1 mean that the samples are in the same class. Otherwise, values close to 0 mean that the samples are in different classes.
(ii) If cosine values are close to 1, the samples are very similar in the feature space. Otherwise, if cosine values are close to 0, the samples are very dissimilar in the feature space.
(iii) If $\sigma$ values are gradually increased, the angle between similar vectors indicates that they remain closer to each other in the feature space than dissimilar ones. Therefore, as $\sigma$ increases, the first angle increases less than the second one (see Figure 4).

Following the previous criteria, an intermediate $\sigma$ value represents the best choice to compute the similarity for all vectors. In our tests, the $\sigma$ values considered were those that maintain both the smallest distance for the similar pair and the largest distance for the dissimilar pair at the same time.
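The following small numeric example, with three hypothetical vectors, illustrates this behavior: for very small sigma all similarities collapse toward 0, for very large sigma they collapse toward 1, and only an intermediate sigma keeps the similar pair near 1 and the dissimilar pair near 0.

# Numeric illustration of the RBF behavior discussed above.
import numpy as np

def rbf(u, v, sigma):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

x1, x2, x3 = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])
for sigma in (0.05, 0.5, 5.0):
    # prints the similarity of the close pair (x1, x2) and the distant pair (x1, x3)
    print(sigma, rbf(x1, x2, sigma), rbf(x1, x3, sigma))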

3. Proposed Classification Model

In this section, we introduce a novel text classifier based on an SVM over a set of modified RBF kernels. It is developed to manage highly imbalanced data, to parameterize itself automatically with low computational cost, and to improve results with respect to a brute-force search.

The idea behind the model consists of spreading the dataset into cohesive term slices (clusters) to construct a defined structure. Each cluster is attached to an RBF kernel and the remaining terms (those not considered for clustering) are enclosed in a linear kernel, creating a multikernel model.

The model was developed to solve cases in which a dataset contains very similar samples for both classes, such as scientific corpora, making it difficult to obtain good results using conventional kernels on SVM classifiers.

Figure 5 shows a brief schema of the architecture.
(1) In the first step, we use a dataset to generate a document matrix using the vector model. To identify the most relevant terms, a stemming algorithm (Lovins stemmer [25]) and a stopword list extracted from the GATE tool [41] are used. In addition, TF-IDF normalization is used to weight terms based on their frequencies.
(2) In the second step, data dimensionality is managed through principal component analysis (PCA). Terms are compacted using linear combinations between them. As a result, a matrix containing the new subset is generated.
(3) The third step starts by transposing the term matrix to build a hierarchical clustering for each available linkage method (see [37, 42] for more details about hierarchical clustering). It continues by analyzing the resultant hierarchies with the cophenetic matrix to determine which hierarchy best fits the matrix. Finally, the Kolmogorov-Smirnov normality test [43] is applied to each cluster and hierarchy level to determine the optimal cut.
(4) The last step consists of building the multikernel Linear-RBF (MLRBF) on the existent clusters from the optimal cut.

All these steps are optimized in order to reduce the computational cost and improve the results.

The next subsections explain each step of the process. The matrix generation is omitted or simply referenced.

3.1. Principal Component Analysis

Principal component analysis (PCA) [32] is commonly used in text mining to reduce the data dimensionality with a minimum risk of information loss. Dimensionality reduction is accomplished by choosing the eigenvectors which retain a certain percentage of the variance (based on their eigenvalues) with respect to the original data and transforming it.

As a result, the document matrix is reduced according to the linear combination of the most representative terms (the most dispersed), transforming the input space into a smaller one. The new terms are known as principal components.

In this paper, PCA is not used only as a data reduction algorithm. Its internal process looks for linear combinations between terms, producing components which may follow a normal distribution. This means that, for relevant documents, there are terms which are approximately normally distributed. However, this is not the case for the irrelevant documents.

As a consequence, terms may be agglomerated into cohesive groups (clusters) causing the matrix fragmentation (Figure 6), allowing a better adjustment in the next steps.

The steps to perform PCA are listed below; a minimal sketch of these steps is given after the list.
(1) Terms are standardized (zero mean and unit variance; see (8)) to ensure the independence of each resultant component:
$$z_{ij} = \frac{w_{ij} - \bar{w}_j}{s_j}, \tag{8}$$
where $\bar{w}_j$ is the arithmetic mean of term $j$ and $s_j$ is the standard deviation of the frequencies of term $j$ in the corpus.
(2) Once the terms are standardized, a correlation matrix is computed and its eigenvalues and eigenvectors are obtained.
(3) Terms are sorted in descending order according to their eigenvalues.
(4) Terms whose variance (eigenvalue) contribution falls below the 90% threshold are discarded [44].
(5) The terms that are not discarded (principal components) are used to transform the input space based on their eigenvectors.
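The sketch below follows these steps with NumPy; it is an illustrative reimplementation, not the Weka routine used by the authors.

# Illustrative PCA over the standardized term matrix: standardize, eigen-decompose
# the correlation matrix, and keep the components covering 90% of the variance.
import numpy as np

def pca_components(X, variance_threshold=0.90):
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # step 1: standardize terms
    corr = np.corrcoef(Z, rowvar=False)                # step 2: correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]                  # step 3: sort by eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = np.searchsorted(explained, variance_threshold) + 1  # step 4: keep 90% of variance
    return Z @ eigvecs[:, :k], eigvals[:k]             # step 5: project the data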

3.2. Clustering

To divide the matrix into cohesive parts, several agglomeration (clustering) techniques have been analyzed.

Well-known solutions such as k-means [45] or COBWEB [42] were discarded due to their initial parameterization requirements. Hierarchical techniques [37] are the most adequate for our problem because they are not subject to initial requirements on the cluster morphology, making it possible to analyze the clusters to determine which size and number are best for an optimal process.

The following subsection helps to get a better understanding of hierarchical clustering and the algorithms that have been developed to analyze the output.

3.2.1. Hierarchical Clustering

In hierarchical clustering, entities are agglomerated into groups (clusters) and hierarchically ordered as a heap structure (see Figure 7). Each upper level of the structure contains more components than the previous one due to the fact that clusters are fused into new ones, thus increasing their size. To determine which clusters are the most suitable, each level of the structure is usually analyzed by algorithms that determine the optimal cut.

Each fusion level is determined by a linkage algorithm which selects the most suitable clusters. Special care is taken in choosing an appropriate linkage method, since it directly affects the final cluster set. In order to do so, a correlation analysis between the original dataset and the resultant hierarchical clustering is performed through the calculation of the cophenetic coefficient [46]. Thus, all linkage methods may be measured to obtain the best one from them.

Several linkage methods, such as complete, single, average, median, or Ward [47], were tested. Complete linkage (see (9)) is the most suitable method for our datasets; its search space is focused on distant clusters, attempting to avoid the local minima problem:
$$D(C_a, C_b) = \max_{t_i \in C_a,\; t_j \in C_b} d(t_i, t_j), \tag{9}$$
where $d(t_i, t_j)$ is the distance between the terms $t_i$ and $t_j$, and $C_a$ and $C_b$ are term clusters.

A linkage algorithm also needs a metric to measure distances between terms. In this paper, Euclidean distance [47] is selected based on empirical results.

Regarding software, a hierarchical clustering algorithm was implemented (see Algorithm 1) in order to include a cophenetic analysis to determine the best linkage method and to calculate the optimal cut. The algorithm iterates over each linkage method, computing its associated hierarchy and comparing the correlations through a cophenetic analysis. Once the best linkage method has been identified, its resultant hierarchy is returned.

Input: D is a dataset matrix
    LMS is a set of linkage methods
Result: Best agglomeration correspondence
(01) Truncate (D);
(02) dmatrix ← CalculateDistances (D);
(03) bestAgglomeration ← ∅;
(04) ccoef ← 0;
(05) foreach linkage method lm of the set LMS do
(06)   linkageLevelList ← ∅;
(07)   while Size (dmatrix) > 0 do
(08)    linkageResult ← ComputeLevelLinkage (lm, dmatrix);
(09)    linkageLevelList ← linkageLevelList ∪ {linkageResult};
(10)    UpdateDistanceMatrix (dmatrix, linkageResult);
(11)   Restore (dmatrix);
(12)   ccoefAux ← CompCopheneticMatrix (linkageLevelList, dmatrix);
(13)   if ccoef < ccoefAux then
(14)    ccoef ← ccoefAux;
(15)    bestAgglomeration ← linkageLevelList;
(16) return bestAgglomeration

This implementation includes some operations such as an improved internal distance matrix structure (line 2), logic to restore the distance matrix (line 11), or manageable structures to store each fused level of the hierarchy. These basic operations were crucial to construct a competitive system, reducing the elapsed time needed to build the model.
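For illustration, the same selection criterion can be sketched with SciPy's hierarchical clustering routines; this is a simplified stand-in for Algorithm 1, not the authors' implementation.

# Hedged sketch: pick the linkage method whose hierarchy has the highest
# cophenetic correlation with the original Euclidean distances.
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def best_linkage(term_matrix, methods=("single", "complete", "average", "median", "ward")):
    distances = pdist(term_matrix, metric="euclidean")
    best_method, best_coef, best_tree = None, -1.0, None
    for method in methods:
        tree = linkage(distances, method=method)
        coef, _ = cophenet(tree, distances)   # correlation with the original distances
        if coef > best_coef:
            best_method, best_coef, best_tree = method, coef, tree
    return best_method, best_tree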

3.2.2. Optimal Cutting Selection

Once the hierarchy is built, it is analyzed in order to determine the optimal level of the final agglomeration.

Although several well-known solutions [47] may be adapted to produce a feasible solution, some requirements about cluster morphology are not totally satisfied.
(i) A final cohesive slice is considered a normal multivariate cluster if all of its components have a normal distribution.
(ii) An optimal level is composed only of final cohesive slices.
(iii) Large clusters have a smaller probability of following a multivariate normal distribution than small ones.
(iv) The size and number of the final clusters must be controlled.
(v) An optimal cut composed of very small clusters provides a better adjustment of the results (overfitting problem) but increases computational costs. On the contrary, taking only large clusters into account, computational costs are greatly reduced, but the fit is poorer.

Therefore, we present a new algorithm to divide the matrix into cohesive slices and support these requirements (see Algorithm 2).

Input: LL is the best agglomeration
Result: Optimal hierarchy level
(01) optimalLevel ← ∅;
(02) foreach level l in LL do
(03)  counter ← 0;
(04)  foreach cluster c in l do
(05)   test ← MultiLillieforsTest (c);
(06)   if test is true then
(07)    counter ← counter + 1;
(08)  aux ← NumClusters (l);
(09)  if aux = counter then
(10)   optimalLevel ← l;
(11)  else
(12)   break; //Breaks outer loop
(13) return optimalLevel

To evaluate each cluster, a Kolmogorov-Smirnov test with Lilliefors correction [43] was used. Thus, each term in a cluster is checked for a normal distribution, assuming that a multivariate cluster is one in which each term is considered normally distributed ($p$ value greater than 0.05) [44]. In other words, there is no evidence to reject the null hypothesis for any term, and therefore the cluster may be considered a normal multivariate distribution.
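A sketch of this per-cluster check, using the Lilliefors test from statsmodels as a stand-in for the authors' implementation, could look as follows.

# A cluster is accepted when every one of its terms passes the Lilliefors
# (Kolmogorov-Smirnov) normality test at the 0.05 level.
from statsmodels.stats.diagnostic import lilliefors

def cluster_is_multivariate_normal(cluster_columns, alpha=0.05):
    # cluster_columns: 2D array with one column per term of the cluster
    for column in cluster_columns.T:
        _, p_value = lilliefors(column, dist="norm")
        if p_value <= alpha:          # evidence against normality for this term
            return False
    return True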

3.3. Training and Prediction

This last step is focused on parameterizing each kernel portion for training and prediction scenarios.

Our implementation was built on the LibSVM library [9]. Several parameters were set by taking into account the LibSVM practical guide [40]. The cost parameter was set to 1 to obtain a small margin, minimizing the trade-off with wrongly classified samples.

On the other hand, the kernel (10), defined as a composition of RBF and linear kernels (see Section 2.3), was parameterized based on the resultant clusters from the optimal hierarchy level:
$$K_{MLRBF}(d_i, d_j) = \sum_{c=1}^{p} K_{RBF}^{\sigma_c}\!\left(d_i^{(c)}, d_j^{(c)}\right) + \left\langle d_i^{(r)}, d_j^{(r)}\right\rangle, \tag{10}$$
where $p$ is the number of clusters in the optimal level, $K_{RBF}^{\sigma_c}$ is the RBF kernel function attached to cluster $c$, $d^{(c)}$ denotes the terms of cluster $c$, and $d^{(r)}$ denotes the terms not grouped in any cluster.

Recalling the main idea about identifying cohesive slices to divide the matrix, each slice (cluster) is composed of normal distribution terms and defined as a common pattern (multivariate normal distribution). Clusters provide a simple way to parameterize RBF kernels in the proposed multikernel (see Figure 8).

Each $\sigma_c$ parameter corresponding to an RBF kernel is computed by taking into account the maximum eigenvalue of its associated cluster; specifically, for each term we compute its eigenvalue and finally take the maximum per cluster (see (11)). Note that all resultant values have been normalized, enclosing them in a defined range interval according to the explanation in Section 2.3, to prevent absolute eigenvalues and obtain better results:
$$\sigma_c = \operatorname{norm}\!\left(\max_{t \in C_c} \lambda_t\right). \tag{11}$$
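A minimal sketch of this parameterization is shown below; the rescaling interval is an assumption chosen only for illustration.

# One sigma per cluster: the maximum eigenvalue of the terms grouped in that
# cluster, rescaled to a bounded working interval (bounds are assumptions).
import numpy as np

def cluster_sigmas(eigenvalues, clusters, low=0.1, high=10.0):
    raw = np.array([eigenvalues[cols].max() for cols in clusters])
    # normalize the per-cluster maxima into [low, high] to avoid absolute eigenvalues
    scaled = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)
    return low + scaled * (high - low)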

On the other hand, those terms which are not included in any cluster are enclosed in the linear kernel.

In addition, some changes in the input data were performed in order to precompute the multikernel output matrix due to the complexity of the proposed model.

During training, each member of the kernel matrix is computed by applying the MLRBF function over all pairs of training document vectors. The testing step differs in that each test sample is computed against each training vector. To illustrate this, Figure 9 shows how to compute the kernel matrix in a training scenario.
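The following sketch illustrates how such a precomputed MLRBF matrix could be assembled and fed to an SVM with a precomputed kernel; the cluster layout, the variable names, and the use of scikit-learn are illustrative assumptions rather than the authors' exact implementation.

# Precompute the MLRBF kernel matrix of (10): one RBF term per cluster slice
# plus a linear term over the remaining (nonclustered) features.
import numpy as np
from sklearn.svm import SVC

def mlrbf_matrix(A, B, clusters, sigmas, rest):
    # A: (n_a, m) samples, B: (n_b, m) samples, clusters: list of column-index
    # arrays, sigmas: one sigma per cluster, rest: columns outside every cluster
    K = A[:, rest] @ B[:, rest].T                        # linear part
    for cols, sigma in zip(clusters, sigmas):
        diff = A[:, cols][:, None, :] - B[:, cols][None, :, :]
        sq_dist = (diff ** 2).sum(axis=-1)
        K += np.exp(-sq_dist / (2 * sigma ** 2))         # RBF part of this cluster
    return K

# training (hypothetical data): K_train = mlrbf_matrix(X_train, X_train, clusters, sigmas, rest)
#                               model = SVC(kernel="precomputed", C=1).fit(K_train, y_train)
# testing:                      model.predict(mlrbf_matrix(X_test, X_train, clusters, sigmas, rest))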

4. Results and Discussion

This section includes several tests of preconfigured SVM classifiers and the proposed multikernel from different viewpoints: classification results, model building time, and a comparison with other authors.

To evaluate the effectiveness of the model, some statistical measures were used: recall (fraction of relevant documents that are correctly classified), precision (fraction of documents correctly classified as relevant), F-measure (harmonic mean between recall and precision), and the kappa statistic (which takes the output confusion matrix of an evaluation and reduces it to one value).

As input dataset, the TREC Genomics 2005 corpus [13] was chosen due to the similarity between relevant and nonrelevant documents, which offers more realistic classification scenarios. In 2005, the TREC committee provided a set of evaluation tasks to obtain valuable knowledge in biological fields by applying information extraction techniques.

The track is divided into two tasks. One of them consists in categorizing documents according to different criteria (allele, expression, gene ontology annotation, and tumor) in the genomics domain. Thus, an ad hoc collection, extracted from 4,591,008 MEDLINE records, was prepared by experts to support the task [13]. Having four criteria, the resultant records were reorganized to generate each respective corpus (see Table 1).

Regarding the imbalance problem, only the allele and GO annotation (GoA) corpora contain enough documents per class to perform our tests applying the subsampling technique. The expression and tumor corpora contain few relevant documents, and oversampling techniques or similar are needed, as used in [13].

Documents were processed to obtain a suitable structure for SVM classifiers (see Section 3). In addition, PCA was used to determine linear combinations between terms, reducing their number considerably, and a random subsampling technique was used to filter instances. The randomly generated subsets contain a uniform distribution (1:1), that is, the same number of documents per class. As shown below, this produces different results because samples are not removed in the same order each time the technique is applied. Thereby, we generated 10 datasets per corpus (allele and GoA) in order to obtain trustworthy results.

Regarding the parameterization, our multikernel contains internal procedures to determine the most suitable value for each kernel parameter during the classification process (see the previous sections). As the preconfigured classifiers do not provide automatic methods to obtain a suitable configuration, the sigmoid and RBF parameters (gamma and sigma, resp.) were determined by brute force. Brute force was implemented as a grid search, included in the LibSVM library [40], over a set of predefined value ranges. Gamma values (sigmoid case) were reduced to a narrower range, taking into account the performed empirical tests. In addition, the SVM cost parameter was set to 1 for the MLRBF case, to obtain a small-margin hyperplane due to its high precision. For the preconfigured kernels it was determined by the previous grid-search algorithm over a soft-margin range.

Table 2 shows a comparison between the preconfigured classifiers and our model. Precision, recall, kappa, and F-measure statistics were considered to measure the quality of each model. Results were grouped by their minimum (Min), maximum (Max), and average (Avg) values due to the number of test cases per classifier.

Tests were performed on an Intel Core i7 at 3.8 GHz with 8 GB of RAM. They were restricted to one execution thread because the preconfigured classifiers in LibSVM [9] are implemented under a single execution thread.

From the results, we can conclude that the multikernel obtains competitive values on both corpora and presents a steady behavior in almost all situations compared to the other classifiers. The average statistics show that the multikernel obtained the most stable results. As the subsampling technique produces random datasets, if the instances are easily separable, single kernels may obtain results similar to those of our proposed multikernel. However, if the instances of different classes are too close, a hyperplane may be extremely difficult to trace with only one kernel. Thus, the best results on average statistics such as recall (0.823), precision (0.893), or F-measure (0.858) were achieved by our model on both corpora.

Table 2 also helps to identify which kernels obtained maximum or poor results per statistic. Concerning the allele case, the sigmoid kernel obtained some maximum results on the F-measure (0.888) and recall (0.882) with gamma values close to zero, but its average and minimum statistics are lower than those obtained by the multikernel. The behavior of the linear kernel was very similar to that of our model, obtaining a maximum peak on precision (0.913). Finally, the RBF kernel obtained the worst results for the selected sigma values.

Regarding the GoA corpus, the best results were achieved by MLRBF in almost all cases, except recall (Max case), demonstrating that, on corpora which are not easily separable in the original space (linear case), our multikernel offers results with high values in the precision and recall statistics, as a consequence of a very accurate hyperplane. We would like to clarify that the RBF kernel obtained a high value on the recall statistic at the expense of the other statistics, achieving poor results on the kappa statistic, which means that the classes were not correctly trained. Therefore, we can conclude that the RBF kernel is not suitable for the GoA corpus.

Table 3 shows the time needed to build a model with respect to a specific corpus, broken down by folds.

As shown, the building time for the linear and MLRBF kernels presents a similar steady growth because the parametrization time of our model is almost constant and linear kernels have no parameters. Even so, the time needed to configure a linear SVM is around one-third less than that of MLRBF on allele, and slightly less on GoA.

On the other hand, the RBF and sigmoid kernels carry a high runtime cost because their parameterizations are determined by an external technique. Note that the brute-force cost increases with the size of the search space.

To conclude, the MLRBF kernel offers a good solution with low runtime cost compared with the RBF or sigmoid kernels, but the linear kernel still has the lowest cost.

As a final analysis, a comparison with other authors is included. The TREC 2005 overview [13] offers a wide explanation of the tasks, tests, and results obtained in the competition. Note that the categorization task is focused on maximizing the number of relevant documents that are correctly classified (recall), putting aside other statistics like precision or F-measure. In order to compare our system against other authors, we reorganized the results based on the F-measure (F-score) statistic.

Table 4, extracted from the TREC 2005 overview, shows a comparison against other works of the conference. As seen, MLRBF offers trustworthy results with high precision (0.9102) on allele and 0.780 on GoA.

Several solutions at TREC were developed as statistical systems based on semisupervised learning, modulating the original dataset under the MEDLINE MeSH domain. The best results of the other authors were achieved only when the MeSH domain was used and no other data transformation was considered. Otherwise, applying other domains and building balanced systems, their results were similar to those of our system.

In conclusion, MLRBF may offer good results in general scenarios, even when a specific term domain is not specified.

5. Conclusions and Future Work

In this research, we present a new multikernel for SVM classifiers. The model divides the dataset into small portions and assigns an independent kernel to each one, adjusted to take into account the content of its slice.

The multikernel offers a stable behavior, thus avoiding some difficulties arising from text datasets. It achieved the best average results compared to the other classifiers, as well as some maximum peaks in precision and F-measure on the allele corpus. Regarding GoA, our model obtained the best results in almost all statistics except for recall, in which the RBF kernel obtained better results at the expense of the other statistics.

With regard to runtime cost, the multikernel approach shows a steady growth curve similar to that of the linear kernel. Even so, MLRBF needs less time to completely build the model than RBF or sigmoid.

Finally, we compared the novel model against other existing works in the TREC 2005 competition, concluding that our system produces results with high precision and F-measure even though a specific domain was not specified.

As future lines of work, we are going to focus our effort on developing a preprocessing step that reduces the computational cost, helping to avoid the quadratic programming cost, while at the same time solving the problem associated with applying the subsampling technique, which is known to produce different results because samples are not removed in the same order each time it is applied.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work has been funded by the European Union Seventh Framework Programme [FP7/REGPOT-2012-2013.1] under Grant agreement no. 316265, BIOCAPS, the “Platform of integration of intelligent techniques for analysis of biomedical information” project (TIN2013-47153-C3-3-R) from Spanish Ministry of Economy and Competitiveness and the [14VI05] Contract-Programme from the University of Vigo.