BioMed Research International

Volume 2015 (2015), Article ID 878291, 14 pages

http://dx.doi.org/10.1155/2015/878291

## A Linear-RBF Multikernel SVM to Classify Big Text Corpora

Department of Computer Science, Higher Technical School of Computer Engineering, University of Vigo, 32004 Ourense, Spain

Received 22 August 2014; Revised 10 November 2014; Accepted 13 November 2014

Academic Editor: Juan M. Corchado

Copyright © 2015 R. Romero et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Support vector machine (SVM) is a powerful technique for classification. However, SVM is not suitable for the classification of large datasets or text corpora, because the training complexity of SVMs is highly dependent on the input size. Recent developments in the literature on SVMs and other kernel methods emphasize the need to consider multiple kernels, or parameterizations of kernels, because they provide greater flexibility. This paper presents a multikernel SVM that manages high-dimensional data, providing automatic parameterization at low computational cost and improving on SVMs parameterized via brute-force search. The model consists of splitting the dataset into cohesive term slices (clusters) to construct a defined structure (the multikernel). The new approach is tested on different text corpora. Experimental results show that the new classifier has good accuracy compared with the classic SVM, while training is significantly faster than with several other SVM classifiers.

#### 1. Introduction

The amount of information stored in public resources continues to grow. For example, the Medline bibliographic database, the most important source in the biomedical domain, has stored documents since 1950 and contains more than 22 million citations. Thus, in order to manage this volume of documents, sophisticated computer tools must be considered.

In recent years, researchers have shown special interest in applying text mining techniques, such as pattern recognition, automatic categorization, or classification, to the field of biomedicine. To obtain good results, a unified data structure for representing documents must be established.

A well-known data structure supported by the scientific community is the *sparse matrix* [1], which classifiers commonly take as input. In it, each document is decomposed into a vector of its most relevant terms (words).
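As a minimal sketch of this representation, the following builds a sparse document-term matrix with scikit-learn's `TfidfVectorizer`; the three toy documents are illustrative, not drawn from the corpora used in the paper.

```python
# Sparse document-term matrix: each row is a document, each column a term,
# and most entries are zero, hence the sparse (CSR) storage.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "gene expression in cancer cells",
    "protein folding and gene regulation",
    "cancer treatment with targeted therapy",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix

print(X.shape)  # (3 documents, vocabulary size)
print(X.nnz)    # number of nonzero entries actually stored
```

Only the nonzero weights are stored, which is what keeps the structure tractable for large corpora.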

Unfortunately, although an efficient data structure solves problems related to performance, other inconveniences related to corpus size negatively impact classifiers and their accuracy. Data imbalance problems exist in a broad range of experimental data and have captured the attention of researchers [2, 3]. Data imbalance occurs when the majority class in a document corpus is represented by a large portion of the documents, while the minority class has only a small percentage [4]. When a text classifier encounters an imbalanced document corpus, the performance of machine learning algorithms often decreases [5–8].
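One common mitigation for this imbalance (a standard scikit-learn option, not the method proposed in this paper) is to reweight classes inversely to their frequency, as sketched below on synthetic data:

```python
# Imbalanced toy data: 95 "non-relevant" vs 5 "relevant" samples.
# class_weight="balanced" rescales C per class as
# n_samples / (n_classes * class_count).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 10)), rng.normal(2, 1, (5, 10))])
y = np.array([0] * 95 + [1] * 5)

clf = SVC(kernel="linear", class_weight="balanced").fit(X, y)
print(clf.class_weight_)  # minority class receives the larger weight
```

The minority class thus contributes proportionally more to the margin penalty, counteracting its small share of the corpus.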

Another important situation in a classification process, which can render the problem unmanageable, relates to the sparse matrix dimensionality. The matrix dimension is directly tied to the number of attributes (terms) of the documents it contains, affecting classifier performance and incurring a high computational cost. At this point, algorithms that select the relevant terms from the whole data structure must be considered. As a result, an optimized sparse matrix is generated.
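The term-selection step can be sketched with a chi-squared filter that keeps only the k most class-discriminative terms; the choice of chi-squared and of k here is illustrative, since the paper does not prescribe a particular selector at this point:

```python
# Reduce the sparse matrix to its most discriminative terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cancer gene therapy", "gene regulation study",
        "football match report", "football season results"]
labels = [1, 1, 0, 0]  # 1 = biomedical, 0 = other

X = CountVectorizer().fit_transform(docs)          # full sparse matrix
X_reduced = SelectKBest(chi2, k=3).fit_transform(X, labels)

print(X.shape[1], "->", X_reduced.shape[1])        # optimized sparse matrix
```

The reduced matrix keeps the same rows (documents) but far fewer columns, which is exactly the trade-off motivating term selection.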

Regarding classifiers, the support vector machine (SVM) [9–12] is one of the best-known classification techniques used within the scientific community. It obtains good results in a variety of classification problems, although determining its parameterization with imbalanced data is difficult. An SVM classifier uses a kernel function to transform the data and change the workspace, separating relevant from nonrelevant documents. Given that some kernels have additional parameters that must be selected, parameterizing an SVM has a high cost.
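The kernel transformation can be made concrete with the RBF kernel, K(x, y) = exp(-γ‖x − y‖²), one of the two kernels in the paper's title: it maps data into a space where similarity decays with distance, and its γ parameter (together with the SVM's C) is what must be tuned. The points below are illustrative.

```python
# RBF kernel values: near-identical points score close to 1,
# distant points close to 0.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
y_near = np.array([[0.1, 0.0]])
y_far = np.array([[5.0, 0.0]])

k_near = rbf_kernel(x, y_near, gamma=1.0)[0, 0]  # exp(-0.01), close to 1
k_far = rbf_kernel(x, y_far, gamma=1.0)[0, 0]    # exp(-25), close to 0
print(k_near, k_far)
```

Because γ controls how fast this similarity decays, a poor choice collapses all documents to "similar" or "dissimilar", which is why kernel parameterization is costly.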

As with other classifiers, SVMs are not suitable for classifying large datasets due to their high training complexity. Support vectors are computed internally to represent the dataset; this helps to find a hyperplane that separates the contents of each class. The complexity of an SVM is given by the number of support vectors needed to obtain the hyperplane. Data dimensionality negatively affects the kernel coverage, such that a single kernel may not be enough to achieve an optimal division between classes.

One solution is to divide the dataset into small portions, attaching a specific kernel to each slice, decreasing the training complexity, and improving classification results. This idea is known as an *SVM based on a multikernel transformation*.

Multikernel algorithms combine predefined kernels in order to obtain accurate hyperplanes. These kernels and their parameterization are usually determined by different learning methods. However, no single learning method efficiently covers all classification scenarios, because suitability is highly dependent on the field of study. Gönen and Alpaydın [14] categorize existing multikernel algorithms according to their learning methods and properties.

(i) *Fixed rules* are functions that combine multiple single kernels, grouping them as sums or products and working over the data slice by slice [15, 16]. The kernels are usually unweighted and need no training before being applied. However, other approximations include coefficients that weigh each term in order to penalize some parts of the multikernel. Even so, the coefficient values are adjusted based on empirical results or brute-force algorithms.

(ii) *Heuristic approaches* extend the idea behind fixed rules by weighing each multikernel term with the best coefficient values [17, 18]. These values are usually determined by unsupervised algorithms such as ID3 trees, hierarchical clustering, or self-organizing maps, among others, which may be applied separately (term by term) or over all terms at once. In almost all cases the search space (original or feature) is extremely wide, turning the scenario into an NP-complete problem. Thus, the computational cost and the system performance must be taken into account.

(iii) *Optimization approaches* consist in providing optimal values for the kernel function parameters. Usually based on external models, this optimization can be integrated as part of a kernel-based learner, or reformulated as a separate mathematical model that yields the parameter values, which then parameterize the learner [19, 20].

(iv) In *Bayesian approaches*, kernels are combined and interpreted as probabilistic variables. These kernel parameters are used to perform inference, learning them together with the base learner parameters. Bayesian functions measure the quality of the resulting kernel function, constructed from candidate kernels, using a Bayesian formulation. In general, the likelihood or the posterior is used as the target function to find the maximum likelihood estimator and thereby obtain the model parameter values [21, 22].

(v) *Boosting approaches*, inspired by ensemble algorithms, combine weak learning models to produce a new, stronger complex model [23]. A set of pairwise SVM-kernels may be configured and trained separately, with a final vote taken at the testing stage. The combination can be done in different ways, including the previous approaches; the models may be predefined, or new kernels can be added until performance stops improving [23, 24].
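A fixed-rule combination in the sense of category (i) can be sketched as a weighted sum of a linear and an RBF kernel, passed to an SVM through a precomputed Gram matrix. The weight w, the γ value, and the synthetic data below are illustrative choices, not the paper's own construction.

```python
# Fixed-rule multikernel: K = w * K_linear + (1 - w) * K_rbf,
# fed to an SVM via kernel="precomputed".
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

def multi_kernel(A, B, w=0.5, gamma=0.1):
    """Convex combination of a linear and an RBF kernel."""
    return w * linear_kernel(A, B) + (1 - w) * rbf_kernel(A, B, gamma=gamma)

rng = np.random.default_rng(1)
# Two well-separated synthetic classes of 20 samples each.
X = np.vstack([rng.normal(-1, 0.5, (20, 4)), rng.normal(1, 0.5, (20, 4))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="precomputed").fit(multi_kernel(X, X), y)
pred = clf.predict(multi_kernel(X, X))  # Gram matrix: test rows vs train cols
print((pred == y).mean())
```

Because any convex combination of positive semidefinite kernels is itself a valid kernel, this sum can be used wherever a single kernel would be; learning good weights instead of fixing them is what the later categories address.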

In this paper, we present a multikernel SVM to manage high-dimensional data, providing automatic parameterization with low computational cost and improving results against SVMs parameterized under a brute-force search.

The remainder of the paper proceeds as follows. The general text classification model is described in Section 2. The proposed model is presented in Section 3, explaining how it differs from the general one. The analysis of the experimental tests and comparative results with other authors are shown in Section 4. Finally, the most relevant conclusions are collected in Section 5.

#### 2. Text Classification

Text classification focuses on assigning a class to each document of a corpus. A class thus encloses the documents that are representative of a specific topic. The class assignment can be performed manually or automatically.

In general, the text classification process comprises a set of steps, as shown in Figure 1. These steps are detailed in the following subsections.