Abstract

With the rapid growth of online information, interest in text categorization (classification) has also grown. In text classification, the goal is to assign documents to predefined classes (categories or labels). Many data mining methods have been applied to text classification in the literature, but polyhedral conic function (PCF) methods have not. In this paper, PCFs are used to classify documents. Separation algorithms based on PCFs, which involve linear programming subproblems with inequality constraints, are presented. Numerical experiments are carried out on real-world text datasets. The PCF algorithms are compared with state-of-the-art methods, and the tenfold cross-validation results, accuracy values, and running times are reported in tables. The results indicate that, in text classification, PCF methods are as effective as state-of-the-art methods in terms of accuracy.

1. Introduction

Supervised data classification is one of the essential fields in data mining. Research in this field deals with categorizing data so that it can be used as effectively and efficiently as possible. The objective of supervised data classification is to determine classification rules from a training set, which consists of data features whose labels (classes or categories) are known. The rules are learned on training subsets of the given dataset, and their utility is then examined on the test set. Supervised classification has many application areas, such as medicine, engineering, business, and education [1–4]. Various learning algorithms for supervised data classification have been defined in machine learning. For instance, linear regression, logistic regression, decision trees, support vector machines, Naive Bayes, K-nearest neighbour, K-means, random forest, dimensionality reduction algorithms, and gradient boosting and AdaBoost are among the most commonly used ones [5].

Supervised data classification in which the dataset consists of text data is called text classification. With the rapid growth of online information, it has become difficult to control, present, and archive text data uniformly. Text classification has become one of the main techniques for organizing text data; it is used to classify columns and news items by subject, to support a user's search on hypertext, to assist browsing on the Internet, and so forth. Because building text classifiers by hand is gruelling and time-consuming, data mining techniques are utilized in text classification [6, 7].

For text classification, besides the commonly used supervised classification techniques, we wish to try polyhedral conic functions as supervised classification functions that map documents to labels (classes) [8]. In the following review of the state of the art, we sketch some of the learning techniques used for text categorization in the literature. The process of text classification is examined and the mathematical model of a text classification problem is presented in Section 3. In the fourth section, polyhedral conic functions are explained and their use in data classification is described by presenting the algorithms from the literature. In the fifth section, the algorithms based on polyhedral conic functions are adapted to text categorization problems. In the sixth section, numerical experiments are carried out by implementing the defined algorithms on a selected real-world dataset. The obtained running times and the training and test accuracy values are presented in tables. In addition, for comparison with state-of-the-art methods and to assess the efficiency of the defined algorithms on large datasets, the algorithms are run on various real-world datasets from the UCI Machine Learning Repository. The last section concludes the paper.

2. Related Works

In the literature, several authors have proposed approaches to the text classification problem. Text categorization (text classification) is the process of automatically assigning a set of documents to classes (categories) by using a predefined training dataset. Researchers are strongly interested in text classification because of the development of technology and the growing number of electronic documents available from many sources. The whole process of text classification consists of several steps, which are introduced in the third section. In our study, we focus on the data mining step (learning models). Since we work with a supervised learning model, in this section on related works we sketch some of the machine learning techniques commonly used in the literature to train a text classification model and explain the approaches they take.

The K-nearest neighbour (KNN) classifier is based on the hypothesis that the class (category or label) of a sample is most likely the class of the samples closest to it in the vector space. The training set is viewed in a multidimensional feature space, which is divided into zones according to the defined classes. An instance is assigned to a class if that class is the most common one among its $k$ nearest training samples. The Euclidean distance is commonly used as the distance metric between points. The method is flexible, since various similarity measures can be used to define the neighbours of an instance [9]. A comparative study of the KNN and SVM methods was done in [10], and the KNN method in text classification is also examined in [11–13].
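As an illustration of the KNN rule described above, the following minimal sketch assigns a query vector to the majority class among its $k$ nearest training vectors under the Euclidean distance. The function name `knn_predict`, the toy vectors, and the labels are hypothetical and are not taken from the cited works.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    `train` is a list of (vector, label) pairs (illustrative layout).
    """
    # Sort training samples by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-term document vectors (assumed data).
train = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "politics"),
    ([0.1, 0.9], "politics"),
]
print(knn_predict(train, [0.8, 0.2], k=3))
```

Any other distance or similarity measure can be substituted for `math.dist` without changing the voting step.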

Rocchio’s method is a vector space method for document filtering or routing in information retrieval. In this method, a prototype vector is created for each class from the training set, for instance, the mean vector of the points in class $c$. The similarity between a test document and each of the prototype vectors is calculated, and the test document is assigned to the class with maximum similarity [14]. In [15–17], this method is examined for text categorization and information retrieval. In [18], a new algorithm called HI-Rocchio is proposed, which combines two methods: Rocchio’s method and hierarchical clustering. The experimental results reported there verify the effectiveness of the algorithm.
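The prototype idea can be sketched as follows, assuming the class prototype is the mean vector of the class and the similarity is the cosine measure (one common choice; the cited works may use other weightings). All names and data are illustrative.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rocchio_fit(train):
    """Build one prototype (mean vector) per class from (vector, label) pairs."""
    by_class = {}
    for vec, label in train:
        by_class.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}


def rocchio_predict(prototypes, doc):
    """Assign `doc` to the class whose prototype is most similar."""
    return max(prototypes, key=lambda label: cosine(prototypes[label], doc))


train = [([2, 0, 1], "a"), ([4, 0, 1], "a"), ([0, 3, 1], "b")]
prototypes = rocchio_fit(train)
print(rocchio_predict(prototypes, [1, 0, 0]))
```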

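The Naive Bayes (NB) method is based on probability. The optimal class in the NB method is the most likely, or maximum a posteriori (MAP), class $c_{\mathrm{map}}$:

$$c_{\mathrm{map}} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c) \prod_{1 \le k \le n_d} P(t_k \mid c).$$

Here $d$ is a document with terms $t_1, \ldots, t_{n_d}$; $c \in C$ is a predicted class, where $C = \{c_1, c_2, \ldots, c_J\}$ is a fixed set of classes. $P(t_k \mid c)$ is a measure of how much evidence the term $t_k$ contributes that $c$ is the right class. $P(c)$ is the prior probability that a document belongs to class $c$ [9].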
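The MAP rule can be sketched in a few lines. The version below is a multinomial Naive Bayes classifier that works in log space and uses add-one (Laplace) smoothing; the smoothing, the function names, and the toy corpus are assumptions made for illustration and are not specified in the text above.

```python
import math
from collections import Counter, defaultdict


def nb_train(docs):
    """docs: list of (tokens, label). Returns the counts needed for prediction."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab, len(docs)


def nb_predict(model, tokens):
    """Return the class maximizing log P(c) + sum_k log P(t_k | c)."""
    class_docs, term_counts, vocab, n_docs = model
    best, best_score = None, -math.inf
    for c in class_docs:
        score = math.log(class_docs[c] / n_docs)  # log prior
        total = sum(term_counts[c].values())
        for t in tokens:
            if t in vocab:  # unseen terms are ignored
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best


docs = [
    (["goal", "match"], "sports"),
    (["match", "team"], "sports"),
    (["vote", "party"], "politics"),
]
model = nb_train(docs)
print(nb_predict(model, ["match", "goal"]))
```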

In [19–22], the NB method is examined and the performance of NB algorithms is compared with that of other learning methods.

The decision tree method uses a tree structure to classify training documents. In a decision tree, leaves represent the classes of documents and branches represent the conjunctions of features that lead to those classes [10]. In [10, 23, 24], decision tree models in text categorization are examined.

Support vector machine (SVM) is a machine learning method defined by V. Vapnik et al. in 1990. In this method, discriminant-based optimization is used and the parameters of a linear separator are found by using labeled datasets. The SVM method is utilized by many researchers in different areas [25]. In [6, 7, 10, 12], the SVM learning method is studied for text categorization and compared with other learning methods on different datasets. In [26], news articles are used to predict intraday price movements of financial assets by using an SVM algorithm with a given kernel matrix in the training process. Multiple kernel learning is used to combine equity returns with text as predictive features. It is seen that text features produce significantly better performance than historical returns alone.

The classification via regression method uses regression methods for classification. The class is binarized and one regression model is built for each class value. In [22], classification via regression is used as a text classification task to detect child exploitation chats in a mixed chat dataset, and it is seen that Naive Bayes and this method are competitive with each other, in that they detect almost the same number of child exploitation chats.

In addition, different researchers have studied text classification by combining text classifiers to improve the efficiency of classification. In [27], Fragos K. et al. combined methods that belong to the same (probabilistic) paradigm: Naive Bayes and maximum entropy classifiers are combined and tested on applications where their individual performance is good. In [28], S. Keretna et al. combined the individual results of Conditional Random Field (CRF) classifiers and maximum entropy (ME) classifiers on medical text. Both studies obtain better performance than the individual classifiers. The combined text classifiers proposed up to 2016 are reviewed in [29].

In [30], all these methods and their improvements are compared and discussed. The authors observe that each researcher tests improvements on their own datasets, which makes comparison more difficult. For this reason, in this paper, besides our own test dataset, commonly used and easily accessible benchmark datasets are used in the testing phases.

The most recent article that gives an overview of the state of the art in text classification was published by Mironczuk M. and Protasiewicz J. [31]. They reviewed works on text classification according to data collection, data analysis for labeling, feature construction and weighting, feature selection and projection, training of a classification model, and solution evaluation. They found numerous papers on training algorithms in text classification [32–35]. In addition to the approaches given above, they found two more methods for training a classification function in the literature: the neural network classifier and artificial immune systems, studied in [33] and [36], respectively.

In this study, we carry out the data mining step of text classification with a classifier that differs from the above approaches in the literature. We aim to obtain better performance than the previous approaches by using mathematical programming and polyhedral conic functions in the training algorithm of the text categorization process.

3. Text Classification

The solution of a data classification problem consists of two steps. In the first step, called the learning step, a classifier function that describes a predetermined set of data classes is built: a classification algorithm constructs the classifier by analyzing a training set made up of data and their associated class labels. In the second step, the obtained classifier function is tested on a test set, and its effectiveness is determined by the evaluation process. All these steps and the preparation processes are explained in the following paragraphs for the text classification task.

Text classification, namely, text categorization, aims at classifying documents into a fixed number of predefined classes (labels). The choice of a proper and effective algorithm plays an important role in obtaining good text classification results. However, the rest of the text classification process should not be ignored. The steps of this process are as follows:
(i) Determining the text data collection
(ii) Text preprocessing
(iii) Attribute selection
(iv) Text transformation
(v) Data mining
(vi) Evaluation

In determining the text data collection, document datasets (html, pdf, doc, web content, etc.) are assembled. These datasets consist of many words.

In text preprocessing, the text documents are converted into a clean word format, e.g., "expression" to "express" and "behaviour" to "behave". Stop words, conjunctions, and meaningless expressions are removed, and then the roots of the words are determined. The steps commonly taken in text preprocessing are tokenization and the removal of stop words, i.e., frequently occurring words such as “the” and “and” [37].
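A minimal sketch of tokenization and stop-word removal is given below. The stop-word list is a tiny illustrative sample rather than the list used in the experiments, and stemming (root extraction) is deliberately left out because it depends on the language and the stemmer chosen.

```python
import re

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("The mood of the blogger is cheerful and calm."))
```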

In attribute selection, the important words in the preprocessed documents are detected, and irrelevant words, for instance, words that occur in all or nearly all of the documents, are eliminated.

In text transformation, documents are given a representation suitable for the learning algorithm; that is, unstructured data are transformed into structured data. The aim is to reduce the complexity of the documents and make them easier to handle by transforming the full text of each document into a document vector. The vector space model (SMART), in which documents are represented by vectors of words, is the most commonly used document representation. Some of the limitations of this model are the high dimensionality of the representation, the loss of correlation with adjacent words, and the loss of the semantic relationships that exist among the terms in a document. To overcome these problems, term weighting methods are used to assign appropriate weights to the terms [37].

In the vector representation, the term-document matrix of size d×t is created; here $d$ represents the number of documents and $t$ the number of terms. The (i,j)th entry of the d×t matrix stands for the density of the jth term in the ith document. By using the d×t matrix, any document from the collection can be represented by various methods, such as bag of words and the vector space model (SMART).

The documents used in this paper are represented in vector form. TF(i,j), called the term density, is the weight of the jth term in the ith document. IDF(j), called the inverse document density, is the weight of the jth term in the whole collection for a d×t term-document matrix. The classical TF-IDF formula is as follows:

$$w(i,j) = \mathrm{TF}(i,j) \times \mathrm{IDF}(j),$$

where $w(i,j)$ is called the weight of the jth term in the ith document.
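The weighting can be sketched on a small d×t count matrix as follows. The text does not fix a particular IDF variant, so this sketch assumes the common form IDF(j) = log(d / df(j)), where df(j) is the number of documents containing term j; the function name and the data are illustrative.

```python
import math


def tf_idf(counts):
    """Turn a d x t matrix of raw term counts into TF-IDF weights w(i, j).

    Assumes TF(i, j) is the raw count and IDF(j) = log(d / df(j)).
    """
    d = len(counts)
    t = len(counts[0])
    # Document frequency: number of documents in which term j occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(t)]
    idf = [math.log(d / df[j]) if df[j] else 0.0 for j in range(t)]
    return [[row[j] * idf[j] for j in range(t)] for row in counts]


counts = [[2, 1, 0],
          [0, 1, 3]]
weights = tf_idf(counts)
print(weights)
```

In this toy matrix the second term occurs in every document, so its IDF and therefore its weight is zero, which matches the intent of down-weighting terms that do not discriminate between documents.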

In the data mining step, a proper and effective method and algorithm are chosen and applied to the transformed dataset. Methods such as Naive Bayes, Rocchio’s method, and the k-nearest neighbour classifier are used for the classification of text data. We expect that separation via PCFs, which is based on mathematical optimization, is also applicable to text data. We therefore test the PCF separation algorithms on a real-world dataset in this paper. Separation with PCFs is described in detail in Section 4.

The mathematical model of a binary classification problem can be introduced in terms of linear separability or polyhedral separability. Both are explained in [38] as follows.

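Let $A$ and $B$ be given sets containing $m$ and $p$ $n$-dimensional vectors, respectively:

$$A = \{a^1, \ldots, a^m\},\ a^i \in \mathbb{R}^n,\ i = 1, \ldots, m, \qquad B = \{b^1, \ldots, b^p\},\ b^j \in \mathbb{R}^n,\ j = 1, \ldots, p.$$

The sets $A$ and $B$ are linearly separable if there is a hyperplane $\{x, y\}$, with $x \in \mathbb{R}^n$, $y \in \mathbb{R}$, such that, for any $i = 1, \ldots, m$,

$$\langle x, a^i \rangle - y < 0,$$

and, for any $j = 1, \ldots, p$,

$$\langle x, b^j \rangle - y > 0.$$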

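A characterization of linear separability is that the convex hulls of the two sets do not intersect. If the intersection is not empty, it is possible to obtain a hyperplane that minimizes some misclassification measure, or even to look for nonlinear separating surfaces. The problem of finding this hyperplane is formulated as the following optimization problem [39]:

$$\text{minimize } f(x, y) \quad \text{subject to } (x, y) \in \mathbb{R}^{n+1},$$

where

$$f(x, y) = \frac{1}{m} \sum_{i=1}^{m} \max\bigl(0, \langle x, a^i \rangle - y + 1\bigr) + \frac{1}{p} \sum_{j=1}^{p} \max\bigl(0, -\langle x, b^j \rangle + y + 1\bigr)$$

is an error function. Here $\langle \cdot, \cdot \rangle$ stands for the scalar product in $\mathbb{R}^n$. It is shown that the given minimization problem is equivalent to the following linear program [39]:

$$\text{minimize } \frac{1}{m} \sum_{i=1}^{m} t_i + \frac{1}{p} \sum_{j=1}^{p} z_j$$

subject to

$$t_i \ge \langle x, a^i \rangle - y + 1,\ i = 1, \ldots, m, \qquad z_j \ge -\langle x, b^j \rangle + y + 1,\ j = 1, \ldots, p, \qquad t \ge 0,\ z \ge 0,$$

where $t_i$ is nonnegative and shows the error for the data $a^i \in A$ and $z_j$ is nonnegative and shows the error for the data $b^j \in B$.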
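To make the misclassification measure concrete, the sketch below evaluates an error function of the usual averaged hinge form, $\frac{1}{m}\sum_i \max(0, \langle x, a^i\rangle - y + 1) + \frac{1}{p}\sum_j \max(0, -\langle x, b^j\rangle + y + 1)$, for a given hyperplane $\{x, y\}$. This form is an assumption of the sketch, the names are hypothetical, and no linear program is solved here; the function only scores a candidate hyperplane.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def separation_error(x, y, A, B):
    """Average hinge-type error of the hyperplane {x, y} on the sets A and B.

    Points of A should satisfy <x, a> - y < 0, points of B <x, b> - y > 0.
    """
    err_a = sum(max(0.0, dot(x, a) - y + 1.0) for a in A) / len(A)
    err_b = sum(max(0.0, -dot(x, b) + y + 1.0) for b in B) / len(B)
    return err_a + err_b


A = [[0, 0], [1, 0]]
B = [[4, 0], [5, 0]]
print(separation_error([1, 0], 2.5, A, B))  # hyperplane between the two sets
print(separation_error([1, 0], 0.5, A, B))  # hyperplane cutting through A
```

An error of zero means that every point lies on its own side of the hyperplane with a margin of at least one.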

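The concept of $h$-polyhedral separability was introduced in [40]. The sets $A$ and $B$ are $h$-polyhedrally separable if there is a set of $h$ hyperplanes $\{x^k, y_k\}$, with $x^k \in \mathbb{R}^n$, $y_k \in \mathbb{R}$, $k = 1, \ldots, h$, such that
(1) for any $i = 1, \ldots, m$ and $k = 1, \ldots, h$, $\langle x^k, a^i \rangle - y_k < 0$, and
(2) for any $j = 1, \ldots, p$ there is at least one $k \in \{1, \ldots, h\}$ such that $\langle x^k, b^j \rangle - y_k > 0$.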

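The problem of polyhedral separability of the sets $A$ and $B$ is reduced to the following problem [40]:

$$\text{minimize } f(x, y) \quad \text{subject to } (x, y) \in \mathbb{R}^{(n+1) \times h},$$

where

$$f(x, y) = \frac{1}{m} \sum_{i=1}^{m} \max\Bigl[0, \max_{1 \le k \le h} \bigl\{\langle x^k, a^i \rangle - y_k + 1\bigr\}\Bigr] + \frac{1}{p} \sum_{j=1}^{p} \max\Bigl[0, \min_{1 \le k \le h} \bigl\{-\langle x^k, b^j \rangle + y_k + 1\bigr\}\Bigr]$$

is an error function. An algorithm for solving this minimization problem is also developed in [40]. The calculation of the descent direction at each iteration of this algorithm is reduced to a certain linear programming problem.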

In addition, all the mathematical optimization techniques introduced above can be applied to multiclass classification problems, where there are more than two classes, by using the one-versus-all strategy. This means that, for a given dataset $A$ with $q \ge 2$ classes $A_1, \ldots, A_q$, any class $A_j$, $j = 1, \ldots, q$, is taken as the set $A$, and the set $B$ is defined as the union of all remaining classes [41].
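The one-versus-all construction amounts to a simple regrouping of the labeled data, as the following sketch shows. The function name and the toy samples are illustrative; any binary separation method can then be applied to each pair of sets that it yields.

```python
def one_vs_all_splits(samples):
    """Yield (label, A, B) for each class of a labeled dataset.

    A holds the vectors of the current class, B the vectors of all the
    remaining classes.
    """
    labels = sorted({label for _, label in samples})
    for target in labels:
        A = [vec for vec, label in samples if label == target]
        B = [vec for vec, label in samples if label != target]
        yield target, A, B


samples = [([1], "x"), ([2], "y"), ([3], "z"), ([4], "x")]
for target, A, B in one_vs_all_splits(samples):
    print(target, A, B)
```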

In a text classification problem, we are given a description $d \in X$ of a document, where $X$ is the document space, which includes blog posts, news stories, articles, web pages, and technical reports, and a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$. The classes are in general subjects, authors, and topics, but may also be based on types and interests. Classes are defined by humans according to the needs of the problem. This is a supervised learning problem, since we work with a given training set $D$ of labeled documents

$$(d, c) \in X \times C.$$

For example, $(d, c)$ = (mathematical optimization, life sciences) indicates that the document "mathematical optimization" is labeled with the class "life sciences".

Returning to the representation of the document collection: since we are working on supervised classification, we add a new column to the d×t matrix, such that the value in the last column represents the class of each document. Thus we use a d×(t+1) matrix in the text classification algorithm. Here $d$ is the number of documents and $t$ is the number of attributes (e.g., word stems).

Here, the objective is to find rules (functions) by means of the training set, i.e., the d×(t+1) matrix, and to evaluate the efficiency of the obtained rules (functions) on the test set.

Accordingly, the dimension of the text classification problem is directly related to the number of documents and the number of word stems in the whole document collection, which together constitute the d×(t+1) matrix.

Many measures have been used in performance evaluation, such as F-measure, fallout, error, and accuracy. In this paper, the accuracy values of the training and testing phases are calculated by applying the cross-validation method. These subjects are treated in detail in Section 6.

In the following section, an approach via polyhedral conic functions based on mathematical optimization is described.

4. Classification via Polyhedral Conic Functions (PCFs)

Polyhedral conic functions (PCFs) were introduced in 2006 by Gasimov and Öztürk to separate two differently labeled point sets, in other words, to split two discrete datasets [8]. Every point is represented by a vector in which every index except the last corresponds to an attribute of the point (data) and the last index stands for the class (label) of the point.

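Polyhedral functions are defined as follows in [8]:

$$g_{(w, \xi, \gamma, a)}(x) = \langle w, x - a \rangle + \xi \|x - a\|_1 - \gamma, \tag{16}$$

where $x \in \mathbb{R}^n$ is an $n$-dimensional point (vector), $w, a \in \mathbb{R}^n$, $\xi, \gamma \in \mathbb{R}$, and $\|x\|_1 = |x_1| + \cdots + |x_n|$.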
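For concreteness, a polyhedral conic function can be evaluated as in the sketch below, which assumes the form $g(x) = \langle w, x - a \rangle + \xi \|x - a\|_1 - \gamma$ with center $a$. The function name `pcf` and the parameter values are illustrative only.

```python
def pcf(x, w, xi, gamma, a):
    """Evaluate <w, x - a> + xi * ||x - a||_1 - gamma (assumed PCF form)."""
    shifted = [xk - ak for xk, ak in zip(x, a)]
    linear = sum(wk * sk for wk, sk in zip(w, shifted))
    l1_norm = sum(abs(sk) for sk in shifted)
    return linear + xi * l1_norm - gamma


# With w = 0, xi = 1, gamma = 1 the sublevel set {x : g(x) <= 0} is the
# l1-ball of radius 1 around the center a.
print(pcf([0.0, 0.0], [0.0, 0.0], 1.0, 1.0, [0.0, 0.0]))  # at the center
print(pcf([2.0, 0.0], [0.0, 0.0], 1.0, 1.0, [0.0, 0.0]))  # outside the ball
```

Points with a nonpositive value lie inside the conic set, and points with a positive value lie outside it, which is how the separating functions in the algorithms below are read.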

Definition 2 and Lemma 1 quoted below are given and proved in [8].

Lemma 1. A graph of the function $g_{(w, \xi, \gamma, a)}$ defined in (16) is a polyhedral cone with a vertex at $(a, -\gamma) \in \mathbb{R}^n \times \mathbb{R}$. This cone is called a polyhedral conic set and $a$ its center.

It follows from Lemma 1 that every polyhedral function given in (16) performs as a polyhedral conic function (PCF).

Definition 2. A function $g: \mathbb{R}^n \to \mathbb{R}$ is called polyhedral conic if its graph is a cone and all its level sets are polyhedrons.

The first separation algorithm via PCFs was defined in [8] as follows:

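Let $A$ and $B$ be given sets containing $m$ and $p$ $n$-dimensional vectors, respectively:

$$A = \{a^i \in \mathbb{R}^n : i \in I\}, \quad B = \{b^j \in \mathbb{R}^n : j \in J\}, \quad I = \{1, \ldots, m\},\ J = \{1, \ldots, p\}.$$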

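Algorithm 3. Binary classification via PCFs.
Step 0 (initialization step). Let $l = 1$, $I_l = I$, $A_l = A$ and go to Step 1.
Step 1. Let $a^l$ be an arbitrary point of $A_l$. Solve subproblem $(P_l)$:

$$\min\ \frac{e_m^{T} y}{m} \tag{18}$$

subject to

$$\langle w, a^i - a^l \rangle + \xi \|a^i - a^l\|_1 - \gamma + 1 \le y_i, \quad i \in I_l, \tag{19}$$

$$-\langle w, b^j - a^l \rangle - \xi \|b^j - a^l\|_1 + \gamma + 1 \le 0, \quad j \in J, \tag{20}$$

$$y = (y_1, \ldots, y_m) \in \mathbb{R}^m_+, \quad w \in \mathbb{R}^n, \quad \xi \in \mathbb{R}, \quad \gamma \ge 1,$$

where $e_m$ is the vector of ones in $\mathbb{R}^m$. Let $(w^l, \xi^l, \gamma^l, y^l)$ be a solution of $(P_l)$. Let

$$g_l(x) = \langle w^l, x - a^l \rangle + \xi^l \|x - a^l\|_1 - \gamma^l.$$

Step 2. Let $I_{l+1} = \{i \in I_l : g_l(a^i) + 1 > 0\}$, $A_{l+1} = \{a^i \in A_l : i \in I_{l+1}\}$, and $l = l + 1$. If $A_l \ne \emptyset$, go to Step 1.
Step 3. Determine the function $g(x)$ (parting the sets $A$ and $B$) as

$$g(x) = \min_{l} g_l(x)$$

and stop.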

This algorithm was modified for binary classification problems in [42, 43]. A clustering algorithm is added to the initialization step to decrease the running time by reducing the number of steps required to find the center points of the polyhedral conic functions. Clustering algorithms form groups of objects that share common properties [44], and several algorithms have been studied for clustering [45, 46]. In [43], one of the most efficient clustering algorithms, the k-means method, was used, and in [42] the k-medoids method, which differs from k-means in the nature of the center points it determines, was tried. In addition, constraint (20) of the subproblem was relaxed to avoid large differences between the accuracy values on the training and test sets (called overfitting) by allowing misclassifications, as in (26). Together with this change, the objective (18) of the subproblem is changed as in (24). The modified PCF algorithm was defined in [43] as follows.

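Algorithm 4. Binary classification via PCFs and clustering method.
Step 0 (initialization step). Apply the k-means clustering algorithm over the set $A$. Let $r$ be the number of clusters and $k = 1$. Let $I_k = I$.
Step 1. Let $a^k$ be the center of the $k$th cluster. Solve subproblem $(P_k)$, whose objective (24) extends (18) with the errors $z_j$ of the set $B$, subject to

$$\langle w, a^i - a^k \rangle + \xi \|a^i - a^k\|_1 - \gamma + 1 \le y_i, \quad i \in I, \tag{25}$$

$$-\langle w, b^j - a^k \rangle - \xi \|b^j - a^k\|_1 + \gamma + 1 \le z_j, \quad j \in J, \tag{26}$$

with $y \ge 0$, $z \ge 0$, $w \in \mathbb{R}^n$, $\xi \in \mathbb{R}$, $\gamma \ge 1$. Let $(w^k, \xi^k, \gamma^k, y^k, z^k)$ be a solution of $(P_k)$. Let

$$g_k(x) = \langle w^k, x - a^k \rangle + \xi^k \|x - a^k\|_1 - \gamma^k.$$

Step 2. If $k < r$, let $k = k + 1$, and go to Step 1.
Step 3. Determine the function $g(x)$ (parting the sets $A$ and $B$) as

$$g(x) = \min_{k = 1, \ldots, r} g_k(x)$$

and stop.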

5. PCF Algorithms for Text Categorization

In this paper, PCF algorithms are used for text categorization. Algorithms 3 and 4 are both defined for binary classification problems; however, many text categorization problems include more than two classes, so multiclass classification algorithms are needed. The only difference between binary and multiclass classification problems is the number of classes. For this reason, binary classification methods can be adapted to multiclass classification problems simply by applying Algorithm 3 or 4 (the binary classification algorithm) between each class and the rest. The number of classifiers formed during the algorithm is $n \cdot k$; here $n$ is the number of classes and $k$ represents the number of clusters. In every iteration, the binary classification algorithm is applied to the sets $A_j$, $j = 1, 2, \ldots, n$, and $A \setminus A_j$, so $k$ different classifiers are formed. In the testing phase, the class of a point $a$ is defined by

$$\arg\min_{j = 1, \ldots, n} g_j(a),$$

where $g_j$ denotes the separating function obtained for the class $A_j$.

Therefore, the final separating function is identified as the pointwise minimum of all the functions formed by the binary classifications:

$$g_j(x) = \min_{s = 1, \ldots, k} g_s^j(x), \quad j = 1, \ldots, n.$$
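The testing rule can be sketched as follows, under two assumptions that are labeled here because the exact rule is not recoverable from the text alone: each class function is the pointwise minimum of that class's PCFs, and a point is assigned to the class whose function takes the smallest value. The PCF form, the parameter tuples, and the class names are illustrative.

```python
def pcf(x, w, xi, gamma, a):
    """Evaluate <w, x - a> + xi * ||x - a||_1 - gamma (assumed PCF form)."""
    shifted = [xk - ak for xk, ak in zip(x, a)]
    linear = sum(wk * sk for wk, sk in zip(w, shifted))
    return linear + xi * sum(abs(sk) for sk in shifted) - gamma


def class_score(x, pcfs):
    """Pointwise minimum over all PCFs (w, xi, gamma, a) trained for one class."""
    return min(pcf(x, *params) for params in pcfs)


def predict(x, model):
    """Assumed rule: pick the class whose separating function is smallest at x."""
    return min(model, key=lambda label: class_score(x, model[label]))


# Hypothetical trained model: one PCF per class, centered at different points.
model = {
    "cheerful": [([0.0, 0.0], 1.0, 1.0, [0.0, 0.0])],
    "sad": [([0.0, 0.0], 1.0, 1.0, [5.0, 5.0])],
}
print(predict([0.5, 0.0], model))
print(predict([5.0, 4.0], model))
```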

A multiclass classification algorithm, using clustering method and polyhedral conic functions, is defined as follows in [42].

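Algorithm 5. Multiclass classification algorithm using clustering method and PCFs.
Step 0 (initialization). Let $A_1, \ldots, A_c$ be the classes of the dataset and $l = 1$.
Step 1. Let $A = A_l$ and $B = \bigcup_{j \ne l} A_j$, with index sets $I$ and $J$, respectively.
Step 2. Apply the clustering algorithm in $A$. Let $r$ be the number of clusters and $s = 1$.
Step 3. Let $a^s$ be the $s$th center of $A$. Solve the subproblem $(P_s^l)$, which minimizes the average of the errors $y_i$, $i \in I$, subject to

$$\langle w, a^i - a^s \rangle + \xi \|a^i - a^s\|_1 - \gamma + 1 \le y_i, \quad i \in I, \tag{33}$$

$$-\langle w, b^j - a^s \rangle - \xi \|b^j - a^s\|_1 + \gamma + 1 \le 0, \quad j \in J, \tag{34}$$

with $y \ge 0$, $w \in \mathbb{R}^n$, $\xi \in \mathbb{R}$, $\gamma \ge 1$. Let $(w^s, \xi^s, \gamma^s, y^s)$ be the solution of $(P_s^l)$, and let

$$g_s^l(x) = \langle w^s, x - a^s \rangle + \xi^s \|x - a^s\|_1 - \gamma^s.$$

Step 4. If $s < r$, let $s = s + 1$, and go to Step 3.
Step 5. If $l < c$, let $l = l + 1$ and go to Step 1.
Step 6. Determine the function $g(x)$ parting $A_l$, $l = 1, \ldots, c$, as follows:

$$g_l(x) = \min_{s} g_s^l(x), \quad l = 1, \ldots, c,$$

and stop.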

Algorithm 5 is built from Algorithm 4, but misclassifications are not added as in constraint (26); instead they are excluded, as in constraint (20) of Algorithm 3. In [47], the version of Algorithm 5 with misclassifications added is defined as follows.

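Algorithm 6. Multiclass classification algorithm that allows misclassifications for both of the sets besides clustering method and PCFs.
Step 0 (initialization). Let $A_1, \ldots, A_c$ be the classes of the dataset and $l = 1$.
Step 1. Let $A = A_l$ and $B = \bigcup_{j \ne l} A_j$, with index sets $I$ and $J$, respectively.
Step 2. Apply the clustering algorithm in $A$. Let $r$ be the number of clusters and $s = 1$.
Step 3. Let $a^s$ be the $s$th center of $A$. Solve the subproblem $(P_s^l)$, whose objective penalizes the errors $y_i$ and $z_j$ of both sets, subject to

$$\langle w, a^i - a^s \rangle + \xi \|a^i - a^s\|_1 - \gamma + 1 \le y_i, \quad i \in I, \tag{39}$$

$$-\langle w, b^j - a^s \rangle - \xi \|b^j - a^s\|_1 + \gamma + 1 \le z_j, \quad j \in J, \tag{40}$$

with $y \ge 0$, $z \ge 0$, $w \in \mathbb{R}^n$, $\xi \in \mathbb{R}$, $\gamma \ge 1$. Let $(w^s, \xi^s, \gamma^s, y^s, z^s)$ be the solution of $(P_s^l)$, and let

$$g_s^l(x) = \langle w^s, x - a^s \rangle + \xi^s \|x - a^s\|_1 - \gamma^s.$$

Step 4. If $s < r$, let $s = s + 1$, and go to Step 3.
Step 5. If $l < c$, let $l = l + 1$ and go to Step 1.
Step 6. Determine the function $g(x)$ parting $A_l$, $l = 1, \ldots, c$, as follows:

$$g_l(x) = \min_{s} g_s^l(x), \quad l = 1, \ldots, c,$$

and stop.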

As can be seen, in all of the given algorithms the linear programming subproblem includes inequality constraints (see (19), (20), (25), (26), (33), (34), (39), and (40)). These inequality constraints classify the text into the right category (class) while allowing misclassifications through the error variables ($y_i$, $z_j$), as in (19), (25), (26), (33), (39), and (40). In constraints (20) and (34), no misclassifications are allowed, since the corresponding errors are fixed to zero. Inequality constraints of the form “> 0” force the data to lie outside the obtained polyhedral conic function, whereas inequality constraints of the form “< 0” force the data to fall inside it.

In the following section, given algorithms will be implemented on real-world text datasets for comparison with state-of-the-art methods and to verify the efficiency of PCF algorithms on large datasets.

6. Experiments

First, to verify the efficiency of the PCF algorithms in text categorization, we use a real-world dataset, “The Moods of Bloggers”, which includes 157 blog posts written in four different moods: “cheerful, nervous, sad, and complicated” [48]. The attributes of the instances (feature vectors) are defined by the number of occurrences of every word stem in the document; that is to say, we work with a numerical dataset. A brief description of the dataset is given in Table 1. A desktop computer with an Intel(R) Core(TM) i5-4460 CPU @ 3.20 GHz, 8 GB RAM, and a 64-bit operating system is used in the experiments.

Algorithms 3 and 4 given in Section 4 were designed for binary classification, so, just to see how these algorithms work, we converted The Moods of Bloggers dataset into a binary dataset with two classes, “cheerful” and “others”. The only change is in the number of classes. The implementations are made in MATLAB (a multiparadigm numerical computing environment). The obtained results in terms of running time, accuracy, and F-measure are given in Table 2. Time shows the running time of the algorithm in seconds, and the accuracy value is the ratio of the number of correctly labeled points to the number of points in the whole dataset, as follows [43]:

$$\text{accuracy} = \frac{cc}{te} \times 100,$$

where $cc$ is the number of correctly classified points of the dataset and $te$ is the number of instances of the dataset.
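The accuracy measure is the percentage of correctly classified points, and a minimal sketch follows. The variable names `cc` and `te` follow the definitions in the text, while the function name and the labels are illustrative.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of points whose predicted label equals the true label."""
    cc = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    te = len(true_labels)
    return 100.0 * cc / te


print(accuracy(["cheerful", "others", "cheerful", "others"],
               ["cheerful", "others", "others", "others"]))
```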

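F-measure is the harmonic mean of precision and recall. Precision is the proportion of predicted positive cases that are real positives, and recall is the proportion of actual positive cases that are correctly predicted. With $TP$, $FP$, and $FN$ denoting the numbers of true positives, false positives, and false negatives, these measures are given as follows [49]:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$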
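These three measures can be computed from the true and predicted labels as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions of the sketch.

```python
def precision_recall_f(true_labels, predicted_labels, positive):
    """Precision, recall, and F-measure for the class `positive`."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


print(precision_recall_f([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1))
```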

As seen in Table 2, Algorithm 4 is more efficient than Algorithm 3 with regard to running time. The clustering algorithm added to the initialization step decreases the running time by reducing the number of steps required to find the center points of the polyhedral conic functions and, correspondingly, the number of linear programming subproblems solved. An accuracy value of 100% is obtained with both algorithms, since the PCF algorithm (Algorithm 3) ends after a finite number of iterations and the function defined by the linear programming subproblems strictly separates the sets A and B; this theorem is proved in [8]. Depending on the dataset, however, the accuracy obtained with Algorithm 4 can be lower than that of Algorithm 3, because misclassifications are allowed for both classes.

Most of text categorization problems are multiclass classification problems; in other words, they are formed with more than two categories, so we utilize Algorithms 5 and 6 which are expressed in Section 5. As given in Table 1, The Moods of Bloggers dataset is suitable for these multiclass classification algorithms. Results obtained are given in Table 3.

As is seen in Table 3, Algorithms 5 and 6 are not so different from each other regarding accuracy and running time. Running times are close values since we use clustering algorithm in both of the methods.

We use the terms training and testing in Tables 4 and 5 as performance metrics. Here, the training value is the same as the accuracy, since training and testing are done on the same dataset. The testing value is a more reliable performance metric, which we obtain by cross-validation. We use tenfold cross-validation for a better comparison between PCFs and state-of-the-art methods. In tenfold cross-validation, the dataset $D$ is randomly split into 10 mutually exclusive subsets (the folds) $D_1, D_2, \ldots, D_{10}$ of approximately equal size. The inducer is trained and tested 10 times; each time $t \in \{1, 2, \ldots, 10\}$, it is trained on $D \setminus D_t$ and tested on $D_t$ [50]. The testing value presented in Tables 4 and 5 is the mean of the 10 accuracy values obtained by cross-validation. That is why the test results are not as high as the training results.
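The fold construction can be sketched as follows. The shuffling seed and the way indices are dealt into folds are illustrative choices and do not reproduce the exact splits used by MATLAB or WEKA in the experiments.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k mutually exclusive folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for t in range(k):
        test = folds[t]
        train = [i for j, fold in enumerate(folds) if j != t for i in fold]
        yield train, test


# 157 instances, as in The Moods of Bloggers dataset.
splits = list(k_fold_indices(157, k=10))
print([len(test) for _, test in splits])
```

The reported test accuracy is then the mean of the accuracies obtained on the 10 test folds.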

In Tables 4 and 5, for binary and multiclass classification, respectively, the described algorithms are compared with other state-of-the-art classification algorithms (Naive Bayes, classification via regression, and J48 (decision tree)) run in WEKA (Waikato Environment for Knowledge Analysis), in terms of 10-fold cross-validation. Among the PCF algorithms, the best test values are obtained with Algorithms 4 and 6, since these algorithms allow misclassifications for both classes, which prevents overfitting. When the PCF algorithms are compared with the others regarding test values, Algorithms 4 and 6 are more efficient than the other state-of-the-art methods except classification via regression.

Besides the detailed experiment on the Moods of Bloggers dataset, we run the algorithms on real-world text datasets available in the UCI Machine Learning Repository. The datasets are represented in vector form, and the attribute types are real or integer. Each attribute corresponds to a precise word or stem in the entire dataset vocabulary. The TF-IDF formula is used for term weighting. These processes are described in detail in Section 3. Further details of the datasets used are given in Table 6 and explained as follows.

Burst Header Packet (BHP). Burst Header Packet flooding attack on Optical Burst Switching (OBS) Network Data Set includes 1075 instances with 22 attributes. The last attribute stands for the classes as NB-No Block, Block, No Block, and NB-Wait [51].

CNAE-9. CNAE-9 dataset contains 1080 documents of free text business descriptions of Brazilian companies categorized into a subset of 9 categories. This dataset is highly sparse (99.22% of the matrix is filled with zeros) [52].

Turkish Text Categorization (TTC). The Turkish text categorization dataset is a collection of Turkish news and articles that includes 3,600 categorized documents from 6 well-known portals in Turkey [53].

DBWorld E-Mails. DBWorld e-mails dataset contains 64 e-mails which are manually collected from DBWorld mailing list. They are classified as “announces of conferences” and “everything else”. Each attribute corresponds to a precise word or stem in the entire dataset vocabulary [54].

The obtained accuracy and time results are presented in Table 7, where “-” denotes an out-of-memory message in MATLAB. The results show that Algorithms 5 and 6 are not very effective in terms of running time, but it should be kept in mind that they are implemented in MATLAB (a software environment) and not in WEKA (a machine learning software). Comparing the accuracy results, Algorithm 5 is better than the others at constructing good separating functions between classes.

7. Conclusion

In this paper, supervised classification via polyhedral conic functions is used to solve text classification problems. Binary and multiclass classification algorithms via PCFs are presented, and numerical experiments are carried out by implementing these algorithms on a real-world dataset called “The Moods of Bloggers”. Accuracy, running time, and tenfold cross-validation results are used as performance metrics, and the obtained results are shown in tables. In addition, to extend the experiments and the comparison with state-of-the-art methods, the same work is done on four real-world text datasets available in the UCI Machine Learning Repository. The results indicate that classification algorithms via polyhedral conic functions are as usable for text classification as other state-of-the-art algorithms. In future studies, these algorithms can be tested on differently structured text datasets and with more efficient software.

Data Availability

The real-world datasets supporting the conclusions of this article are available in the UCI repository [http://archive.ics.uci.edu/ml/index.php]. “The Moods of Bloggers” dataset supporting the conclusions of this article is available in Kemik Natural Language Processing Group Datasets [http://www.kemik.yildiz.edu.tr/?id=28].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

All authors participated in every phase of research conducted for this paper. All authors read and approved the final manuscript.

Acknowledgments

Dr. Burak Ordin acknowledges TUBITAK for its support (Project no. 113E763).