Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling

Onan, Aytuğ

doi:https://doi.org/10.1155/2018/2497471

Computational and Mathematical Methods in Medicine

Special Issue

Soft Computing for Analysis of Biomedical Data

Research Article | Open Access

Volume 2018 | Article ID 2497471 | https://doi.org/10.1155/2018/2497471

Academic Editor: Federico Divina

Received 20 Jan 2018

Revised 29 May 2018

Accepted 31 May 2018

Published 22 Jul 2018

Abstract

Text mining is an important research direction that spans several fields, such as information retrieval, information extraction, and text categorization. In this paper, we propose an efficient multiple classifier approach to text categorization based on swarm-optimized topic modelling. Latent Dirichlet allocation (LDA) can overcome the high dimensionality problem of the vector space model, but identifying appropriate parameter values is critical to its performance. The swarm-optimized approach estimates the parameters of LDA, including the number of topics and all the other parameters involved in LDA. The hybrid ensemble pruning approach, based on combined diversity measures and clustering, aims to obtain a multiple classifier system with high predictive performance and better diversity. In this scheme, four different diversity measures (namely, disagreement measure, Q-statistics, the correlation coefficient, and the double fault measure) among classifiers of the ensemble are combined. Based on the combined diversity matrix, a swarm intelligence based clustering algorithm is employed to partition the classifiers into a number of disjoint groups, and one classifier (with the highest predictive performance) from each cluster is selected to build the final multiple classifier system. Experiments have been conducted on five biomedical text benchmarks. In the swarm-optimized LDA, different metaheuristic algorithms (such as genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm) are considered. In the ensemble pruning, five metaheuristic clustering algorithms are evaluated. The experimental results indicate that swarm-optimized LDA yields better predictive performance compared to the conventional LDA. In addition, the proposed multiple classifier system outperforms the conventional classification algorithms, ensemble learning, and ensemble pruning methods.

1. Introduction

The immense quantity of biomedical text documents can serve as an essential source of information for biomedical research. Biomedical text documents are characterized by an immense quantity of unstructured and sparse information in a wide range of forms, such as scientific articles, biomedical datasets, and case reports. Text mining aims to identify valuable information from unstructured text documents with the use of tools and techniques from several disciplines, such as machine learning, information retrieval, and computational linguistics. The use of text mining is one of the most promising tools in the biomedical domain that has attracted a lot of research interest. Text mining in biomedical domain can be successfully applied in a wide range of applications, including identification of disease-specific knowledge [1], diagnosis, treatment, and prevention of cancer [2], identification of obesity status of patients [3], identification of risk factors for heart disease [4], annotation of gene expression [5], and identification of drug targets and candidates [6].

Biomedical text mining follows the same stages (namely, format conversion, tokenization, stop word removal, normalization, stemming, dictionary construction, and vector space construction) utilized in text processing in other domains [7]. To build accurate classification schemes on text documents, one pivotal issue is to identify an appropriate representation model for the documents [8]. The vector space model (also known as the term vector model) is one of the most commonly employed representation schemes to process text documents, owing to its simple structure [9]. In this model, each text document is represented as a vector of identifiers (index terms). The vector space model suffers from a high dimensional feature space and from irrelevant and sparse features. Since each document is represented as a bag of words with the corresponding frequencies, words are regarded as statistically independent. Hence, word order is not taken into consideration [10].
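
As a small illustration of the bag-of-words representation described above, the following sketch builds term-frequency vectors for two toy documents. The function names (`build_vocabulary`, `to_term_vector`) and the sample tokens are hypothetical and chosen only for illustration; the documents are assumed to be already tokenized.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return a sorted list of the distinct tokens across all documents."""
    return sorted({tok for doc in docs for tok in doc})

def to_term_vector(doc, vocabulary):
    """Map one tokenized document to a term-frequency vector."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Two toy, already tokenized documents (illustrative only).
docs = [
    ["gene", "expression", "gene"],
    ["drug", "target", "gene"],
]
vocabulary = build_vocabulary(docs)
vectors = [to_term_vector(d, vocabulary) for d in docs]

# Reordering the tokens leaves the vector unchanged: word order is lost.
shuffled = to_term_vector(["gene", "gene", "expression"], vocabulary)
```

Each vector has one entry per vocabulary term, so the dimensionality grows with the vocabulary and most entries are zero, which is the sparsity issue noted above.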

Considering the limitations of the vector space model and the high dimensional unstructured nature of biomedical text documents, there are a number of representation schemes (such as the latent semantic analysis, the probabilistic latent semantic analysis, and the latent Dirichlet allocation) employed to process biomedical text documents [7]. The latent semantic analysis (LSA) is a scheme to extract and represent the contextual meaning of words with the use of statistical computations utilized on a large amount of text [11]. LSA can represent the semantic relations within the text. It can find the latent classes, while reducing the dimensionality of vector space model [12]. However, LSA has no strong statistical foundation and can suffer from high mathematical complexity [13]. The probabilistic latent semantic analysis (PLSA) is a statistical method for analysis of data which is based on a latent class model. PLSA has a strong statistical foundation. It can find latent topics and it can yield better performance compared to LSA [13].

The latent Dirichlet allocation (LDA) is an efficient generative probabilistic topic model, where each document is represented as a random mixture of latent topics. LDA can find latent topics, reduce the high dimensionality of the vector space model, and outperform other linguistic representation schemes, such as latent semantic analysis and probabilistic latent semantic analysis [14]. LDA involves several parameter values, such as the number of topics, the number of iterations for Gibbs sampling, the parameter that controls the topic distribution per document, and the parameter that models the distribution of terms per topic [57]. For unstructured text documents, information about the document-wise content and the number of relevant topics is not known in advance [58]. Hence, the identification of an appropriate value for the number of topics is a challenging problem for unstructured text documents. An insufficient or excessive number of topics can degrade the predictive performance of machine learning algorithms built on LDA-based topic modelling. In addition to the number of topics, LDA requires several other parameters. Therefore, finding an optimal configuration for LDA-based topic modelling involves extensive empirical analysis with different configurations.

In order to build robust classification schemes, multiple classifier systems (also known as ensemble classifiers) have been widely employed in the field of pattern recognition, owing to their remarkable improvement in generalization ability and predictive performance [15]. There are three main stages of the ensemble learning process, namely, ensemble generation, ensemble pruning, and ensemble combination [16, 17]. The ensemble generation stage is the phase in which the base learning algorithms to be utilized in the multiple classifier system are generated. The base learning algorithms can be generated either homogeneously or heterogeneously. The ensemble combination stage seeks to integrate the individual predictions of the base learning algorithms. The ensemble pruning stage aims to identify an optimal subset of base learning algorithms from the ensemble to enhance the predictive performance and computational efficiency. It has been empirically validated that ensemble pruning can yield more robust classification schemes [18].
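
To make the ensemble combination stage concrete, the sketch below integrates the predictions of several base classifiers by plain majority voting. Majority voting is used here only as one common combination rule and is an assumption of this example; the function name `majority_vote` and the toy labels are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions into one label per instance.

    predictions[c][i] is the label that classifier c assigns to instance i.
    Ties are resolved in favour of the label encountered first.
    """
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical base classifiers, three instances.
predictions = [
    ["pos", "neg", "pos"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "pos"],
]
combined = majority_vote(predictions)
```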

Considering these issues, we propose a multiple classifier approach to biomedical text categorization based on swarm-optimized topic modelling and ensemble pruning. In the presented scheme, swarm-optimized approach is employed to estimate the parameters of LDA, including the number of topics and all the other parameters involved in LDA. Motivated by the success of hybrid ensemble pruning schemes [19–21], the proposed approach combines diversity measures and clustering. In this scheme, four different diversity measures (namely, disagreement measure, Q-statistics, the correlation coefficient, and the double fault measure) are computed to capture the diversities within the ensemble. Based on these diversity measures, a combined diversity matrix is obtained. Based on this matrix, a swarm intelligence based clustering algorithm partitions the classification algorithms into a number of disjoint groups and one algorithm (with the highest predictive performance) from each cluster is selected to build the multiple classifier system. In the empirical analysis, five biomedical text benchmarks have been utilized. In the swarm-optimized LDA, different metaheuristic algorithms (such as genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm) are considered. In addition, five different metaheuristic clustering algorithms are considered in the ensemble pruning stage. The empirical analysis on biomedical text benchmarks indicates that swarm-optimized LDA yields better predictive performance compared to the conventional LDA. In addition, the proposed hybrid ensemble pruning scheme outperforms the conventional classification algorithms and ensemble learning methods.

The main contributions of our proposed categorization scheme can be summarized as follows:
(i) We introduced a metaheuristic approach to optimize the set of parameters utilized in LDA-based topic modelling. In this regard, the number of topics (k), the number of Gibbs iterations (n), the parameter to control the topic distribution per document, and the parameter to model distributions of terms per topic are considered. We conducted several experiments on swarm-optimized LDA with different metaheuristic algorithms (namely, genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm). To the best of our knowledge, this is the first comprehensive empirical analysis devoted to metaheuristic algorithms on LDA-based topic modelling.
(ii) We introduced an ensemble pruning approach based on combined diversity measures and metaheuristic clustering. To the best of our knowledge, this is the first study in ensemble pruning which utilizes metaheuristic clustering algorithms to obtain diversified base learning algorithms.
(iii) The presented classification scheme, which integrates swarm-optimized LDA-based modelling with the hybrid ensemble pruning scheme, is employed on biomedical text categorization. To the best of our knowledge, this is the first comprehensive study on LDA-based topic modelling and ensemble pruning on biomedical text categorization.

The rest of this paper is structured as follows. Section 2 presents related work on topic modelling and multiple classifier systems. Section 3 presents the theoretical foundations, Section 4 presents the proposed text categorization framework, Section 5 presents the experimental results, and Section 6 presents the concluding remarks.

2. Related Work

This section presents the related work on topic modelling and multiple classifier systems in biomedical text categorization.

2.1. Related Work on Topic Modelling

Topic models have been successfully employed to summarize large-scale collections of text documents. Probabilistic topic modelling methods can be utilized to identify the core topics of text collections. In addition, topic modelling schemes can be utilized in a variety of tasks in computational linguistics, such as analysis of source code documents [23], summarizing opinions of product reviews [24], identification of topic evolution [25], aspect detection in review documents [26], analysis of Twitter messages [27], and sentiment analysis [28, 29].

Probabilistic topic modelling has attracted the attention of researchers in the biomedical domain. Biomedical text collections suffer from high dimensionality, and topic modelling methods are effective tools to handle large-scale collections of documents. Hence, topic modelling can yield promising results on biological and biomedical text mining [30]. For instance, Wang et al. [31] presented a probabilistic topic modelling scheme to identify protein-protein interactions from the biological literature. In this scheme, the correlation between different methods and related words is modelled in a probabilistic way to extract the detection methods. In another study, Arnold et al. [32] utilized the latent Dirichlet allocation method to identify relevant clinical topics and to structure clinical text reports. Song and Kim [33] employed the latent Dirichlet allocation method to conduct bibliometric analysis on bioinformatics from full-text collections of PubMed Central articles. In another study, Sarioglu et al. [34] utilized topic modelling to represent clinical reports in a compact way, so that these collections can be efficiently processed. In another study, Bisgin et al. [35] applied topic modelling to drug labelling, which is a human-intensive task with many ambiguous semantic descriptions. In this way, manual annotation challenges can be eliminated. Likewise, Wang et al. [36] introduced a topic modelling based scheme to identify literature-driven annotations for gene sets. In this scheme, the number of topics to be utilized in topic modelling is empirically inferred through analysis with various parameter values (5, 10, 15, 20, etc.) for the number of topics. In another study, Bisgin et al. [37] employed latent Dirichlet allocation based topic modelling to identify interdependencies between cellular endpoints. The experimental analysis indicated that LDA can substantially enhance the understanding of systems biology.

Probabilistic topic modelling has also been employed to identify drug repositioning strategies [38]. Wang et al. [39] utilized topic modelling to analyze 17,723 abstracts from PubMed publications related to adolescent substance use and depression. In this study, topic modelling was employed to identify the literature and to capture other relevant topics. In another study, Wang et al. [40] presented a topic modelling based scheme to mine biomedical text collections. In this scheme, topic modelling was employed as a fine-grained preprocessing model. Recently, Sullivan et al. [41] utilized topic modelling to identify unsafe nutritional supplements from review documents. In another study, Chen et al. [42] employed probabilistic topic modelling to represent hospital admission processes in a compact way.

2.2. Related Work on Multiple Classifier Systems

Multiple classifier systems have been successfully employed in a wide range of applications in pattern recognition, including the biomedical domain. Empirical analysis on multiple classifier systems indicates that ensemble pruning can enhance the predictive performance of multiple classifier systems [18]. Ensemble pruning approaches can be mainly divided into five groups, namely, exponential search, randomized search, sequential search, ranking-based, and clustering-based methods [16]. Exponential approaches to ensemble pruning seek to examine all possible subsets of base learning algorithms within the multiple classifier system. For instance, Aksela [43] examined the predictive performance of several evaluation metrics (namely, correlation between errors, Q-statistics, and mutual information) in ensemble pruning. Randomized approaches to ensemble pruning aim to explore the search space of candidate classifiers with the use of metaheuristic algorithms. A wide range of metaheuristics, such as genetic algorithms, tabu search, and population based incremental learning, have been successfully utilized for ensemble pruning [44, 45]. For instance, Sheen and Sirisha [46] introduced an ensemble pruning scheme for malware detection based on harmony search. Likewise, Mendialdua et al. [47] utilized the estimation of distribution algorithm for ensemble pruning. In sequential search based methods, the search space of candidate classifiers is explored in the forward, backward, or forward-backward direction. For instance, Margineantu and Dietterich [48] introduced a sequential approach for ensemble pruning based on reduced error pruning with back-fitting. Similarly, Caruana et al. [49] presented a forward stepwise selection based approach for ensemble pruning. Recently, Dai et al. [50] introduced a reverse reduced error-based ensemble pruning algorithm based on the subtraction operation.

Ranking-based approaches to ensemble pruning aim to identify an optimal subset of classifiers based on a ranking obtained by a particular evaluation measure. For instance, Kotsiantis and Pintelas [51] presented a t-test based ranking scheme for ensemble pruning. More recently, Galar et al. [52] presented an ordering-based metric for ensemble pruning. Clustering-based approaches to ensemble pruning partition the base learning algorithms of the ensemble into clusters. For instance, Zhang and Cao [53] presented a spectral clustering based algorithm for ensemble pruning. In this scheme, the base learning algorithms were grouped into two clusters based on predictive performance and diversity. Then, one cluster was pruned and the other was retained as the pruned subset of classifiers.

2.3. Motivation and Contribution of the Study

As outlined above, probabilistic topic modelling methods are essential tools to identify hidden topics in large-scale collections of text documents. In order to enhance the performance of LDA, a number of extensions of the basic model have been proposed. For instance, Griffiths and Tenenbaum [54] introduced a hierarchical latent Dirichlet allocation model. In this model, topic distributions are identified from hierarchies of topics, where each hierarchy is modelled by a nested Chinese restaurant process. Each node of the tree corresponds to a particular topic, where each topic is associated with a distribution. In another study, Teh et al. [55] presented a hierarchical latent Dirichlet allocation scheme, in which the parameter value for the number of topics is inferred through the use of posterior inference. Grant and Cordy [56] introduced a heuristic approach to estimate the number of topics in source code analysis. In another study, Panichella et al. [57] presented a genetic algorithm based scheme to identify optimal configurations for latent Dirichlet allocation. In this scheme, the parameter set for topic modelling was estimated with the use of a genetic algorithm. The presented scheme was employed on three different tasks of software engineering, namely, traceability link recovery, feature location, and software artifact labelling. Likewise, Zhao et al. [58] introduced a heuristic approach to estimate the appropriate number of topics for latent Dirichlet allocation. In this scheme, the appropriate number of topics is identified through the use of the rate of perplexity change. Recently, Karami et al. [59] presented a fuzzy approach to topic modelling. In this scheme, fuzzy clustering was employed to identify the optimal number of topics.

In addition to the aforementioned five ensemble pruning approaches, hybrid methods have attracted research attention in pattern recognition. Hybrid approaches to ensemble pruning seek to integrate several ensemble pruning paradigms. For instance, Lin et al. (2014) introduced a hybrid ensemble pruning algorithm which integrates k-means clustering and dynamic selection. Similarly, Mousavi and Eftekhari [60] presented a hybrid ensemble pruning scheme which integrates static and dynamic ensemble selection with the NSGA-II multiobjective genetic algorithm. In another study, Cavalcanti et al. [21] presented a hybrid ensemble pruning algorithm based on a genetic algorithm and graph coloring. In this scheme, several different diversity measures (such as Q-statistics, correlation coefficient, Kappa statistics, and double fault measure) are combined via a genetic algorithm. Similarly, Onan et al. [19, 20] introduced a hybrid ensemble pruning algorithm based on consensus clustering and a multiobjective evolutionary algorithm. In this scheme, classifiers are assigned to clusters based on their predictive performance and the set of candidate classifiers is explored through the use of an evolutionary algorithm.

Recent studies on topic modelling indicate that the identification of an appropriate parameter value for the number of topics is an essential task to build robust classification schemes. In addition, hybrid ensemble pruning schemes can outperform conventional classifiers, ensemble learning methods, and ensemble pruning methods. Despite their potential use in text classification, very few works utilize metaheuristic algorithms to optimize the parameters of LDA, and very few utilize ensemble pruning schemes. To fill this gap, this paper presents a classification scheme based on swarm-optimized topic modelling and hybrid ensemble pruning for text categorization.

3. Theoretical Foundations

This section summarizes the theoretical foundations of the study. Namely, the latent Dirichlet allocation method, swarm-based optimization algorithms, ensemble learning methods, ensemble pruning methods, cluster validity indices, and pairwise diversity measures are presented.

3.1. The Latent Dirichlet Allocation

The latent Dirichlet allocation model (LDA) is a widely employed generative probabilistic model to identify the latent topics in text documents [22]. In LDA, each document is represented as a random mixture of latent topics and each topic is represented as a mixture of words. The mixture distributions are Dirichlet-distributed random variables to be inferred. In this scheme, each document exhibits the topics in different proportions, each word in each document is drawn from one of the topics, and topics are chosen based on the per-document distribution over topics [61]. LDA attempts to determine the underlying latent topic structure based on the observed data. In LDA, the words of each document correspond to the observed data. For each document in the corpus, words are obtained by following a two-staged procedure. Initially, a distribution over topics is randomly chosen. Then, for each word of the document, a topic is drawn from this distribution and a word is drawn from the corresponding topic [22]. In LDA, a word is a discrete data item from a vocabulary indexed by $\{1, \ldots, V\}$, a document is a sequence of $N$ words $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, and a corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$. The generative process of LDA is summarized in Box 1.
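
The two-staged generative procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Box 1: the vocabulary, the two topic-word distributions, and the Dirichlet parameter values are hypothetical, and the Dirichlet draw is obtained by normalizing Gamma samples from the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, topics, vocabulary, rng):
    """Generate one document following the two-stage LDA process.

    topics[t] is the word distribution of topic t over `vocabulary`.
    """
    theta = sample_dirichlet(alpha, rng)  # stage 1: per-document topic mixture
    assignments, words = [], []
    for _ in range(n_words):
        # Stage 2: pick a topic for this word, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        w = rng.choices(vocabulary, weights=topics[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

rng = random.Random(0)
vocabulary = ["gene", "protein", "drug", "dose"]
topics = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "genomics" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "pharmacology" topic
]
theta, z, doc = generate_document(8, [0.5, 0.5], topics, vocabulary, rng)
```

Inference runs in the opposite direction: only `doc` is observed, while `theta`, `z`, and the topic distributions have to be estimated.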

The LDA process can be modelled by a three-level Bayesian graphical model, as given in Figure 1. In this graphical model, nodes are used to represent random variables and edges are used to denote the possible dependencies between the variables. In this notation, α refers to the Dirichlet parameter, θ refers to the document-level topic variables, z refers to the per-word topic assignment, w refers to the observed word, and β indicates the topics [61].

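Based on this notation, the generative process of LDA corresponds to a joint distribution of the hidden and observed variables. The probability density function of a k-dimensional Dirichlet random variable is computed as given by (1), the joint distribution of a topic mixture is computed as given by (2), and the probability of a corpus is computed as given by (3) [22]:

$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1} \tag{1}$$

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \tag{2}$$

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d \tag{3}$$

In LDA, the computation of the posterior distribution of the hidden variables is an important inferential task. Exact inference is intractable, since the number of possible hidden topic structures is exponentially large. Hence, approximation algorithms (such as Laplace approximation, variational approximation, and Gibbs sampling) have been utilized in the LDA process [61].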

3.2. Ensemble Learning Methods

Ensemble learning methods aim to combine the predictions of multiple classification algorithms so that a classification model with higher predictive performance can be achieved [62]. Ensemble learning methods can be broadly divided into dependent and independent methods. In dependent methods, the outputs of former classifiers determine the outputs of the following classifiers. In contrast, in independent methods, the outputs of classifiers are individually identified and combined to produce the final prediction. Dependent ensemble methods include Boosting (e.g., the AdaBoost algorithm) and independent methods include Bagging, Dagging, and Random Subspace. To examine the predictive performance of the proposed scheme, four well-known ensemble learning methods (namely, AdaBoost [63], Bagging [64], Random Subspace [65], and Stacking [66]) are considered.

3.3. Ensemble Pruning Methods

The ensemble pruning methods aim to identify optimal subset of classification algorithms to improve the predictive performance and computational efficiency of multiple classifier systems. To examine the predictive performance of proposed ensemble pruning algorithm, we have employed four ensemble pruning algorithms. These methods are the ensemble pruning methods from libraries of models [49], Bagging ensemble selection [67], LibD3C algorithm [68], and ensemble pruning based on combined diversity measures [21].

3.4. Swarm-Based Optimization Algorithms

Swarm-based optimization algorithms, including genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm, have been successfully employed on applications of data science, such as data clustering and data categorization [68]. In the proposed scheme, swarm-based optimization algorithms have been utilized to optimize the set of parameters of LDA-based topic modelling. In addition, the proposed ensemble pruning algorithm employs swarm-based optimization algorithms to group classifiers into clusters. In the empirical analysis, genetic algorithms [69], particle swarm optimization algorithm [70], firefly algorithm [71], cuckoo search algorithm [72], and bat algorithm [73] are utilized.

3.5. Cluster Validity Indices

This section briefly introduces four cluster validity indices (namely, the Bayesian information criterion, Calinski-Harabasz index, Davies-Bouldin index, and Silhouette index), which are utilized to evaluate the clustering quality of different configurations of LDA.

The Bayesian information criterion (BIC) is computed as given below:

$$\mathrm{BIC} = -2 \ln(L) + v \ln(n)$$

where n denotes the number of topics, L denotes the likelihood of parameters to generate data in the model, and v denotes the number of free parameters in the Gaussian model [74]. The smaller the Bayesian information criterion, the better the generated model.

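The Calinski-Harabasz index (CH) is the ratio of the traces of the between-cluster scatter matrix and the internal scatter matrix, which is computed as given below [74]:

$$\mathrm{CH} = \frac{\operatorname{trace}(B) / (K - 1)}{\operatorname{trace}(W) / (N - K)}, \qquad \operatorname{trace}(B) = \sum_{k=1}^{K} n_k \lVert \bar{x}_k - \bar{x} \rVert^2, \qquad \operatorname{trace}(W) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^2$$

where K denotes the number of clusters, N denotes the number of data instances, $n_k$ denotes the number of elements in cluster $C_k$, $x_i$ denotes a point within cluster $C_k$, $\bar{x}_k$ denotes the centroid of cluster $C_k$, $\bar{x}$ denotes the centroid of all data instances, B denotes the between-cluster scatter matrix, which represents the error sum of squares between different clusters, and W denotes the internal scatter matrix, which represents the squared differences of instances in a cluster. Here, the trace of an n-by-n square matrix corresponds to the sum of the elements on the main diagonal [75].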
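
The following sketch computes the Calinski-Harabasz index for a toy partition, using the usual definition in which trace(B) is the size-weighted squared distance of cluster centroids to the overall centroid and trace(W) is the squared distance of points to their own centroid. The helper names and the four sample points are hypothetical, and the sketch assumes at least two clusters and a nonzero within-cluster scatter.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of points."""
    all_points = [p for c in clusters for p in c]
    n_total, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    # trace(B): between-cluster scatter, weighted by cluster size.
    trace_b = sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)
    # trace(W): within-cluster scatter around each cluster centroid.
    trace_w = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    return (trace_b / (k - 1)) / (trace_w / (n_total - k))

# Two well-separated toy clusters in the plane.
clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
ch = calinski_harabasz(clusters)
```

For this toy partition, trace(B) is 100 and trace(W) is 4, so the index is (100 / 1) / (4 / 2) = 50; larger values indicate better-separated, more compact clusters.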

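The Davies-Bouldin index (DB) is a cluster validity index which aims to maximize the between-cluster distance and to minimize the distance between the centroids of clusters and the other data points. It is defined as given by the following equation:

$$\mathrm{DB} = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \frac{\bar{d}_i + \bar{d}_j}{d(c_i, c_j)}, \qquad \bar{d}_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} d(x, c_i)$$

where c denotes the number of clusters, i and j correspond to cluster labels, $d(c_i, c_j)$ corresponds to the distance between the centroids of clusters $C_i$ and $C_j$, $\bar{d}_i$ denotes the average distance between the centroid $c_i$ and the data points of cluster $C_i$, and $x$ corresponds to a data point within cluster $C_i$. The smaller the DB criterion, the better the generated model.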
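
A minimal sketch of the Davies-Bouldin computation is given below, assuming the standard form of the index: for each cluster, the worst-case ratio of summed within-cluster scatter to centroid distance is taken, and these ratios are averaged. The helper names and sample points are hypothetical; `math.dist` requires Python 3.8 or later, and the cluster centroids are assumed to be distinct.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points."""
    cents = [centroid(c) for c in clusters]
    # Average distance of each cluster's points to its own centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
db = davies_bouldin(clusters)
```

Here each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2; smaller values indicate a better partition.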

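The Silhouette index (SI) is defined as given by (9):

$$\mathrm{SI} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max\{a(x), b(x)\}} \tag{9}$$

where N denotes the number of clusters, $n_i$ denotes the size of cluster $C_i$, $a(x)$ denotes the average distance between instance $x$ and all other instances in its own cluster, and $b(x)$ denotes the minimum distance from $x$ to the centroids of the clusters not containing $x$.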
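
The sketch below follows the centroid-based variant of the Silhouette index described above. It rests on two assumptions that are not spelled out in the text: a(x) is taken as the mean distance from x to the other members of its own cluster (0 for a singleton cluster), and no point coincides with a foreign centroid while also having a(x) = 0. Helper names and sample points are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def silhouette_index(clusters):
    """clusters: list of clusters, each a list of points (at least two clusters)."""
    cents = [centroid(c) for c in clusters]
    total = 0.0
    for i, members in enumerate(clusters):
        cluster_sum = 0.0
        for x in members:
            others = [p for p in members if p is not x]
            # a(x): mean distance to the other members of the same cluster.
            a = sum(math.dist(x, p) for p in others) / len(others) if others else 0.0
            # b(x): distance to the nearest centroid of a different cluster.
            b = min(math.dist(x, cents[j]) for j in range(len(clusters)) if j != i)
            cluster_sum += (b - a) / max(a, b)
        total += cluster_sum / len(members)
    return total / len(clusters)

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
si = silhouette_index(clusters)
```

Values close to 1 indicate that instances are much closer to their own cluster than to any other cluster.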

3.6. Pairwise Diversity Measures

This section briefly introduces four diversity measures (namely, disagreement measure, Q-statistics, the correlation coefficient, and the double fault measure) which are utilized in the proposed ensemble classification scheme.

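Q-statistics ($Q_{i,k}$), the correlation coefficient ($\rho_{i,k}$), the disagreement measure ($Dis_{i,k}$), and the double fault measure ($DF_{i,k}$) between two classifiers $D_i$ and $D_k$ are computed using (12), (13), (14), and (15), respectively [76]:

$$Q_{i,k} = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{01} N^{10}} \tag{12}$$

$$\rho_{i,k} = \frac{N^{11} N^{00} - N^{01} N^{10}}{\sqrt{(N^{11} + N^{10})(N^{01} + N^{00})(N^{11} + N^{01})(N^{10} + N^{00})}} \tag{13}$$

$$Dis_{i,k} = \frac{N^{01} + N^{10}}{N^{11} + N^{10} + N^{01} + N^{00}} \tag{14}$$

$$DF_{i,k} = \frac{N^{00}}{N^{11} + N^{10} + N^{01} + N^{00}} \tag{15}$$

where $N^{11}$, $N^{00}$, $N^{10}$, and $N^{01}$ denote the number of instances correctly classified by both classifiers, the number of instances incorrectly classified by both classifiers, the number of instances correctly classified by $D_i$ and incorrectly classified by $D_k$, and the number of instances correctly classified by $D_k$ and incorrectly classified by $D_i$, respectively.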
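
As an illustration, the four pairwise measures can be computed from two per-instance correctness vectors (1 if the classifier labelled the instance correctly, 0 otherwise). The function names and the toy vectors below are hypothetical, and the sketch assumes that the denominators of Q and of the correlation coefficient are nonzero.

```python
import math

def contingency(correct_i, correct_k):
    """Count N11, N00, N10, N01 from two per-instance correctness lists."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_i, correct_k):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    return n11, n00, n10, n01

def pairwise_diversity(correct_i, correct_k):
    """Return Q-statistic, correlation, disagreement, and double fault."""
    n11, n00, n10, n01 = contingency(correct_i, correct_k)
    total = n11 + n00 + n10 + n01
    cross = n11 * n00 - n01 * n10
    return {
        "Q": cross / (n11 * n00 + n01 * n10),
        "rho": cross / math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)),
        "Dis": (n01 + n10) / total,
        "DF": n00 / total,
    }

# Correctness of two hypothetical classifiers on eight instances.
d_i = [1, 1, 1, 1, 0, 0, 1, 0]
d_k = [1, 1, 0, 1, 0, 1, 1, 0]
measures = pairwise_diversity(d_i, d_k)
```

For these vectors the counts are N11 = 4, N00 = 2, N10 = 1, and N01 = 1, which gives Q = 7/9, a correlation of 7/15, a disagreement of 0.25, and a double fault of 0.25.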

4. The Proposed Text Categorization Framework

The proposed text categorization framework combines the swarm-optimized Latent Dirichlet allocation and diversity-based hybrid ensemble pruning scheme. The rest of this section explains the methods utilized in the proposed biomedical text categorization framework.

4.1. Swarm-Optimized Latent Dirichlet Allocation

The latent Dirichlet allocation (LDA) is an efficient generative probabilistic model that can be employed to represent unstructured text documents in an efficient way. In general, LDA-based topic modelling involves the calibration of several parameters, summarized as follows:
(i) Number of topics in LDA-based topic modelling (k).
(ii) α parameter to control the topic distribution per document. A higher value for the α parameter denotes better smoothing of topics for each document.
(iii) β parameter to model distributions of terms per topic.

In order to reduce the computational complexity of LDA, it is usually employed in conjunction with an approximation method. In this work, we utilized the Gibbs sampling method in conjunction with LDA. In this way, the number of iterations (N) for sampling is also involved as an additional parameter value. Identifying an optimal configuration of parameter values for LDA is a challenging task. Without appropriate parameter values, the LDA-based representation may degrade the predictive performance of classification schemes. Too few or too many topics can result in poor predictive performance. Hence, finding an optimal configuration for LDA-based topic modelling involves extensive empirical analysis. Exhaustively enumerating possible parameter values for LDA to identify an optimal configuration is computationally expensive, since a wide range of parameter values must be examined.

In this paper, five metaheuristic algorithms (namely, genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm) are utilized to calibrate the parameters of LDA. In this scheme, values of all parameters of LDA are taken into consideration. Hence, various values for each parameter are evaluated to find an optimal configuration. In the presented problem, the first issue is to examine the merit of a particular LDA-based configuration. In order to evaluate the merit of a particular configuration of LDA before employing on a particular task, we have employed four internal cluster validity indices, namely, the Bayesian information criterion, Calinski-Harabasz index, Davies-Bouldin index, and Silhouette index. Higher clustering quality of a particular LDA-based configuration tends to yield higher predictive performance on LDA-based categorization tasks [19, 20]. For this reason, we seek to identify an LDA configuration which maximizes the overall clustering quality of LDA configuration.

Since exhaustively enumerating possible configurations for LDA can be a computationally infeasible task, the identification of a parameter set which maximizes the overall clustering quality can be modelled as an optimization problem. In the presented scheme, five swarm-based optimization algorithms (namely, genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm) have been considered. The presented approach seeks to find an LDA configuration [k, α, β, N] which maximizes the clustering quality in terms of internal cluster validity indices (Bayesian information criterion, Calinski-Harabasz index, Davies-Bouldin index, and Silhouette index). The presented scheme starts with a randomly generated population of initial configurations. Then, the randomly generated LDA configurations are utilized to cluster text documents. The merit of the clusters is evaluated using the four internal clustering validity indices, and the swarm-based optimization algorithms are utilized to optimize the parameter values. In Figure 2, the general structure of swarm-optimized LDA is summarized.
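The search loop described above can be sketched with a basic global-best particle swarm optimizer over the configuration [k, α, β, N]. This is a simplified illustration under stated assumptions, not the procedure used in this work: the search ranges in `BOUNDS`, the swarm settings, and all function names are hypothetical, and `toy_fitness` is only a stand-in for the real objective, which would run Gibbs-sampled LDA with the candidate configuration, cluster the documents, and return an internal cluster validity score.

```python
import random

# Hypothetical search ranges for the LDA configuration [k, alpha, beta, N].
BOUNDS = [(10, 200), (0.01, 1.0), (0.01, 1.0), (100, 2000)]


def decode(position):
    """Map a continuous particle position to an LDA configuration."""
    k, alpha, beta, n_iter = position
    return {"k": int(round(k)), "alpha": alpha, "beta": beta, "N": int(round(n_iter))}


def pso(fitness, bounds, n_particles=15, n_steps=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best particle swarm optimization that maximizes `fitness`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [fitness(decode(p)) for p in pos]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for _ in range(n_steps):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (
                    w * vel[i][d]
                    + c1 * r1 * (best_pos[i][d] - pos[i][d])
                    + c2 * r2 * (g_pos[d] - pos[i][d])
                )
                # Keep the particle inside the search range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(decode(pos[i]))
            if f > best_fit[i]:
                best_fit[i], best_pos[i] = f, pos[i][:]
                if f > g_fit:
                    g_fit, g_pos = f, pos[i][:]
    return decode(g_pos), g_fit


def toy_fitness(cfg):
    """Stand-in objective (NOT a real validity index); peaks at k=50, alpha=0.1, beta=0.05."""
    return -(
        ((cfg["k"] - 50) / 50) ** 2
        + (cfg["alpha"] - 0.1) ** 2
        + (cfg["beta"] - 0.05) ** 2
    )
```

Calling `pso(toy_fitness, BOUNDS)` returns the best configuration found and its score. Replacing `toy_fitness` with a function that evaluates a real validity index would turn the sketch into the kind of loop outlined in the text; the other four metaheuristics would differ only in how candidate positions are updated.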

4.2. Diversity-Based Ensemble Pruning

Diversity-based ensemble pruning approach is a hybrid ensemble pruning scheme, which integrates combined pairwise diversity measures and swarm-based clustering algorithms. The presented ensemble pruning method consists of two main stages, namely, computation of pairwise diversity matrices among the base learning algorithms of the ensemble and swarm-based clustering on combined pairwise diversity matrix to obtain final base learning algorithms of the pruned ensemble.

The general structure of the diversity-based ensemble pruning algorithm is presented in Figure 3. Initially, many different base learning algorithms (classification algorithms) from the model library with varying parameter values have been taken as the initial set of classifiers. The model library contains classification algorithms from five groups, namely, five Bayesian classifiers, fourteen function based classifiers, ten instance based classifiers, three rule based classifiers, and eight decision tree classifiers. The detailed description of the classification algorithms of the model library is presented in Table 2. Classification algorithms of the model library have been trained on the training set. In this way, the predictive characteristics of different learning algorithms have been obtained.

After training classification algorithms, pairwise diversity matrices are computed. The diversity and accuracy are two essential factors to build multiple classifier systems with high predictive performance. There are many pairwise and nonpairwise diversity measures presented in the literature. Different diversity measures concentrate on different aspects of the diversity and there is not a widely accepted definition for the term. Motivated by the success of the combined diversity measures in the ensemble pruning [21], we seek to find an appropriate subset of diversity measures. In this regard, we have conducted an experimental analysis with five widely utilized diversity measures (namely, Q-statistics, correlation coefficient, disagreement measure, double fault measure, and kappa statistics). Since there are five diversity measures, we have evaluated 2⁵-1=31 different subset cases. The values obtained for each measure are normalized. Since the highest predictive performance is obtained by averaging the four diversity measures (Q-statistics, correlation coefficient, disagreement measure, and double fault measure), this configuration is utilized in the proposed ensemble pruning. For four pairwise diversity measures mentioned above, the diversity values of each pair of classifiers are computed using the validation set. Then, the combined pairwise diversity matrix is obtained from the four pairwise diversity matrices by averaging the diversity values of the individual diversity matrices.
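The four pairwise measures retained in the combined diversity matrix can be computed from the joint correct/incorrect outcomes of two classifiers on the validation set. The sketch below uses the standard definitions of the Q-statistic, correlation coefficient, disagreement measure, and double fault measure; the function names are hypothetical, and returning 0.0 when a denominator vanishes is an assumption made only to keep the example total.

```python
import math


def contingency(correct_a, correct_b):
    """Count joint outcomes of two classifiers.

    Inputs are equal-length sequences of booleans (True = correct prediction).
    Returns (n11, n10, n01, n00): both correct, only A correct,
    only B correct, both wrong.
    """
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def pairwise_diversity(correct_a, correct_b):
    """Q-statistic, correlation, disagreement, and double fault for one pair."""
    n11, n10, n01, n00 = contingency(correct_a, correct_b)
    n = n11 + n10 + n01 + n00
    cross = n11 * n00 - n01 * n10
    q_den = n11 * n00 + n01 * n10
    rho_den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return {
        # Assumption: undefined ratios are reported as 0.0.
        "q_statistic": cross / q_den if q_den else 0.0,
        "correlation": cross / rho_den if rho_den else 0.0,
        "disagreement": (n01 + n10) / n,
        "double_fault": n00 / n,
    }
```

Evaluating `pairwise_diversity` for every pair of classifiers yields one matrix per measure; after normalization, averaging the four matrices entry by entry gives a combined matrix of the kind described above. How the differently oriented measures are aligned before averaging is not specified here and would need to be decided in an actual implementation.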

After computation of the combined pairwise diversity matrix, clustering has been employed on the combined diversity matrix. Clustering is a widely employed technique for ensemble pruning, which aims to group classification algorithms into clusters such that classifiers with similar characteristics are assigned to the same cluster. By selecting classifiers from different clusters, a multiple classifier system with high diversity can be achieved. In this study, five metaheuristic clustering algorithms (namely, genetic algorithm based clustering, particle swarm clustering, firefly clustering, cuckoo clustering, and bat clustering) have been employed on the combined diversity matrix. Based on the clustering results, the classification algorithms have been assigned to a number of clusters.

In the empirical analysis with five metaheuristic clustering algorithms, the highest predictive performance is achieved by the firefly clustering algorithm. Hence, we utilized the firefly clustering scheme to cluster classification algorithms on the combined diversity matrix based on their predictive characteristics. Let A denote an agent that consists of m n-dimensional points, P denote a set of l n-dimensional points, p_i denote an n-dimensional point contained in P, and Dist(A, P) denote the distance between A and P; the general structure of the firefly clustering algorithm utilized in the proposed scheme is outlined in Box 2.

After applying the clustering algorithm on the combined pairwise diversity matrix, the clustering results are utilized to select the classifiers of the pruned ensemble. In order to do so, the classifiers of each cluster are ranked based on their predictive performance (in terms of classification accuracy). Then, the classifier with the highest predictive performance is selected from each cluster. Let N denote the number of clusters obtained at the end of the firefly clustering algorithm. Since one classifier is selected from each cluster, N classifiers constitute the pruned ensemble. In order to combine the predictions of the selected classifiers, a majority voting scheme is employed.
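The selection and combination steps above can be sketched as follows. This is a minimal illustration with hypothetical function names and input structures, assuming that cluster assignments and per-classifier accuracies are already available; ties in the vote are resolved in favour of the label encountered first, which is an assumption of this sketch rather than a rule stated in the text.

```python
from collections import Counter


def prune_ensemble(cluster_of, accuracy):
    """Keep the most accurate classifier from each cluster.

    cluster_of: dict mapping classifier name -> cluster id
    accuracy:   dict mapping classifier name -> classification accuracy
    """
    best = {}
    for name, cid in cluster_of.items():
        if cid not in best or accuracy[name] > accuracy[best[cid]]:
            best[cid] = name
    return sorted(best.values())


def majority_vote(predictions):
    """Combine per-classifier label lists by majority voting, instance by instance."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

With N clusters, `prune_ensemble` returns N classifier names, and `majority_vote` is then applied to the prediction lists of those N classifiers only.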

5. Experimental Analysis

In order to examine the predictive performance of the proposed biomedical text categorization scheme, an extensive empirical analysis has been performed. This section presents the datasets utilized in the analysis, the experimental procedure, and the experimental results.

5.1. Dataset

The experimental analysis has been conducted on five public biomedical text categorization datasets: the Oh5, Oh10, Oh15, Ohscal, and Ohsumed-400 collections [77]. All five collections are part of the OHSUMED collection, and each contains biomedical text documents. The basic descriptive information about the biomedical text collections utilized in the empirical analysis, including the number of terms extracted after preprocessing, is summarized in Table 1.

5.2. Evaluation Metrics

In order to evaluate the predictive performance of the presented biomedical text categorization scheme, classification accuracy (ACC) and F-measure have been employed as the evaluation measures.

Classification accuracy is one of the most widely utilized measures in the performance evaluation of classification algorithms. It is the proportion of true positives and true negatives obtained by the classifier among the total number of instances, as given by the following equation:

$$\mathrm{ACC} = \frac{TN + TP}{TP + FP + FN + TN}$$

where TN, TP, FP, and FN represent the number of true negatives, true positives, false positives, and false negatives, respectively.

F-measure is another common measure in the performance evaluation of classification algorithms. F-measure is the harmonic mean of the precision and recall of a classification algorithm. It can take values between 0 and 1, and higher values of F-measure indicate better predictive performance. Based on the characteristics of the datasets utilized in the empirical analysis, there are two variants of F-measure, namely, micro-averaged F-measure and macro-averaged F-measure. The micro-averaged F-measure extends F-measure to multiclass problems by averaging precision and recall values across all classes. However, F-measure and micro-averaged F-measure cannot focus entirely on rare classes [78]. Since some of the datasets utilized in the empirical analysis are imbalanced, the macro-averaged F-measure is also utilized as another evaluation measure. The macro-averaged F-measure, which determines the average F-measure across all one-versus-all classes, is computed as given by (17):

$$F_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c} \qquad (17)$$

where C denotes the number of classes and TP_c, FP_c, and FN_c represent the number of true positives, false positives, and false negatives for class c, respectively.
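To make the difference between the two measures concrete on imbalanced data, the following sketch computes classification accuracy and the macro-averaged F-measure from true and predicted labels. The function names are hypothetical, and assigning an F-measure of 0 to a class with no true positives, false positives, or false negatives is an assumption of the sketch.

```python
def per_class_counts(y_true, y_pred):
    """One-versus-all TP/FP/FN counts for every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts


def accuracy(y_true, y_pred):
    """Proportion of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class F-measures."""
    scores = []
    for c in per_class_counts(y_true, y_pred).values():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores.append(2 * c["tp"] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A classifier that always predicts the majority class on a dataset with eight instances of one class and two of another reaches an accuracy of 0.8, while its macro-averaged F-measure is only about 0.44, because the rare class contributes an F-measure of 0.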

5.3. Experimental Procedure

In the experimental analysis, each dataset is divided into ten folds (parts). In this scheme, six folds are utilized for training, two folds for validation, and two folds for testing. The experimental analysis is performed with the machine learning toolkit WEKA (Waikato Environment for Knowledge Analysis) version 3.9, which is an open-source platform with many machine learning algorithms implemented in Java [79]. The presented classification scheme is also implemented in Java. In the empirical analysis on swarm-based latent Dirichlet allocation, the Naïve Bayes algorithm and support vector machines are utilized as the base learning algorithms. In order to compare the presented multiple classifier system, four well-known ensemble methods (namely, AdaBoost, Bagging, Random Subspace, and Stacking) have been considered. For the AdaBoost, Bagging, and Random Subspace algorithms, Naïve Bayes and support vector machines are utilized as the base learners. In Stacking (stacked generalization), the classifier ensemble consists of five base learners (namely, Naïve Bayes, support vector machines, logistic regression, Bayesian logistic regression, and linear discriminant analysis). For ensemble selection from libraries of models (ESM) and Bagging ensemble selection (BES), the same model library presented in Table 2 has been utilized [19, 20].
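The six/two/two partition described above can be sketched as follows. The experiments themselves were run in WEKA and Java; this Python fragment is only an illustration with a hypothetical function name, and it assumes a simple shuffled (non-stratified) assignment of instances to folds, which the text does not specify.

```python
import random


def six_two_two_split(n_instances, seed=0):
    """Shuffle instance indices into ten folds: six train, two validation, two test."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i receives every tenth shuffled index, starting at offset i.
    folds = [indices[i::10] for i in range(10)]
    train = [i for fold in folds[:6] for i in fold]
    validation = [i for fold in folds[6:8] for i in fold]
    test = [i for fold in folds[8:] for i in fold]
    return train, validation, test
```

The training indices would be used to fit the classifiers of the model library, the validation indices to compute the pairwise diversity values, and the test indices to report the final results.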

For evaluating ensemble pruning schemes, we have adopted the scheme outlined in [19, 20]. In the experimental analysis, ESM, BES, and LibD3C algorithms are considered with different parameter values. For ESM algorithm, four different schemes (namely, forward selection, backward elimination, forward-backward selection, and the best model scheme) have been considered. In ESM algorithm, root mean squared error (RMSE), classification accuracy (ACC), ROC area, precision, recall, and F-measure are considered as the evaluation measures. For BES algorithm, different bag sizes ranging from 10 to 100 are considered. In this algorithm, root mean squared error (RMSE), accuracy (ACC), ROC area, precision, recall, F-measure, and the combination of all metrics are employed as the evaluation measures. For LibD3C algorithm, five different ensemble combination rules (namely, average of probabilities, product of probabilities, majority voting, minimum probability, and maximum probability) are considered. In the experimental analysis, the highest predictive performances obtained from these algorithms are reported. In Table 3, the parameter values of metaheuristic algorithms utilized in swarm-based LDA are presented. In Table 4, parameters of metaheuristic clustering algorithms utilized in the ensemble pruning stage are given. The parameters of the metaheuristic algorithms utilized in the swarm-based LDA stage and the parameters of the metaheuristic algorithms utilized in the ensemble pruning stage are determined based on the benchmark empirical results for the algorithms [80, 81].

5.4. Experimental Results and Discussion

The presented biomedical text categorization framework consists of two main stages, namely, swarm-optimized latent Dirichlet allocation stage and diversity-based ensemble pruning stage.

Swarm-optimized latent Dirichlet allocation stage aims to estimate the parameters of LDA. In the empirical analysis on LDA, five different metaheuristic algorithms (namely, genetic algorithms, particle swarm optimization, firefly algorithm, cuckoo search algorithm, and bat algorithm) are considered. To evaluate the clustering quality of different configurations of LDA, four internal cluster validity indices (namely, the Bayesian information criterion, Calinski-Harabasz index, Davies-Bouldin index, and Silhouette index) are considered. In addition, the proposed scheme presents an ensemble pruning based on combined diversity measures and metaheuristic clustering. In the tables, the highest (the best) results achieved by a particular configuration are indicated as both boldface and underline and the second best results are indicated as both boldface and italics.

In order to evaluate the merit of swarm-optimized topic modelling in LDA, Table 5 presents the classification accuracies obtained by different LDA-based configurations with Naïve Bayes and support vector machine classifiers. To verify the impact of ensemble pruning method in the presented scheme, Table 6 presents the classification results obtained by conventional algorithms, ensemble learning methods, conventional ensemble pruning methods, and the proposed diversity-based ensemble pruning method. For the results reported in Table 6, the biomedical text categorization datasets are represented with LDA (k=50); i.e., swarm-optimized latent Dirichlet allocation stage has not been applied for the results presented in Table 6 to examine the predictive performance of the proposed ensemble pruning scheme. Finally, Table 7 compares the predictive performance of conventional algorithms, ensemble learning methods, conventional ensemble pruning methods, and the proposed diversity-based ensemble pruning method when swarm-optimized latent Dirichlet allocation stage has been applied to represent the dataset.

As can be observed from the classification accuracies presented in Table 5, the performance of LDA-based representation schemes generally improves when metaheuristic algorithms are used in conjunction with LDA to estimate its parameters. Among the different metaheuristic algorithms, the highest predictive performance is obtained by bat algorithm based LDA with Davies-Bouldin index based evaluation. The second highest predictive performance is obtained by cuckoo search algorithm based LDA with Davies-Bouldin index based evaluation. Regarding the performance of different evaluation measures, the highest performance is achieved by Davies-Bouldin index based configurations. The second highest predictive performance is achieved by Silhouette index based configurations, followed by Calinski-Harabasz index based configurations. Regarding the performance of conventional LDA-based representation schemes, the highest predictive performance is generally achieved when k=50. The predictive performance patterns obtained by different LDA-based configurations with the Naïve Bayes algorithm also hold for LDA-based configurations with the support vector machines algorithm.

In the empirical analysis on the ensemble pruning, five swarm-based clustering algorithms (namely, genetic clustering, particle swarm-based clustering, firefly clustering, cuckoo clustering, and bat clustering) have been considered. Regarding the predictive performance obtained by conventional classification algorithms, support vector machines algorithm outperforms Naïve Bayes algorithm for the compared datasets. In addition, Bagging ensemble of Naïve Bayes algorithm yields better predictive performance compared to Naïve Bayes algorithm. In general, the predictive performance is enhanced with the use of conventional ensemble learning methods (namely, Bagging, AdaBoost, and Random Subspace algorithm). As can be seen from the results reported in Table 6, conventional ensemble pruning methods outperform the conventional classification algorithms and ensemble learning schemes. In addition, hybrid ensemble pruning schemes (the proposed diversity-based ensemble pruning method, LibD3C algorithm, and ensemble pruning based on combined diversity measures) outperform the other ensemble pruning schemes (ensemble selection from libraries of models and Bagging ensemble selection). The highest predictive performance is obtained by the proposed diversity-based ensemble pruning scheme with firefly clustering. The second highest predictive performance is generally obtained by the proposed diversity-based ensemble pruning scheme with cuckoo clustering.

Based on the extensive empirical analysis with different metaheuristic algorithms in swarm-based LDA and with different clustering algorithms in diversity-based ensemble pruning algorithm, the highest predictive performance is obtained by bat algorithm based LDA with Davies-Bouldin index and diversity-based ensemble pruning with firefly clustering. In Table 7, the predictive performance of the proposed biomedical text categorization scheme is compared with two classification algorithms (namely, Naïve Bayes algorithm and support vector machines), four ensemble methods (namely, Bagging, AdaBoost, Random Subspace, and Stacking), and four ensemble pruning methods (namely, ensemble selection from libraries of models, Bagging ensemble selection, LibD3C algorithm, and ensemble pruning based on combined diversity measures). For the results reported in Table 7, the biomedical text categorization datasets are represented with bat algorithm based LDA with Davies-Bouldin index (BA-LDA (DB)). As can be observed from the results outlined in Table 7, the proposed scheme outperforms the conventional classifiers, ensemble learning methods, and ensemble pruning methods.

In addition to classification accuracy, the predictive performances of classification algorithms, ensemble learning methods, and ensemble pruning methods have been also examined in terms of the macro-averaged F-measure. In Table 8, the macro-averaged F-measure results obtained by different LDA-based configurations with Naïve Bayes and support vector machine classifiers are presented. Regarding the macro-averaged F-measure results presented in Table 8, the highest predictive performance is obtained by bat algorithm based LDA with Davies-Bouldin index based representation. The same patterns obtained in terms of classification accuracies presented in Table 5 are also valid for F-measure based results. Hence, the utilization of metaheuristic optimization algorithms in conjunction with LDA to calibrate its hyper-parameters enhances the predictive model.

To examine the performance improvement achieved by the proposed ensemble pruning scheme, Table 9 presents the macro-averaged F-measure values obtained by conventional algorithms, ensemble learning methods, conventional ensemble pruning methods, and the proposed diversity-based ensemble pruning method. For the results reported in Table 9, the biomedical text categorization datasets are represented with LDA (k=50); i.e., swarm-optimized latent Dirichlet allocation stage has not been applied for the results presented in Table 9. Regarding the macro-averaged F-measure results presented in Table 9, the highest predictive performance is obtained by the proposed diversity-based ensemble pruning scheme with firefly clustering. The second highest predictive performance is generally obtained by the proposed diversity-based ensemble pruning scheme with cuckoo clustering and ensemble pruning based on combined diversity.

In Table 10, the macro-averaged F-measure results obtained by classification algorithms, ensemble learning methods, and ensemble pruning methods are presented. For the results reported in Table 10, the biomedical text categorization datasets are represented with bat algorithm based LDA with Davies-Bouldin index (BA-LDA (DB)). Regarding the macro-averaged F-measure results, the proposed scheme outperforms the conventional classifiers, ensemble learning methods, and ensemble pruning methods.

To statistically validate the results obtained in the empirical analysis, we have performed the two-way ANOVA (analysis of variance) test in the Minitab statistical program. The two-way ANOVA test is an extension of the one-way ANOVA test, which aims to evaluate the effect of two different categorical independent variables on one dependent variable. In the two-way ANOVA test, both the main effect of each independent variable and their interactions are taken into assessment. The results for the two-way ANOVA test of overall results (in terms of classification accuracy) are presented in Table 11, where DF, SS, MS, F, and P denote degrees of freedom, adjusted sum of squares, adjusted mean square, F-Value, and probability value, respectively. Degrees of freedom are the amount of information in the data. The adjusted sum of squares term (SS) denotes the amount of variation in the response data that is explained by each term of the model. F-statistics (F) is the test statistic to identify whether a term is associated with the response and the probability value (P) is used to determine the statistical significance of the terms and model. The results presented in Table 11 are divided into three parts. The upper part of the table denotes the statistical analysis of results on the different LDA-based configurations, the middle part of the table denotes the statistical analysis of results on ensemble pruning, and the lower part of the table denotes the statistical analysis of results on conventional classifiers, ensemble learning methods, and ensemble pruning methods. For the two-way ANOVA test, two different factors (different datasets and different algorithmic configurations) are taken as categorical independent variables. In addition, the interaction among these factors is also taken into consideration. According to the results presented in Table 11, the probability value is P<0.001 for the different factors and their interactions. Hence, there are statistically meaningful differences between the predictive performances of compared methods. The performance gain obtained by swarm-optimized LDA is statistically meaningful. Similarly, the performance gain obtained by the proposed ensemble pruning method is also statistically meaningful (P<0.001).
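For reference, the quantities reported in the ANOVA tables are related by the standard two-way decomposition with interaction. The block below states it for a balanced design, which is an assumption of this illustration; the symbols a (number of datasets), b (number of algorithmic configurations), and r (number of replicated results per cell) are introduced here and do not appear in the tables.

```latex
\begin{align}
SS_{T} &= SS_{A} + SS_{B} + SS_{AB} + SS_{E}, \\
DF_{A} &= a - 1, \quad DF_{B} = b - 1, \quad
DF_{AB} = (a - 1)(b - 1), \quad DF_{E} = ab\,(r - 1), \\
MS_{X} &= \frac{SS_{X}}{DF_{X}}, \qquad X \in \{A, B, AB, E\}, \\
F_{X} &= \frac{MS_{X}}{MS_{E}}, \qquad X \in \{A, B, AB\}.
\end{align}
```

Here A and B stand for the two factors, AB for their interaction, and E for the residual error. The probability value P of each term is the upper-tail probability of its F value under an F distribution with (DF_X, DF_E) degrees of freedom.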

The results for the two-way ANOVA test of overall results (in terms of the macro-averaged F-measure values) are presented in Table 12. According to the results presented in Table 12, there are statistically meaningful differences between the predictive performances of compared methods (P<0.001).

In Figure 4, the confidence intervals for the mean values of classification accuracies obtained by the different LDA-based configuration schemes are presented. Similarly, in Figure 5, the confidence intervals for the mean values of classification accuracies obtained by the conventional classifiers, ensemble learners, and ensemble pruning methods are presented. For results depicted in Figure 5, the biomedical text categorization datasets are represented with LDA (k=50); i.e., swarm-optimized latent Dirichlet allocation stage has not been applied. In contrast, in Figure 6, the confidence intervals for the mean values of classification accuracies obtained by the conventional classifiers, ensemble learners, and ensemble pruning methods are given. In Figure 6, swarm-optimized latent Dirichlet allocation stage has been applied to represent the dataset. For the statistical significance of results, confidence intervals are divided into regions denoted by red dashed lines. As the interval plots indicate, the predictive performances obtained by the swarm-optimized LDA (BA-LDA (DB)) and DEP (firefly clustering) are statistically significant.

In Figure 7, the average execution times of the compared algorithms are presented in seconds. As can be observed from Figure 7, the average execution times of the base learning algorithms (Naïve Bayes and support vector machines) are the lowest. Conventional ensemble learning methods generally enhance the predictive performance of the conventional base learning algorithms. However, ensemble learning methods require more execution time. Compared to the ensemble learning methods, ensemble pruning schemes require even more execution time. The highest execution time is required by ensemble pruning based on combined diversity measures (CDM), and the second highest execution time is required by the proposed classification scheme (DEP-firefly clustering). Metaheuristic optimization methods are well-established techniques for tuning parameters, but they add computational cost. Hence, there is a trade-off between predictive performance and execution time.

6. Conclusion

In this work, we propose a novel biomedical text classification scheme based on swarm-optimized latent Dirichlet allocation and diversity-based ensemble pruning. Biomedical text categorization is an important research direction due to the immense quantity of unstructured information available. The latent Dirichlet allocation (LDA) is a popular representation scheme for text documents, which can yield better performance than other linguistic representation schemes, such as latent semantic analysis and probabilistic latent semantic analysis. We found out that the identification of appropriate parameter values is very important to the performance of LDA. In addition, it has been experimentally validated that the use of metaheuristic optimization algorithms to calibrate the parameters of LDA yields promising results on biomedical text categorization. The presented text classification scheme also employs an ensemble pruning approach based on combined diversity measures to identify a robust multiple classifier system with high predictive performance. The presented ensemble pruning approach combines four different diversity measures (namely, disagreement measure, Q-statistics, the correlation coefficient, and the double fault measure). In addition, the scheme employs the swarm-based clustering algorithm. The experimental results indicate that the proposed multiple classifier system outperforms the conventional classification algorithms, ensemble learning, and ensemble pruning methods.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. E. S. Chen, G. Hripcsak, H. Xu, M. Markatou, and C. Friedman, “Automated acquisition of disease–drug knowledge from biomedical and clinical documents: an initial study,” Journal of the American Medical Informatics Association, vol. 15, no. 1, pp. 87–98, 2008.
2. R. Rodriguez-Esteban, “Biomedical text mining and its applications,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000597, 2009.
3. R. L. Figueroa and C. A. Flores, “Extracting Information from Electronic Medical Records to Identify the Obesity Status of a Patient Based on Comorbidities and Bodyweight Measures,” Journal of Medical Systems, vol. 40, no. 8, pp. 1–9, 2016.
4. J. Urbain, “Mining heart disease risk factors in clinical text with named entity recognition and distributional semantic models,” Journal of Biomedical Informatics, vol. 58, pp. S143–S149, 2015.
5. T. G. Soldatos, S. I. O'Donoghue, V. P. Satagopam et al., “Martini: using literature keywords to compare gene sets,” Nucleic Acids Research, vol. 38, no. 1, pp. 26–38, 2010.
6. C. A. Trugenberger, C. Wälti, D. Peregrim, M. E. Sharp, and S. Bureeva, “Discovery of novel biomarkers and phenotypes by semantic technologies,” BMC Bioinformatics, vol. 14, no. 1, article 51, 2013.
7. A. Holzinger, J. Schantl, M. Schroettner, C. Seifert, and K. Verspoor, “Biomedical text mining: state-of-the-art, open problems and future challenges,” in Interactive Knowledge Discovery and Data Mining in Biomedical Informatics, pp. 271–300, Springer, Berlin, Germany, 2014.
8. A. Onan and S. Korukoğlu, “A feature selection model based on genetic rank aggregation for text sentiment classification,” Journal of Information Science, 2017.
9. C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval, vol. 1, No. 1, p. 496, Cambridge University Press, Cambridge, UK, 2008.
10. T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” Machine Learning: ECML-98, pp. 137–142, 1998.
11. T. K. Landauer, D. Laham, B. Rehder, and M. E. Schreiner, “How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans,” in Proceedings of the 19th annual meeting of the Cognitive Science Society, pp. 412–417, 1997.
12. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, article 391, 1990.
13. T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pp. 50–57, ACM, August 1999.
14. M. Girolami and A. Kabán, “Sequential activity profiling: Latent dirichlet allocation of Markov chains,” Data Mining and Knowledge Discovery, vol. 10, no. 3, pp. 175–196, 2005.
15. T. G. Dietterich, “Ensemble methods in machine learning,” in International workshop on multiple classifier systems, pp. 1–15, Springer, Berlin, Germany, 2000.
16. J. Mendes-Moreira, C. Soares, A. M. Jorge, and J. F. D. Sousa, “Ensemble approaches for regression: A survey,” ACM Computing Surveys (CSUR), vol. 45, no. 1, article 10, 2012.
17. F. Roli, G. Giacinto, and G. Vernazza, “Methods for designing multiple classifier systems,” Lecture Notes in Computer Science, vol. 2096, pp. 78–87, 2001.
18. Z. Zhou, J. Wu, and W. Tang, “Ensembling neural networks: many could be better than all,” Artificial Intelligence, vol. 137, no. 1-2, pp. 239–263, 2002.
19. A. Onan, H. Bulut, and S. Korukoglu, “An improved ant algorithm with LDA-based representation for text document clustering,” Journal of Information Science, vol. 43, no. 2, pp. 275–292, 2017.
20. A. Onan, S. Korukoğlu, and H. Bulut, “A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification,” Information Processing & Management, vol. 53, no. 4, pp. 814–833, 2017.
21. G. D. C. Cavalcanti, L. S. Oliveira, T. J. M. Moura, and G. V. Carvalho, “Combining diversity measures for ensemble pruning,” Pattern Recognition Letters, vol. 74, pp. 38–45, 2016.
22. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.
23. K. Tian, M. Revelle, and D. Poshyvanyk, “Using latent dirichlet allocation for automatic categorization of software,” in Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories, 2009. MSR'09, pp. 163–166, IEEE, 2009.
24. Z. Zhai, B. Liu, H. Xu, and P. Jia, “Constrained LDA for grouping product features in opinion mining,” Advances in Knowledge Discovery and Data Mining, pp. 448–459, 2011.
25. Q. Wu, C. Zhang, Q. Hong, and L. Chen, “Topic evolution based on LDA and HMM and its application in stem cell research,” Journal of Information Science, vol. 40, no. 5, pp. 611–620, 2014.
26. A. Bagheri, M. Saraee, and F. De Jong, “ADM-LDA: An aspect detection model based on topic modelling using the structure of review sentences,” Journal of Information Science, vol. 40, no. 5, pp. 621–636, 2014.
27. L. Hong and B. D. Davison, “Empirical study of topic modeling in twitter,” in Proceedings of the first workshop on social media analytics, pp. 80–88, ACM, 2010.
28. Z. Chen, Y. Huang, J. Tian, X. Liu, K. Fu, and T. Huang, “Joint model for subsentence-level sentiment analysis with Markov logic,” Journal of the Association for Information Science and Technology, vol. 66, no. 9, pp. 1913–1922, 2015.
29. A. Onan, S. Korukoglu, and H. Bulut, “LDA-based Topic Modelling in Text Sentiment Classification: An Empirical Analysis,” International Journal of Computational Linguistics and Applications, vol. 7, no. 1, pp. 101–119, 2016.
30. L. Liu, L. Tang, W. Dong, S. Yao, and W. Zhou, “An overview of topic modeling and its current applications in bioinformatics,” SpringerPlus, vol. 5, no. 1, article 1608, 2016.
31. H. Wang, M. Huang, and X. Zhu, “Extract interaction detection methods from the biological literature,” BMC Bioinformatics, vol. 10, no. 1, article S55, 2009.
32. C. W. Arnold, S. M. El-Saden, A. A. Bui, and R. Taira, “Clinical case-based retrieval using latent topic analysis,” in AMIA annual symposium proceedings, vol. 2010, p. 26, American Medical Informatics Association, 2010.
33. M. Song and S. Y. Kim, “Detecting the knowledge structure of bioinformatics by mining full-text collections,” Scientometrics, vol. 96, no. 1, pp. 183–201, 2013.
34. E. Sarioglu, K. Yadav, and H. A. Choi, “Topic Modeling Based Classification of Clinical Reports,” in ACL (Student Research Workshop), pp. 67–73, 2013.
H. Bisgin, Z. Liu, H. Fang, X. Xu, and W. Tong, “Mining FDA drug labels using an unsupervised learning technique-topic modeling,” BMC Bioinformatics, vol. 12, no. 10, article no. S11, 2011.
View at: Publisher Site | Google Scholar
V. Wang, L. Xi, A. Enayetallah, E. Fauman, and D. Ziemek, “GeneTopics - interpretation of gene sets via literature-driven topic models,” BMC Systems Biology, vol. 7, no. 5, article no. S10, 2013.
View at: Publisher Site | Google Scholar
H. Bisgin, M. Chen, Y. Wang et al., “A systems approach for analysis of high content screening assay data with topic modeling,” BMC Bioinformatics, vol. 14, no. 14, article no. S11, 2013.
View at: Publisher Site | Google Scholar
H. Bisgin, Z. Liu, R. Kelly, H. Fang, X. Xu, and W. Tong, “Investigating drug repositioning opportunities in FDA drug labels through topic modeling,” BMC Bioinformatics, vol. 13, no. 15, article S6, 2012.
View at: Google Scholar
S.-H. Wang, Y. Ding, W. Zhao et al., “Text mining for identifying topics in the literatures about adolescent substance use and depression,” BMC Public Health, vol. 16, no. 1, article no. 279, 2016.
View at: Publisher Site | Google Scholar
X. Wang, P. Zhu, T. Liu, and K. Xu, “BioTopic: A topic-driven biological literature mining system,” International Journal of Data Mining and Bioinformatics, vol. 14, no. 4, pp. 373–386, 2016.
View at: Publisher Site | Google Scholar
R. Sullivan, A. B. E. E. D. Sarker, OK. A. R. E. N. Connor, A. M. A. N. D. A. Goodin, M. A. R. K. Karlsrud, and G. R. A. C. I. E. L. A. Gonzalez, “Finding potentially unsafe nutritional supplements from user reviews with topic modeling,” in Pacific Symposium on Biocomputing, vol. 21, pp. 528–539, World Scientific, Kohala Coast, Hawaii, 2016.
View at: Google Scholar
J. H. Chen, M. K. Goldstein, S. M. Asch, L. Mackey, and R. B. Altman, “redicting inpatient clinical order patterns with probabilistic topic models vs conventional order sets,” Journal of the American Medical Informatics Association, ocw136, 2016.
View at: Google Scholar
M. Aksela, “Comparison of classifier selection methods for improving committee performance,” in International Workshop on Multiple Classifier Systems, pp. 84–93, Springer, Berlin, Germany, 2003.
View at: Publisher Site | Google Scholar
D. Ruta and B. Gabrys, “Application of the evolutionary algorithms for classifier selection in multiple classifier systems with majority voting,” in International Workshop on Multiple Classifier Systems, pp. 399–408, Springer, Berlin, Germany.
View at: Google Scholar
Z. H. Zhou and W. Tang, “Selective ensemble of decision trees,” Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, pp. 589–589, 2003.
View at: Google Scholar
S. Sheen and A. P. Sirisha, “Malware detection by pruning of parallel ensembles using harmony search,” Pattern Recognition Letters, vol. 34, pp. 1679–1686, 2013.
View at: Google Scholar
I. Mendialdua, A. Arruti, E. Jauregi, E. Lazkano, and B. Sierra, “Classifier Subset Selection to construct multi-classifiers by means of estimation of distribution algorithms,” Neurocomputing, vol. 157, pp. 46–60, 2015.
View at: Publisher Site | Google Scholar
D. D. Margineantu and T. G. Dietterich, “Pruning adaptive boosting,” in Proceedings of the Fourteenth International Conference on Machine Learning, pp. 211–218, San Francisco, Calf, USA, 1997.
View at: Google Scholar
R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes, “Ensemble selection from libraries of models,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 18–39, Banff, Canada, July 2004.
View at: Google Scholar
Q. Dai, T. Zhang, and N. Liu, “A new reverse reduce-error ensemble pruning algorithm,” Applied Soft Computing, vol. 28, pp. 237–249, 2015.
View at: Publisher Site | Google Scholar
S. B. Kotsiantis and P. E. Pintelas, “Selective averaging of regression models,” Annals of Mathematics, Computing & Teleinformatics, vol. 1, no. 3, pp. 65–74, 2005.
View at: Google Scholar
M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera, “Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets,” Information Sciences, vol. 354, pp. 178–196, 2016.
View at: Publisher Site | Google Scholar
H. Zhang and L. Cao, “A spectral clustering based ensemble pruning approach,” Neurocomputing, vol. 139, pp. 289–297, 2014.
View at: Publisher Site | Google Scholar
T. L. Griffiths, M. I. Jordan, J. B. Tenenbaum, and D. M. Blei, “Hierarchical topic models and the nested chinese restaurant process,” in Advances in neural information processing systems, pp. 17–24, 2004.
View at: Google Scholar
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Sharing clusters among related groups: Hierarchical Dirichlet processes,” in Advances in Neural Information Processing Systems, pp. 1385–1392, 2005.
View at: Google Scholar
S. Grant and J. R. Cordy, “Estimating the optimal number of latent concepts in source code analysis,” in Proceedings of the 10th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM '10), pp. 65–74, IEEE, September 2010.
View at: Publisher Site | Google Scholar
A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshynanyk, and A. De Lucia, “How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms,” in Proceedings of the 35th International Conference on Software Engineering (ICSE '13), pp. 522–531, IEEE Press, May 2013.
View at: Publisher Site | Google Scholar
W. Zhao, J. J. Chen, R. Perkins et al., “A heuristic approach to determine an appropriate number of topics in topic modeling,” BMC Bioinformatics, vol. 16, no. 13, article no. S8, 2015.
View at: Publisher Site | Google Scholar
A. Karami, A. Gangopadhyay, B. Zhou, and H. Kharrazi, “Fuzzy Approach Topic Discovery in Health and Medical Corpora,” International Journal of Fuzzy Systems, pp. 1–12, 2017.
View at: Google Scholar
R. Mousavi and M. Eftekhari, “A new ensemble learning methodology based on hybridization of classifier ensemble selection approaches,” Applied Soft Computing, vol. 37, pp. 652–666, 2015.
View at: Publisher Site | Google Scholar
M. Jordan, Learning in graphical models, MIT Press, Cambridge, Mass, USA, 1999.
L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1-2, pp. 1–39, 2010.
View at: Publisher Site | Google Scholar
Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, Chapman and Hall, New York, NY, USA, 2012.
View at: MathSciNet
L. Breiman, “Bagging predictors,” Machine Learning, vol. 4, no. 2, pp. 123–140, 1996.
View at: Google Scholar
T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
View at: Publisher Site | Google Scholar
D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
View at: Publisher Site | Google Scholar
Q. Sun and B. Pfahringer, “Bagging ensemble selection,” in Proceedings of the 24th Australasian Joint Conference on Artificial Intelligence, pp. 251–260, Australia, 2011.
View at: Google Scholar
S. Cheng, B. Liu, T. O. Ting, Q. Qin, Y. Shi, and K. Huang, “Survey on data science with population-based algorithms,” Big Data Analytics, vol. 1, no. 1, article 3, 2016.
View at: Google Scholar
J. H. Holland, Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence, MIT press, 1992.
J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of the IEEE International Conference on Neural Networks, pp. 1942–1948, Perth, Australia, December 1995.
View at: Google Scholar
X.-S. Yang, “A new metaheuristic bat-inspired algorithm,” in Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), pp. 65–74, 2010.
View at: Publisher Site | Google Scholar
X.-S. Yang and S. Deb, “Engineering optimisation by Cuckoo search,” International Journal of Mathematical Modelling and Numerical Optimisation, vol. 1, no. 4, pp. 330–343, 2010.
View at: Publisher Site | Google Scholar
X. S. Yang, Nature-inspired metaheuristic algorithms, Luniver press, 2010.
E. Rendón, I. M. Abundez, C. Gutierrez et al., “A comparison of internal and external cluster validation indexes,” in Proceedings of the 2011 American Conference, vol. 29, San Francisco, Calf, USA, 2011.
View at: Google Scholar
D. J. Poirier, Intermediate statistics and econometrics: a comparative approach, MIT Press, 1995.
View at: MathSciNet
L. I. Kuncheva and C. J. Whitaker, “Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy,” Machine Learning, vol. 51, no. 2, pp. 181–207, 2003.
View at: Publisher Site | Google Scholar
R. G. Rossi, R. M. Marcacini, and S. O. Rezende, Benchmarking text collections for classification and clustering tasks, Institute of Mathematics and Computer Sciences, University of Sao Paulo, 2013.
H. Narasimhan, W. Pan, P. Kar, P. Protopapas, and H. G. Ramaswamy, “Optimizing the Multiclass F-Measure via Biconcave Programming,” in Proceedings of the IEEE 16th International Conference on Data Mining (ICDM), pp. 1101–1106, IEEE, 2016.
View at: Google Scholar
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.
View at: Google Scholar
X. Min, L. Liu, Y. He et al., Benchmarking swarm intelligence clustering algorithms with case study of medical data, 2016.
P. Das, D. K. Das, and S. Dey, “A New Class Topper Optimization Algorithm with an Application to Data Clustering,” IEEE Transactions on Emerging Topics in Computing, 2018.
View at: Google Scholar

Copyright

Copyright © 2018 Aytuğ Onan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
