Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 8310934, 13 pages

https://doi.org/10.1155/2017/8310934

## A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

^{1}College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China^{2}China Information Technology Security Evaluation Center, Beijing 100085, China

Correspondence should be addressed to Xianghui Zhao

Received 9 October 2016; Revised 1 February 2017; Accepted 16 February 2017; Published 15 March 2017

Academic Editor: Nazrul Islam

Copyright © 2017 Junkai Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Text clustering is an effective approach to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, there exist some issues to tackle such as feature extraction and data dimension reduction. To overcome these problems, we present a novel approach named deep-learning vocabulary network. The vocabulary network is constructed based on related-word set, which contains the “cooccurrence” relations of words or terms. We replace term frequency in feature vectors with the “importance” of words in terms of vocabulary network and PageRank, which can generate more precise feature vectors to represent the meaning of text clustering. Furthermore, sparse-group deep belief network is proposed to reduce the dimensionality of feature vectors, and we introduce coverage rate for similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach to the representative algorithms, and experimental results show that feature vectors in terms of deep-learning vocabulary network have better clustering performance.

#### 1. Introduction

Webpages, microblogs, and social networks provide much useful information for us, and text clustering is an important text mining method to collect valuable information on the Internet. Text clustering helps us to group an enormous amount of text documents into small meaningful clusters, which have been used in many research fields such as sentiment analysis (opinion mining) [1–3], text classification [4–6], text summarization [7], and event tracking and topic detection [8–10].

The process of text clustering is usually divided into two phases: preprocessing phase and clustering phase. Before preprocessing phase, there are some basic steps (including tokenization, remove-stop-words, and stemming-word) needed to process text documents, and these steps split sentences into words and remove useless words or terms.

The first phase is the preprocessing of text, and the second phase is clustering for text documents. The preprocessing phase is mainly to transform text documents into structured data that can be processed by clustering algorithms. This phase contains two parts: feature extraction and feature selection.

In existing scientific literatures, there are two categories of feature extraction methods: term frequency-based method and semantic web-based method. Term frequency-based method is a method of counting words’ number, and semantic web is to construct the knowledge in certain domain to an ontology, which contains words and their relations.

Term-document vectors are extracted from text documents in the process of feature extraction. Most term frequency-based methods employ vector space model (VSM) to represent text documents, and each entry of VSM is the frequency of words or terms. The most representative method based on term frequency is term frequency-inverse document frequency (tf-idf) algorithm. For its simplicity and high efficiency, researchers have proposed many improved tf-idf algorithms [11, 12].

However, the relations of words (or word order) are lost when text documents are transformed into term-document vectors. Many researchers find that the words or terms have lexical “cooccurrence” phenomenon [13], which means some words or terms have a high probability of occurrence in a text document. Researchers think that the “cooccurrence” relations of words or terms can generate more precise feature vectors to represent the meaning of text documents.

The objective of feature selection is to remove redundant information and reduce the dimensionality of term-document vectors. The methods of feature selection are categorized as corpus-based method, Latent Semantic Indexing (LSI), and subspace-based clustering. The corpus-based method merges synonyms together to reduce the dimensionality of features, which depends on large corpora such as WordNet and HowNet. Traditional LSI decomposes a term-document vector into a term-space matrix by singular value decomposition (SVD). Subspace-based clustering groups text documents in a low-dimensional subspace.

In our paper, we propose a novel approach to address two issues: one is the loss of word relations in the process of feature extraction, and the other is to retain the word relations in dimension reduction. Considering that the relations of words and terms are lost in term frequency-based methods, we construct a vocabulary network to retain “cooccurrence” relations of words or terms. Term frequency is replaced with the “importance” of words or terms in VSM. Furthermore, traditional feature selection methods can lose some information that affects the performance of clustering [14], and we introduce deep learning for dimension reduction.

The main contributions of our paper are that we present a novel graph-based approach for text clustering, called deep-learning vocabulary network (DLVN). We employ the edges of vocabulary network to represent the relations between words or terms and extract features of text documents in terms of related-word set. The related-word set is a set of words in the same class, and we utilize association rules learning to obtain relations between words. In addition, high dimensional and sparse features of text have a big influence on clustering algorithms, and we employ deep learning for dimensionality reduction. Accordingly, an improved deep-learning Single-Pass (DL-SP) is used in the process of clustering. To verify the effectiveness of the approach, we provide our experimental evaluation based on Chinese corpora.

The rest of this paper is organized as follows. Section 1 reviews related work in previous literatures. Section 2 introduces theoretical foundation related to this paper. Section 3 describes the approach of DLVN we propose. Section 4 is experimental analysis. Section 5 is the conclusion of our work.

#### 2. Related Work

Text clustering groups text documents of similar content (so-called topic) into a cluster. In this section, we use three subsections to review related literatures.

##### 2.1. Feature Extraction

Term frequency-based method is an important method to extract features. In term frequency-based method, text documents are represented as VSM, and each document is transformed into a vector, whose entries are the frequency of words or terms. Most term frequency-based methods are to improve tf-idf.

Semantic web is to structure knowledge into an ontology. As researchers find that the relations between words contribute to understanding the meaning of text, they construct a semantic network in terms of concepts, events, and their relations. Yue et al. [15] constructed a domain-specific ontology to describe the hazards related to dairy products and translated the term-document vectors (namely, feature vectors of text) into a concept space. Wei et al. [16] exploited an ontology hierarchical structure for word sense disambiguation to assess similarity of words. The experiment results showed better clustering performance for ontology-based methods considering the semantic relations between words. Bing et al. [17] proposed an adaptive concept resolution (ACR) model for the characteristics of text documents, and ACR was an ontology-based method of text representation. However, the efficiency of semantic web analysis is a challenge for researchers, and the large scale of text corpora has a great influence on algorithms [18].

For retaining the relations of words and terms, some researchers proposed to employ graph-based model in text clustering [19, 20]. Mousavi et al. [21] proposed a weighted-graph representation of text to extract semantic relations in terms of parse trees of sentences. In our work, we introduce frequent itemsets to construct related-word set, and use each itemset of related-word set to represent the relations between words. Language is always changing, and new words are appearing every day. Related-word set can capture the change of language by mining frequent itemsets.

##### 2.2. Feature Selection

Feature selection is a feature construction method to transform a high dimensional feature space into a low-dimensional feature space. SVD is a representative method using mathematical theory for dimension reduction. Jun et al. [22] combined SVD and principal-component analysis (PCA) for dimensionality reduction. Zhu and Allen [23] proposed a latent semantic indexing subspace signature model (LSISSM) based on LSI and transformed term-document vectors into a low-rank approximation for dimensionality reduction. However, LSI selects a new feature subset to construct a semantic space, which loses some important features and suffers from the irrelevant features.

Due to the sparsity and high-dimensionality of text features, the performance of the subspace-based clustering is better than traditional clustering algorithm [24, 25]. Moreover, some researchers integrate many related theories for dimensionality reduction. Bharti and Singh [26] proposed a hybrid intelligent algorithm, which integrated binary particle swarm optimization, chaotic map, dynamic inertia weight, and mutation for feature selection.

##### 2.3. Clustering Algorithm

Clustering is an unsupervised approach of machine learning, and it groups similar objects into a cluster. The most representative clustering algorithm is partitional clustering such as* k-means* and* k-medoids* [27], and each cluster has a center called centroid in partitional clustering. Mei and Chen [28] proposed a clustering around weighted prototypes (CAWP) based on new cluster representation method, where each cluster was represented by multiple objects with various weights. Tunali et al. [29] improved spherical* k-means* (SKM) and proposed a multicluster spherical* k-means* (MCSKM), which allowed documents to be assigned more than one cluster. Li et al. [30] introduced a concept of neighbor and proposed a parallel* k-means* based on neighbors (PKBN).

Another representative clustering algorithm is hierarchical clustering, which contains divisive hierarchical clustering and agglomerative hierarchical clustering [31]. Peng and Liu [32] proposed an incremental hierarchical text clustering approach, which represented a cluster hierarchy using CFu-tree. In addition, Chen et al. [33] proposed an improved density clustering algorithm named density-based spatial clustering of applications with noise (DBSCAN). DBSCAN was sensitive to choosing parameters; the authors combined* k-means* to estimate the parameters.

Ensemble clustering is another clustering algorithm. Ensemble clustering combines the multiple results of different clustering algorithms to obtain final results. Multiview clustering is an extension of ensemble clustering and combines different data that have different properties and views [34, 35].

Matrix factorization-based clustering is an important clustering approach [36]. Lu et al. [37] proposed a semisupervised concept factorization (SSCF), which contained nonnegative matrix factorization and concept factorization for text clustering. SSCF integrated penalized and reward terms by pairwise constraints* must-link* constraints and* cannot-link* constraints , which implied two documents belonging to the same cluster or different clusters.

Topic-based text clustering is an effective text clustering approach, in which text documents are projected into a topic space. Latent Dirichlet allocation (LDA) is a common topic model. Yau et al. [38] separated scientific publications into several clusters based on LDA. Ma et al. [39] employed the topic model of LDA to represent the centroids of clusters and combined* k-means++* algorithm for document clustering.

In some literatures, additional information is introduced for text clustering such as side-information [40] and privileged information [41]. What is more, several global optimization algorithms are utilized for text clustering such as particle swarm optimization (PSO) algorithm [42, 43] and bee colony optimization (BCO) algorithm [44, 45].

Similarity measure is also an important issue in text clustering algorithms. To compute the similarity between a text document and a cluster is a fundamental problem in clustering algorithms. The most common similarity measure is distance metric such as Euclidean distance, Cosine distance, and Generalized Mahalanobis distance [46]. There exist other similarity measure methods such as IT-Sim (an information-theoretic measure) [47]. Besides similarity measure, measurement of discrimination information (MDI) is an opposite concept to compute the relations of text documents [48–50].

#### 3. Theoretical Foundation

In this section, we describe some theories related to our work. This section contains three subsections, which are frequent pattern maximal (FPMAX), PageRank, and deep belief network (DBN).

##### 3.1. FPMAX

FPMAX is a depth-first and recursive algorithm for mining maximal frequent itemsets (MFIs) in given dataset [51]. Before FPMAX is called, frequent pattern tree (FP-tree) is structured to store frequent itemsets, and each branch of the FP-tree is a representation of a frequent itemset. FP-tree includes a linked list head, which contains all items of the dataset. Maximal frequent itemset tree (MFI-tree) is introduced to store all MFIs in FPMAX. The procedure of FPMAX is described Algorithm 1.