Computational Intelligence and Neuroscience
Volume 2016, Article ID 1096271, 11 pages
http://dx.doi.org/10.1155/2016/1096271
Research Article

Using SVD on Clusters to Improve Precision of Interdocument Similarity Measure

1Center on Big Data Sciences, Beijing University of Chemical Technology, Beijing 100039, China
2Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China

Received 2 March 2016; Accepted 8 June 2016

Academic Editor: Toshihisa Tanaka

Copyright © 2016 Wen Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Recently, LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) has been proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, it is usually criticized for its low discriminative power in representing documents, although it has been validated as having good representative quality. In this paper, SVD on clusters is proposed to improve the discriminative power of LSI. The contribution of this paper is threefold. Firstly, we survey existing linear algebra methods for LSI, including both SVD based methods and non-SVD based methods. Secondly, we propose SVD on clusters for LSI and theoretically explain that dimension expansion of document vectors and dimension projection using SVD are the two manipulations involved in SVD on clusters. Moreover, we develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Thirdly, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performance of the proposed methods. Experiments demonstrate that, to some extent, SVD on clusters can improve the precision of the interdocument similarity measure in comparison with other SVD based LSI methods.

1. Introduction

As computer networks become the backbones of science and economy, enormous quantities of machine readable documents become available. The fact that about 80 percent of business is conducted on unstructured information [1, 2] creates a great demand for efficient and effective text mining techniques, which aim to discover high quality knowledge from unstructured information. Unfortunately, the usual logic-based programming paradigm has great difficulty in capturing the fuzzy and often ambiguous relations in text documents. For this reason, text mining, also known as knowledge discovery from texts, is proposed to deal with the uncertainty and fuzziness of languages and to disclose hidden patterns (knowledge) in documents.

Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user’s query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy and homonymy), so terms in a user’s query will literally match terms in irrelevant documents. For these reasons, a better approach would allow users to retrieve information on the basis of the conceptual topic or meaning of a document [3, 4].

Latent Semantic Indexing (LSI) is proposed to overcome the problem of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval [5, 6]. The retrieval method is called Latent Semantic Indexing because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by computing the SVD of a large sparse term-document matrix. Terms and documents, represented in a reduced space spanned by the singular vectors of the largest singular values, are then matched against user queries. Performance data show that the term-document matrix statistically derived by SVD is more robust for retrieving documents based on concepts and meanings than the original term-document matrix produced merely from individual words with the vector space model (VSM).

In this paper, we propose SVD on clusters (SVDC) to improve the discriminative power of LSI. The contribution of this paper is threefold. Firstly, we survey existing linear algebra methods for LSI, including both SVD based methods and non-SVD based methods. Secondly, we theoretically explain that dimension expansion of document vectors and dimension projection using SVD are the two manipulations involved in SVD on clusters. We also develop updating processes to fold new documents and terms into a matrix decomposed by SVD on clusters. Thirdly, two corpora, a Chinese corpus and an English corpus, are used to evaluate the performance of the proposed methods.

The rest of this paper is organized as follows. Section 2 provides a survey of recent research on Latent Semantic Indexing and related topics. Section 3 proposes SVD on clusters and its updating process. Section 4 presents the experiments that evaluate the proposed methods. Section 5 concludes this paper and indicates future work.

2. Related Work

2.1. Singular Value Decomposition

The singular value decomposition is commonly used in the solution of unconstrained linear least squares problems, matrix rank estimation, and canonical correlation analysis [7, 8]. Given a matrix $A \in \mathbb{R}^{m \times n}$, where without loss of generality $m \ge n$ and $\mathrm{rank}(A) = r$, the singular value decomposition of $A$, denoted by $\mathrm{SVD}(A)$, is defined as

$$A = U \Sigma V^{T}.$$

Here $U^{T}U = V^{T}V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with $\sigma_i > 0$ for $1 \le i \le r$ and $\sigma_j = 0$ for $j \ge r + 1$. The first $r$ columns of the orthonormal matrices $U$ and $V$ define the orthonormal eigenvectors associated with the $r$ nonzero eigenvalues of $AA^{T}$ and $A^{T}A$, respectively. The columns of $U$ and $V$ are referred to as the left and right singular vectors, respectively, and the singular values of $A$ are defined as the diagonal elements of $\Sigma$, which are the nonnegative square roots of the eigenvalues of $A^{T}A$. Furthermore, if we define $A_k = \sum_{i=1}^{k} u_i \sigma_i v_i^{T}$, then $A_k$ is the best rank-$k$ approximation of $A$ in terms of Frobenius norm [7].
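The rank-$k$ approximation described above can be illustrated with a minimal NumPy sketch. The function name `truncated_svd` and the random toy matrix are assumptions made only for this illustration; they do not come from the experiments in this paper.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation A_k = U_k diag(s_k) V_k^T and all singular values."""
    # full_matrices=False gives the thin SVD; s is sorted in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.random((8, 5))          # toy term-document matrix: 8 terms, 5 documents
A_2, s = truncated_svd(A, 2)

# Frobenius error of the truncation versus the discarded singular values.
err = np.linalg.norm(A - A_2, "fro")
expected = np.sqrt(np.sum(s[2:] ** 2))
```

In this sketch the Frobenius error of the truncation coincides with the root of the sum of the squared discarded singular values, which is the sense in which $A_k$ is the best approximation.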

2.2. Recent Studies in LSI

Recently, a series of methods based on different matrix decompositions have been proposed to conduct LSI. A common point of these decomposition methods is to find a rank-deficient matrix in the decomposed space to approximate the original matrix, so that the term frequency distortion in the term-document matrix can be adjusted. Basically, we can divide these methods into two categories: matrix decomposition based on SVD and matrix decomposition not based on SVD. Table 1 lists the existing linear algebraic methods for LSI.

Table 1: Existing linear algebra methods for LSI.

The SVD based LSI methods include IRR [9], SVR [10], and ADE [11]. Briefly, IRR conjectures that SVD removes two kinds of “noise” from the original term-document matrix: exceptional documents and documents with minor terms. However, if our concentration is on characterizing the relationships of documents in a collection rather than looking for representative documents, then IRR can play an effective role. The basic idea behind SVR is that the “noise” in the original document representation vectors comes from minor vectors, that is, those vectors which are far from the representative vectors in terms of distance. Thus, we need to augment the influence of representative vectors and meanwhile reduce the influence of minor vectors in the approximation matrix. Following this idea, SVR adjusts the differences between major dimensions and minor dimensions in the approximation matrix by rescaling the singular values in $\Sigma$. Based on the observation that the singular values in $\Sigma$ have a low-rank-plus-shift structure, ADE flattens the first $k$ largest singular values to a fixed value and combines them with the other small singular values to reconstruct $\Sigma$, so that dimension values are relatively equalized in the approximation matrix of $A$.

The non-SVD based LSI methods include SDD [12], LPI [13], and R-SVD [14]. SDD restricts the values in the singular vectors ($U$ and $V$) of the approximation matrix to entries in the set $\{-1, 0, 1\}$. In this way, it needs merely one-twentieth of the storage and only one-half of the query time while performing as well as SVD does for LSI in information retrieval. LPI argues that LSI seeks to uncover the most representative features rather than the most discriminative features for document representation. With this motivation, LPI constructs the adjacency graph of documents and aims to discover the local structure of the document space using Locality Preserving Projection (LPP). In essence, LPI can be regarded as adapted from LDA (Linear Discriminant Analysis) [15], which concerns dimension reduction for supervised classification. R-SVD differs from SVD mathematically in that the term-document matrix decomposition of SVD is based on Total Least Squares (TLS), while the matrix decomposition in R-SVD is based on Structured Total Least Squares (STLS). R-SVD is not designed for LSI but for information filtering, to improve the effectiveness of information retrieval by using users’ feedback.

Recently, two methods in [16, 17] are presented which also make use of SVD and clustering. In [16], Gao and Zhang investigate three strategies of using clustering and SVD for information retrieval as noncluster retrieval, full-cluster retrieval, and partial cluster retrieval. Their study shows that partial cluster retrieval produces the best performance. In [17], Castelli et al. make use of clustering and singular value decomposition for nearest-neighbor search in image indexing. They use SVD to rotate the original vectors of images to produce zero-mean, uncorrelated features. Moreover, a recursive clustering and SVD strategy is also adopted in their method when the distance of reconstructed centroids and original centroids exceeds a threshold.

Although the two methods are very similar to SVD on clusters, they are proposed for different uses with different motivations. Firstly, this research presents a complete theory for SVD on clusters, including theoretical motivation, theoretical analysis of effectiveness, and an updating process, none of which is mentioned in either of the two referred methods. Secondly, this research describes the detailed procedures of using SVD on clusters and attempts to use different clustering methods ($k$-Means and SOMs clustering), which are not mentioned in either of the two referred methods. Thirdly, the motivation for proposing SVDC differs from theirs. They proposed clustering and SVD for inhomogeneous data sets, whereas our motivation is to improve the discriminative power of document indexing.

3. SVD on Clusters

3.1. The Motivation

The motivation for proposing SVD on clusters can be specified in the following 4 aspects.

(1) The huge computation complexity involved in traditional SVD. According to [18], the actual computation complexity of SVD is quadratic in the rank of the term-document matrix (the rank is bounded by the smaller of the number of documents and the number of terms) and cubic in the number of singular values that are computed [19]. On the one hand, in most cases of SVD for a term-document matrix, the number of documents is much smaller than the number of index terms. On the other hand, the number of singular values, which is equal to the rank of the term-document matrix, also depends on the number of documents. For this reason, we can regard the computation complexity of SVD as determined by the number of documents in the term-document matrix. That is to say, if the number of documents in the term-document matrix is reduced, then the huge computation complexity of SVD is reduced as well.

(2) Clusters existing in a document collection. Usually, different topics are scattered across different documents of a text collection. Even if all documents in a collection concern the same topic, we can divide them into several subtopics. Although SVD has the ability to uncover the most representative vectors for text representation, it might not be optimal in discriminating documents with different semantics. In information retrieval, as many documents relevant to the query as possible should be retrieved; on the other hand, as few irrelevant documents as possible should be retrieved. If principal clusters, in which documents have closely related semantics, can be extracted automatically, then relevant documents can be retrieved within a cluster under the assumption that closely associated documents tend to be relevant to the same request; that is, relevant documents are more like one another than they are like nonrelevant documents.

(3) Contextual information and cooccurrence of index terms in documents. Classic weighting schemes [20, 21] are proposed on the basis of information about the frequency distribution of index terms within the whole collection or within the relevant and nonrelevant sets of documents. The underlying model for these term weighting schemes is a probabilistic one, and it assumes that the index terms used for representation are distributed independently in documents. Assuming variables to be independent is usually a matter of mathematical convenience. However, in information retrieval, exploiting the dependence or association between index terms or documents often leads to better retrieval results, as in most linear algebra methods proposed for LSI [3, 22]. That is, from a mathematical point of view, the index terms in documents depend on each other. From the viewpoint of linguistics, topical words tend to be bursty in documents, and lexical words concerning the same topic are likely to cooccur in the same content. That is, the contextual words of an index term should also be emphasized and put together when used for retrieval. In this sense, capturing the cooccurrence of index terms in documents, and further capturing the cooccurrence of documents with some common index terms, is of great importance for characterizing the relationships of documents in a text collection.

(4) Divide-and-conquer strategy as theoretical support. The singular values in $\Sigma$ of the SVD of the term-document matrix $A$ have a low-rank-plus-shift structure; that is, the singular values decrease sharply at first, level off noticeably, and dip abruptly at the end. According to Zha et al. [23], if $A$ has the low-rank-plus-shift structure, then the optimal low-rank approximation of $A$ can be computed via a divide-and-conquer approach. That is to say, approximation of submatrices of $A$ can also produce effectiveness in LSI comparable to direct SVD of $A$.

With all of the above observations from both practices and theoretical analysis, SVD on clusters is proposed for LSI to improve its discriminative power in this paper.

3.2. The Algorithms

To proceed, the basic concepts adopted in SVD on clusters are defined in the following in order to make clear the remainder of this paper.

Definition 1 (cluster submatrix). Assuming that $A$ is a term-document matrix, that is, $A = (d_1, d_2, \ldots, d_n)$ ($d_i$ is a term-document vector), after the clustering process the $n$ document vectors are partitioned into $k$ disjoint groups (each document belongs to only one group, but all the documents have the same terms for representation). For each of these clusters, a submatrix of $A$ can be constructed by grouping the vectors of the documents that are partitioned into the same cluster by the clustering algorithm. That is, $A = [A^{(1)}, A^{(2)}, \ldots, A^{(k)}]$, because changing the order of the document vectors in $A$ can be ignored. Then, one calls $A^{(j)}$ ($1 \le j \le k$) a cluster submatrix of $A$.

Definition 2 (SVDC approximation matrix). Assuming that $A^{(1)}, A^{(2)}, \ldots, A^{(k)}$ are all the cluster submatrices of $A$, that is, $A = [A^{(1)}, A^{(2)}, \ldots, A^{(k)}]$, after SVD of each of these cluster submatrices, that is, $A^{(j)} = U^{(j)}\Sigma^{(j)}V^{(j)T}$, $1 \le j \le k$, where $r_j$ is the rank of the SVD approximation matrix of $A^{(j)}$ and $A^{(j)}_{r_j}$ is the SVD approximation matrix of $A^{(j)}$, one calls $\tilde{A} = [A^{(1)}_{r_1}, A^{(2)}_{r_2}, \ldots, A^{(k)}_{r_k}]$ an SVDC approximation matrix of $A$.

With the above two definitions of cluster submatrix and SVDC approximation matrix, we propose two versions of SVD on clusters, using $k$-Means clustering [24] and SOMs (Self-Organizing Maps) clustering [25]. These two versions are given in Algorithms 3 and 4, respectively. The difference between the two versions lies in the clustering algorithm used. For $k$-Means clustering, we need to predefine the number of clusters in the document collection, whereas for SOMs clustering it is not necessary to predefine the number of clusters.

Algorithm 3. Algorithm of SVD on clusters with $k$-Means clustering to approximate the term-document matrix for LSI is as follows:

Input:
$A$ is the term-document matrix; that is, $A = (d_1, d_2, \ldots, d_n)$.
$k$ is the predefined number of clusters in $A$.
$r_1, r_2, \ldots, r_k$ are the predefined ranks of the SVD approximation matrices for the $k$ cluster submatrices of $A$.

Output:
$\tilde{A}$ is the SVDC approximation matrix of $A$.

Method:
(1) Cluster the document vectors $d_1, d_2, \ldots, d_n$ into $k$ clusters using the $k$-Means clustering algorithm.
(2) Allocate the document vectors of $A$ according to their cluster labels to construct the cluster submatrices $A^{(1)}, A^{(2)}, \ldots, A^{(k)}$.
(3) Conduct SVD for each of the cluster submatrices of $A$ and produce their SVD approximation matrices, respectively. That is, $A^{(j)}_{r_j} = U^{(j)}_{r_j}\Sigma^{(j)}_{r_j}V^{(j)T}_{r_j}$, $1 \le j \le k$.
(4) Merge all the SVD approximation matrices of the cluster submatrices to construct the SVDC approximation matrix of $A$. That is, $\tilde{A} = [A^{(1)}_{r_1}, A^{(2)}_{r_2}, \ldots, A^{(k)}_{r_k}]$.
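The four steps of Algorithm 3 can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the helper names (`kmeans_labels`, `rank_r_approx`, `svdc`) are hypothetical, a plain Lloyd-style $k$-Means is used in place of whatever implementation was used in the experiments, and the approximated columns are written back to their original positions instead of being reordered by cluster, since the order of document vectors can be ignored.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-Means on the rows of X; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def rank_r_approx(M, r):
    """Rank-r SVD approximation of M (r is capped at the number of singular values)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = min(r, len(s))
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def svdc(A, k, ranks, seed=0):
    """SVD on clusters: cluster the documents (columns of A), approximate each submatrix."""
    labels = kmeans_labels(A.T, k, seed=seed)            # step (1)
    A_tilde = np.zeros(A.shape, dtype=float)
    for j in range(k):
        cols = np.flatnonzero(labels == j)               # step (2): cluster submatrix
        if cols.size:
            # steps (3) and (4): approximate, then merge back into place
            A_tilde[:, cols] = rank_r_approx(A[:, cols], ranks[j])
    return A_tilde, labels

rng = np.random.default_rng(1)
A = rng.random((6, 10))                  # toy matrix: 6 terms, 10 documents
A_tilde, labels = svdc(A, k=2, ranks=[1, 1])
```

With ranks large enough to keep every singular value of every cluster submatrix, the sketch reproduces $A$ exactly, which is a convenient sanity check.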

3.3. Theoretical Analysis of SVD on Clusters

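For simplicity, we only consider the case in which the term-document matrix $A \in \mathbb{R}^{m \times n}$ is clustered into two cluster submatrices $A^{(1)}$ and $A^{(2)}$; that is,

$$A = [A^{(1)}, A^{(2)}].$$

After SVD processing of $A^{(1)}$ and $A^{(2)}$, we obtain $A^{(1)} = U^{(1)}\Sigma^{(1)}V^{(1)T}$ and $A^{(2)} = U^{(2)}\Sigma^{(2)}V^{(2)T}$. For convenience of explanation, if we assume that

$$A' = \begin{pmatrix} A^{(1)} & 0 \\ 0 & A^{(2)} \end{pmatrix}, \quad U' = \begin{pmatrix} U^{(1)} & 0 \\ 0 & U^{(2)} \end{pmatrix}, \quad \Sigma' = \begin{pmatrix} \Sigma^{(1)} & 0 \\ 0 & \Sigma^{(2)} \end{pmatrix}, \quad V' = \begin{pmatrix} V^{(1)} & 0 \\ 0 & V^{(2)} \end{pmatrix},$$

we obtain $U'^{T}U' = I$ and $V'^{T}V' = I$; that is, $U'$ and $V'$ are orthogonal matrices. Hence, we also obtain

$$A' = U'\Sigma'V'^{T} = \sum_{i=1}^{k} u'_i \sigma'_i v'^{T}_i,$$

where $k$ is the total number of nonzero elements in $\Sigma^{(1)}$ and $\Sigma^{(2)}$. Thus, we can say that $U'\Sigma'V'^{T}$ is a singular value decomposition of $A'$ and that its truncation is the closest approximation of $A'$ in terms of Frobenius norm (assuming that we sort the values in $\Sigma'$ in descending order and adapt the orders of $U'$ and $V'$ accordingly).

We can conclude that there are actually two kinds of manipulations involved in SVD on clusters: the first is dimension expansion of document vectors and the second is dimension projection using SVD.

On the one hand, notice that $A \in \mathbb{R}^{m \times n}$ and $A' \in \mathbb{R}^{2m \times n}$: $A$ has been expanded into another space whose number of dimensions is twice that of the original space of $A$. That is, in $A'$, we expanded each document vector $d_i$ into a $2m$-dimensional vector $d'_i$ by

$$d'_{i,p} = \begin{cases} d_{i,q}, & \text{if } d_i \in C_j \text{ and } p = q + (j-1)m, \\ 0, & \text{otherwise.} \end{cases}$$

Here, $d'_{i,p}$ is the value of the $p$th dimension of $d'_i$, $d_{i,q}$ is the value of the $q$th dimension of $d_i$, and $1 \le i \le n$. In this way, we expanded each $d_i$ into a $2m$-dimensional vector $d'_i$ whose values are equal to the corresponding values of $d_i$ if $d_i$ belongs to cluster $C_j$, or zero if $d_i$ is not a member of that cluster.

Theoretically, according to this explanation, document vectors that are not in the same cluster submatrix have zero cosine similarity. However, in fact, all document vectors share the same terms in representation, and the dimension expansion of document vectors is derived by merely copying the original space of $A$. For this reason, in practice, we use the vectors in the approximation matrices of $A^{(1)}$ and $A^{(2)}$ for indexing, and the cosine similarities between document vectors of the two submatrices are not necessarily zero. This validates our motivation for using the similarity measure for LSI performance evaluation in Section 4.2.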

Algorithm 4. Algorithm of SVD on clusters with SOMs clustering to approximate the term-document matrix for LSI is as follows:

Input:
$A$ is the term-document matrix; that is, $A = (d_1, d_2, \ldots, d_n)$.
$\rho$ is the predefined preservation rate for the submatrices of $A$.

Output:
$\tilde{A}$ is the SVDC approximation matrix of $A$.

Method:
(1) Cluster the document vectors $d_1, d_2, \ldots, d_n$ into $k$ clusters using the SOMs clustering algorithm.
(2) Allocate the document vectors of $A$ according to their cluster labels to construct the cluster submatrices $A^{(1)}, A^{(2)}, \ldots, A^{(k)}$ (notice here that $k$ is not a predefined number of clusters of $A$ but the number of neurons which are matched with at least 1 document vector).
(3) Conduct SVD using the predefined preservation rate $\rho$ for each cluster submatrix of $A$ and produce its SVD approximation matrix. That is, $A^{(j)}_{r_j} = U^{(j)}_{r_j}\Sigma^{(j)}_{r_j}V^{(j)T}_{r_j}$, $1 \le j \le k$, where the rank $r_j$ is determined by $\rho$.
(4) Merge all the SVD approximation matrices of the cluster submatrices to construct the SVDC approximation matrix of $A$. That is, $\tilde{A} = [A^{(1)}_{r_1}, A^{(2)}_{r_2}, \ldots, A^{(k)}_{r_k}]$.

On the other hand, when using SVD for $A$, that is, $A = U\Sigma V^{T}$, we obtain $U^{T}A = \Sigma V^{T}$, and we further say that SVD has folded each document vector of $A$ into a reduced space (assuming that we use $U_k^{T}$ for the left multiplication of $A$, the number of dimensions of the original document vectors is reduced to $k$), which is represented by $U$ and reflects the latent semantic dimensions characterized by the term cooccurrence of $A$ [3]. In the same way, for the expanded matrix $A' = \mathrm{diag}(A^{(1)}, A^{(2)})$ with decomposition $A' = U'\Sigma'V'^{T}$, we have $U'^{T}A' = \Sigma'V'^{T}$, and we may further say that $A'$ is projected into the space represented by $U'$. However, here $U'$ is not characterized by the term cooccurrence of $A$ but by the existing clusters of $A$ and the term cooccurrence within each cluster submatrix of $A$.
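The two manipulations can be checked numerically on a toy example. The sketch below assumes the block-diagonal reading of dimension expansion for two clusters; the helper names `block_diag` and `cosine` and the random matrices are hypothetical and serve only as illustration.

```python
import numpy as np

def block_diag(X, Y):
    """Place X and Y on the diagonal of a larger matrix, zeros elsewhere."""
    return np.block([[X, np.zeros((X.shape[0], Y.shape[1]))],
                     [np.zeros((Y.shape[0], X.shape[1])), Y]])

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)
A1 = rng.random((4, 3))      # cluster submatrix 1: 4 terms, 3 documents
A2 = rng.random((4, 2))      # cluster submatrix 2: 4 terms, 2 documents

# Dimension expansion: cluster 1 uses dimensions 0..3, cluster 2 uses 4..7.
A_exp = block_diag(A1, A2)

U1, s1, V1t = np.linalg.svd(A1, full_matrices=False)
U2, s2, V2t = np.linalg.svd(A2, full_matrices=False)
U_exp = block_diag(U1, U2)
S_exp = np.diag(np.concatenate([s1, s2]))
Vt_exp = block_diag(V1t, V2t)

# Cross-cluster similarity in the expanded space versus the shared term space.
cos_expanded = cosine(A_exp[:, 0], A_exp[:, 3])
cos_original = cosine(A1[:, 0], A2[:, 0])
```

In the expanded space the two documents from different clusters have exactly zero cosine similarity, while in the shared term space their similarity is generally nonzero, which is why the shared-space vectors are the ones used for indexing.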

3.4. The Computation Complexity of SVD on Clusters

The computation complexity of SVDC is $O(n_j^2 r_j^3)$, where $n_j$ is the maximum number of documents in a cluster submatrix $A^{(j)}$ ($1 \le j \le k$) and $r_j$ is the corresponding rank used to approximate $A^{(j)}$. Because the original term-document matrix $A$ is partitioned into $k$ cluster submatrices by the clustering algorithm, we can estimate $n_j \approx n/k$ and $r_j \approx r/k$, where $n$ is the number of documents in $A$ and $r$ is the rank used in the SVD of $A$. That is to say, compared with the $O(n^2 r^3)$ complexity of SVD, the computation complexity of SVDC is decreased by a factor of approximately $k^5$. The larger the value of $k$ is, that is, the more document clusters are set for a document collection, the more computation is saved by SVD on clusters in matrix factorization. Although one may argue that the clustering process in SVD on clusters brings additional computation, in fact, the cost of clustering is far smaller than that of SVD. For instance, the computation complexity of $k$-Means clustering is $O(nkt)$ [24], where $n$ and $k$ have the same meaning as in SVD on clusters and $t$ is the number of iterations. The computation complexity of clustering is therefore negligible compared with that of SVD. The computation complexity of SOMs clustering is similar to that of $k$-Means clustering.

3.5. Updating of SVD on Clusters

In rapidly changing environments such as the World Wide Web, the document collection is frequently updated with new documents and terms constantly being added, and there is a need to find the latent-concept subspace for the updated document collection. In order to avoid recomputing the matrix decomposition, there are two kinds of updates for an established latent subspace of LSI: folding in new documents and folding in new terms.

3.5.1. Folding in New Documents

Let $D$ denote $p$ new document vectors to be appended to the original term-document matrix $A$; then $D$ is an $m \times p$ matrix. Thus, the new term-document matrix is $B = (A, D)$. Folding in represents $B$ as $U\Sigma V_B^{T}$ with $V_B^{T} = (V^{T}, \Sigma^{-1}U^{T}D)$. That is, if $D$ is appended to the original matrix $A$, $U$ and $\Sigma$ are left unchanged and only $V$ is extended. However, here $V_B$ is not an orthogonal matrix like $V$. So $B_k$ is not the closest approximation matrix to $B$ in terms of Frobenius norm. This is why the more documents are appended to $A$, the more the representation of the SVD approximation matrix deteriorates under the folding-in method.

Despite this, to fold new document vectors into an existing SVD decomposition, a projection $\hat{D}$ of $D$ onto the span of the current term vectors (columns of $U_k$) is computed by (5). Here, $k$ is the rank of the approximation matrix:

$$\hat{D} = D^{T}U_k\Sigma_k^{-1}. \quad (5)$$

As for folding these new document vectors into the established SVDC decomposition of matrix $A$, we should first decide the cluster submatrix of $A$ to which each vector in $D$ should be appended. Next, using (5), we can fold the new document vector into that cluster submatrix. Assuming that $d$ is a new document vector of $D$, first, the Euclidean distance between $d$ and $c_j$ ($c_j$ is the cluster center of cluster submatrix $A^{(j)}$) is calculated using (6), where $m$ is the dimension of $d$, that is, the number of terms used in $A$. One has

$$\mathrm{dist}(d, c_j) = \sqrt{\sum_{i=1}^{m}(d_i - c_{j,i})^2}. \quad (6)$$

Second, $d$ is appended to the $s$th cluster, where $d$ has the minimum Euclidean distance to the center of the $s$th cluster. That is,

$$s = \arg\min_{j}\ \mathrm{dist}(d, c_j). \quad (7)$$

Third, (5) is used to update the SVD of $A^{(s)}$. That is,

$$\hat{d} = d^{T}U^{(s)}_{r_s}\big(\Sigma^{(s)}_{r_s}\big)^{-1}. \quad (8)$$

Here, $r_s$ is the rank of the approximation matrix of $A^{(s)}$. Finally, $\tilde{A}$ is updated by replacing $A^{(s)}_{r_s}$ with

$$A'^{(s)}_{r_s} = U^{(s)}_{r_s}\Sigma^{(s)}_{r_s}\big(V^{(s)T}_{r_s}, \hat{d}^{T}\big). \quad (9)$$

Thus, we finish the process of folding a new document vector into the SVDC decomposition, and the centroid of the $s$th cluster is updated with the new document. The computational complexity of updating SVDC depends on the size of $U^{(s)}_{r_s}$ and $\Sigma^{(s)}_{r_s}$ because it involves only one-way matrix multiplication.
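The document folding-in procedure for SVDC (nearest centroid, projection, and centroid update) can be sketched as follows. The dictionary layout of a cluster (`U`, `s`, `V`, `centroid`, `n_docs`) and the function names are hypothetical choices made for this illustration, and the centroid is updated as a running mean, which is one reasonable reading of the centroid update.

```python
import numpy as np

def make_cluster(A_j, r):
    """Store the rank-r SVD factors and the centroid of one cluster submatrix."""
    U, s, Vt = np.linalg.svd(A_j, full_matrices=False)
    return {"U": U[:, :r], "s": s[:r], "V": Vt[:r, :].T,
            "centroid": A_j.mean(axis=1), "n_docs": A_j.shape[1]}

def fold_in_document(d, clusters):
    """Fold a new document vector d (length m) into its nearest cluster."""
    # Euclidean distance from d to every cluster centroid.
    dists = [np.linalg.norm(d - c["centroid"]) for c in clusters]
    s_idx = int(np.argmin(dists))                # nearest cluster
    c = clusters[s_idx]
    d_hat = d @ c["U"] / c["s"]                  # projection d^T U_r Sigma_r^{-1}
    c["V"] = np.vstack([c["V"], d_hat])          # new document becomes a new row of V
    c["centroid"] = (c["centroid"] * c["n_docs"] + d) / (c["n_docs"] + 1)
    c["n_docs"] += 1
    return s_idx, d_hat

rng = np.random.default_rng(0)
A1 = rng.random((5, 3))              # cluster 1: 5 terms, 3 documents
A2 = rng.random((5, 4)) + 10.0       # cluster 2: well separated from cluster 1
clusters = [make_cluster(A1, 3), make_cluster(A2, 4)]

# Fold in a copy of an existing document of cluster 2.
s_idx, d_hat = fold_in_document(A2[:, 0].copy(), clusters)
```

When the folded-in vector is a copy of a document already in the chosen cluster and no singular values were discarded, the projection reproduces that document's existing row of $V$, which gives a simple correctness check.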

3.5.2. Folding in New Terms

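Let $T$ denote a collection of $q$ term vectors for SVD update. Then $T$ is a $q \times n$ matrix. Thus, we have the new term-document matrix $C$, with $C = \begin{pmatrix} A \\ T \end{pmatrix}$. Folding in represents $C$ as $U_C\Sigma V^{T}$ with $U_C = \begin{pmatrix} U \\ TV\Sigma^{-1} \end{pmatrix}$. That is, $\Sigma$ and $V$ are left unchanged and only $U$ is extended. Here, $U_C$ is not an orthonormal matrix. So $C_k$ is not the closest approximation matrix to $C$ in terms of Frobenius norm. Thus, the more terms are appended to the approximation matrix, the larger the deviation between $C_k$ and $C$ induced in document representation.

Although the folding-in method specified above has this disadvantage, no better method is available so far if recomputing the SVD is not desired. To fold term vectors into an existing SVD decomposition, a projection, $\hat{T}$, of $T$ onto the span of the current document vectors (rows of $V_k^{T}$) is determined by

$$\hat{T} = TV_k\Sigma_k^{-1}. \quad (10)$$

Concerning folding in an element $t$ of $T$, the updating process of SVDC is more complex than that of SVD. First, the weight of $t$ in each document of each cluster is calculated as

$$t^{(i)} = (w_{i,1}, w_{i,2}, \ldots, w_{i,n_i}), \quad 1 \le i \le k. \quad (11)$$

Here, $w_{i,j}$ is the weight of the new term in the $j$th document of the $i$th cluster submatrix $A^{(i)}$, $n_i$ is the number of documents in $A^{(i)}$, and $k$ is the number of clusters in the original term-document matrix $A$. Second, for each $A^{(i)}_{r_i}$ in $\tilde{A}$ of Definition 2, the process of folding a new term into an SVD is used to update each $A^{(i)}_{r_i}$, as shown in

$$\hat{t}^{(i)} = t^{(i)}V^{(i)}_{r_i}\big(\Sigma^{(i)}_{r_i}\big)^{-1}. \quad (12)$$

Then, each $A^{(i)}_{r_i}$ is updated using

$$A'^{(i)}_{r_i} = \begin{pmatrix} U^{(i)}_{r_i} \\ \hat{t}^{(i)} \end{pmatrix}\Sigma^{(i)}_{r_i}V^{(i)T}_{r_i}. \quad (13)$$

Finally, the approximation term-document matrix $\tilde{A}$ of Definition 2 is reconstructed with all updated $A'^{(i)}_{r_i}$ as

$$\tilde{A}' = [A'^{(1)}_{r_1}, A'^{(2)}_{r_2}, \ldots, A'^{(k)}_{r_k}]. \quad (14)$$

Thus, we finish the process of folding $t$ into the SVDC decomposition. For folding $q$ term vectors into an existing SVDC decomposition, we need to repeat the processes of (11)–(14) for each element of $T$ one by one.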
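A minimal sketch of folding one new term into an SVDC decomposition is given below. As before, the per-cluster dictionary layout and the function names are hypothetical, and the per-cluster weights of the new term are simply supplied as input rather than computed by any particular weighting scheme.

```python
import numpy as np

def make_cluster(A_j, r):
    """Store the rank-r SVD factors of one cluster submatrix."""
    U, s, Vt = np.linalg.svd(A_j, full_matrices=False)
    return {"U": U[:, :r], "s": s[:r], "V": Vt[:r, :].T}

def fold_in_term(weights_per_cluster, clusters):
    """Fold one new term into every cluster.

    weights_per_cluster[i] holds the weights of the new term in the
    documents of cluster i (one weight per document of that cluster).
    """
    for t_i, c in zip(weights_per_cluster, clusters):
        t_hat = t_i @ c["V"] / c["s"]            # projection t V_r Sigma_r^{-1}
        c["U"] = np.vstack([c["U"], t_hat])      # new term becomes a new row of U
    # Rebuild each approximated cluster submatrix from the updated factors.
    return [(c["U"] * c["s"]) @ c["V"].T for c in clusters]

rng = np.random.default_rng(0)
A1 = rng.random((5, 3))              # cluster 1: 5 terms, 3 documents
A2 = rng.random((5, 4))              # cluster 2: 5 terms, 4 documents
clusters = [make_cluster(A1, 3), make_cluster(A2, 4)]

# The new term reuses the weights of term 0, so the result is easy to check.
blocks = fold_in_term([A1[0].copy(), A2[0].copy()], clusters)
```

Each rebuilt block gains one row for the new term; with no singular values discarded, that row equals the supplied weights.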

4. Experiments and Evaluation

4.1. The Corpus

Reuters-21578 distribution 1.0 is used for performance evaluation as the English corpus, and it is available online (http://www.daviddlewis.com/resources/testcollections/reuters21578/). It collects 21,578 news articles from the Reuters newswire in 1987. Here, the documents from 4 categories, “crude” (520 documents), “agriculture” (574 documents), “trade” (514 documents), and “interest” (424 documents), are assigned as the target English document collection. That is, 2,042 documents from this corpus are selected for evaluation. Stop-words are taken from the USPTO (United States Patent and Trademark Office) patent full-text and image database at http://patft.uspto.gov/netahtml/PTO/help/stopword.htm, which includes about 100 common words. The part of speech of each English word is determined by QTAG, a probabilistic parts-of-speech tagger that can be downloaded freely online: http://www.english.bham.ac.uk/staff/omason/software/qtag.html. The Porter stemming algorithm is used for English stemming and can be downloaded freely online: http://tartarus.org/~martin/PorterStemmer/. After stop-word elimination and stemming, a total of 50,837 sentences and 281,111 individual words in these documents is estimated.

TanCorpV1.0 is used as the Chinese corpus in this research, and it is available on the Internet (http://www.cnblogs.com/tristanrobert/archive/2012/02/16/2354973.html). Here, documents from 4 categories, “agriculture,” “history,” “politics,” and “economy,” are assigned as the target Chinese corpus. For each category, 300 documents were selected randomly from the original corpus, yielding a corpus of 1,200 documents. Because Chinese is character based, we conducted morphological analysis using the ICTCLAS tool, a Chinese Lexical Analysis System (online: http://ictclas.nlpir.org/). After morphological analysis, a total of 219,115 sentences and 5,468,301 individual words is estimated.

4.2. Evaluation Method

We use the similarity measure as the method for performance evaluation. The basic assumption behind the similarity measure is that document similarity should be higher for any document pair relevant to the same topic (intratopic pair) than for any pair relevant to different topics (cross-topic pair). This assumption is based on consideration of how the documents would be used by applications. For instance, in text clustering by $k$-Means, clusters are constructed by collecting the document pairs having the greatest similarity at each update.

In this research, documents in the same category are regarded as having the same topic, and documents in different categories are regarded as cross-topic pairs. Firstly, document pairs are produced by coupling each document vector in a predefined category with another document vector in the whole corpus, iteratively. Secondly, cosine similarity is computed for each document pair, and all the document pairs are sorted in descending order of similarity. Finally, (15) and (16) are used to compute the average precision of the similarity measure. More details concerning the similarity measure can be found in [9]. One has

$$\mathrm{precision}(p_k) = \frac{\#\{\text{intratopic pairs } p_j : j \le k\}}{k}, \quad (15)$$

$$\mathrm{average\ precision} = \frac{1}{m}\sum_{k=1}^{m}\mathrm{precision}(p_k). \quad (16)$$

Here, $p_j$ denotes the document pair that has the $j$th greatest similarity value of all document pairs. $k$ is varied from 1 to $m$, and $m$ is the total number of document pairs. The larger the average precision is, the more of the highly similar document pairs come from the same category and are thus regarded as having the same topic. That is, the better the performance is. A simplified method would be to predefine $k$ as fixed numbers such as 10, 20, and 200 (as suggested by one of the reviewers), in which case (16) is not necessary. However, due to the lack of knowledge of the optimal $k$, we conjecture that an average precision over all possible $k$ is more convincing for performance evaluation.
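The evaluation measure can be sketched in a few lines of NumPy. The sketch assumes the reading in which precision at position $k$ is the fraction of intratopic pairs among the $k$ most similar pairs and the average is taken over all positions; the function name and the toy values are hypothetical.

```python
import numpy as np

def average_precision(similarities, is_intratopic):
    """Average, over all cut-offs k, of the fraction of intratopic pairs in the top k."""
    sims = np.asarray(similarities, dtype=float)
    # Sort pairs by descending similarity (stable, so ties keep their input order).
    order = np.argsort(-sims, kind="stable")
    hits = np.asarray(is_intratopic, dtype=float)[order]
    k = np.arange(1, len(hits) + 1)
    precision_at_k = np.cumsum(hits) / k
    return float(precision_at_k.mean())

# Four document pairs: the two most similar pairs are intratopic.
ap_good = average_precision([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# Same similarities, but the intratopic pairs are the least similar ones.
ap_bad = average_precision([0.9, 0.8, 0.3, 0.1], [0, 0, 1, 1])
```

An indexing method that ranks intratopic pairs first obtains the higher value, which matches the assumption behind the similarity measure.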

4.3. Experimental Results of Indexing

For both the Chinese and the English corpus, we carried out experiments to measure the similarities of documents in each category. When using SVDC in Algorithm 3 for LSI, the predefined number of clusters in the $k$-Means clustering algorithm is set to 4 for both Chinese and English documents, which is equal to the number of categories used in both corpora. When using SVDC in Algorithm 4 for LSI, an array of neurons is set in SOMs clustering to map the original document vectors to this target space, and the limit on the number of iterations is set to 10,000. As a result, Chinese documents are mapped to 11 clusters and English documents are mapped to 16 clusters. Table 2 shows the $F$-measure values [26] of the clustering results produced by $k$-Means and SOMs clustering, respectively. The larger the $F$-measure value, the better the clustering result. Here, $k$-Means has produced better clustering results than the SOMs clustering algorithm.

Table 2: $F$-measures of clustering results produced by $k$-Means and SOMs on Chinese and English documents.

Average precision (see (16)) on the 4 categories of both English and Chinese documents is used as the performance measure. Tables 3 and 4 give the experimental results of the similarity measure on the English and Chinese documents, respectively. For SVD, SVDC, and ADE, the only parameter required to compute the latent subspace is the preservation rate, which is equal to $k/\mathrm{rank}(A)$, where $k$ is the rank of the approximation matrix. IRR and SVR need, besides the preservation rate, another parameter, the rescaling factor, to compute the latent subspace.
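The relation between preservation rate and approximation rank can be made concrete with a short sketch. It assumes that the rank is obtained by rounding the product of the preservation rate and the matrix rank to the nearest integer, with a minimum of 1; the rounding rule and the function name are assumptions, since they are not specified here.

```python
import numpy as np

def approx_by_preservation_rate(M, rate):
    """Rank-k approximation of M with k = rate * rank(M), rounded (assumed rule)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    rank = int(np.linalg.matrix_rank(M))
    k = max(1, int(np.floor(rate * rank + 0.5)))   # at least one dimension is kept
    return (U[:, :k] * s[:k]) @ Vt[:k, :], k

rng = np.random.default_rng(0)
M = rng.random((6, 4))                       # toy matrix of rank 4
M_half, k_half = approx_by_preservation_rate(M, 0.5)
M_full, k_full = approx_by_preservation_rate(M, 1.0)
```

A preservation rate of 1.0 keeps every singular value and returns the original matrix, which corresponds to the baseline setting used for comparison.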

Table 3: Similarity measure on English documents of SVD on clusters and other SVD based LSI methods. PR is the abbreviation for “preservation rate” and the best performances (measured by average precision) are marked in bold type.
Table 4: Similarity measure on Chinese documents of SVD on clusters and other SVD based LSI methods. PR is the abbreviation for “preservation rate” and the best performances (measured by average precision) are marked in bold type.

To compare document indexing methods at different parameter settings, the preservation rate is varied from 0.1 to 1.0 in increments of 0.1 for SVD, SVDC, SVR, and ADE. For SVR, the rescaling factor is set to 1.35, as suggested in [10] for optimal average results in information retrieval. For IRR, the preservation rate is set to 0.1 and the rescaling factor is varied from 1 to 10, the same as in [13]. Note that in Tables 3 and 4, for IRR, the preservation rate of 1 corresponds to rescaling factor 10, 0.9 to 9, and so forth. The baseline method can be regarded as pure SVD at preservation rate 1.0.

We can see from Tables 3 and 4 that, for both English and Chinese similarity measure, SVDC with k-Means, SVDC with SOMs clustering, and SVD outperform the other SVD based methods. In most cases, SVDC with k-Means and SVDC with SOMs clustering perform better than SVD. This outcome supports our motivation for SVD on clusters in Section 3.1: the documents in a corpus do not necessarily lie in the same latent space but may lie in different latent subspaces. Thus, SVD on clusters, which constructs latent subspaces on document clusters, can characterize document similarity more accurately and appropriately than other SVD based methods.
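
The idea of building a separate latent subspace for each document cluster can be illustrated with a simplified NumPy sketch. It assumes that cluster labels are already available (for example, from k-Means), approximates each cluster's term-document submatrix by its own truncated SVD, and writes the result back column by column. This is only an illustration of the general idea; it is not Algorithm 3 or Algorithm 4, and the function name, the rounding rule for k, and the way the per-cluster approximations are assembled are assumptions.

```python
import numpy as np


def svd_per_cluster(A, labels, preservation_rate):
    """Approximate each cluster's submatrix of A by its own truncated SVD.

    A      : terms x documents matrix
    labels : one cluster label per document (column of A)
    Assumption: per-cluster rank k = round(preservation_rate * rank).
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    A_hat = np.zeros_like(A)
    for c in np.unique(labels):
        cols = np.flatnonzero(labels == c)  # documents in this cluster
        sub = A[:, cols]
        U, s, Vt = np.linalg.svd(sub, full_matrices=False)
        rank = int(np.linalg.matrix_rank(sub))
        k = min(len(s), max(1, int(round(preservation_rate * rank))))
        A_hat[:, cols] = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_hat
```

Interdocument similarity could then be computed, for instance, as the cosine between columns of the returned matrix.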

Considering the variances of average precision across categories, we admit that SVDC may not be a robust approach, since its superiority over SVD is not obvious (as pointed out by one of the reviewers). However, we regard the variances of the mentioned methods as comparable to each other because they have similar values.

Moreover, SVDC with k-Means outperforms SVDC with SOMs clustering. The better performance of SVDC with k-Means can be attributed to the better clustering produced by k-Means than by SOMs (see Table 2). When the preservation rate declines from 1 to 0.1, the performances of SVDC with k-Means and SVD increase significantly. However, the performance of SVDC with SOMs clustering decreases when the preservation rate is smaller than 0.3. We hypothesize that SVDC with k-Means has effectively captured the latent structure of documents, whereas SVDC with SOMs clustering has not captured the appropriate latent structure because of its poor capacity in document clustering.

To better illustrate the effectiveness of each method, the classic t-test is employed [27, 28]. Tables 5 and 6 show the results of the t-test on the performances of the examined methods on English and Chinese documents, respectively. The following codification of the P value in ranges was used: “” (“”) means that the P value is less than or equal to 0.01, indicating strong evidence that a method produces a significantly better (worse) similarity measure than another one; “” (“”) means that the P value is larger than 0.01 and less than or equal to 0.05, indicating weak evidence that a method produces a significantly better (worse) similarity measure than another one; “” means that the P value is greater than 0.05, indicating that the compared methods do not differ significantly in performance. We can see that SVDC with k-Means outperforms both SVDC with SOMs clustering and pure SVD on both the English and Chinese corpora. Meanwhile, SVDC with SOMs clustering performs very similarly to pure SVD.
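
The P value ranges described above can be expressed as a small lookup, sketched here in Python. The function name and the textual labels are hypothetical stand-ins for the symbols used in Tables 5 and 6, and the sign of the mean difference in average precision is assumed to decide between "better" and "worse".

```python
def codify_p_value(p_value, mean_difference):
    """Map a t-test P value to the three evidence levels used in the text.

    mean_difference > 0 is assumed to mean the first method scored higher.
    The returned strings are illustrative labels, not the table symbols.
    """
    if p_value > 0.05:
        return "no significant difference"
    strength = "strong" if p_value <= 0.01 else "weak"
    direction = "better" if mean_difference > 0 else "worse"
    return f"{strength} evidence: {direction}"
```

For example, a P value of 0.03 with a negative mean difference falls in the weak evidence range and is labeled as worse.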

Table 5: Results of t-test on the performances of similarity measure of SVD on clusters and other SVD based LSI methods in English corpus.
Table 6: Results of t-test on the performances of similarity measure of SVD on clusters and other SVD based LSI methods in Chinese corpus.
4.4. Experimental Results of Updating

Figure 1 shows the performance of the updating process of SVD on clusters in comparison with SVD updating. The vertical axis indicates average precision, and the horizontal axis indicates the retaining ratio of original documents for the initial SVDC or SVD approximation. For example, a retaining ratio of 0.8 indicates that 80 percent of the documents (terms) in the corpus are used for approximation and the remaining 20 percent of the documents (terms) are used for updating the approximation matrix. Here, the preservation rates of the approximation matrices are uniformly set to 0.8. We only compared SVDC with k-Means and SVD in updating because SVDC with SOMs clustering has not produced a competitive performance in similarity measure.

Figure 1: Similarity measure of SVDC with k-Means and SVD for updating; the preservation rates of their approximation matrices are set as 0.8.

We can see from Figure 1 that, in folding in new documents, the updating process of SVDC with k-Means is superior to SVD updating on similarity measure. An obvious trend is that the advantage of SVDC with k-Means over SVD grows as the number of training documents declines. We conjecture that the lower diversity in the latent spaces of a small number of training documents can improve the similarity of documents in the same category.
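
To make the retaining ratio concrete, the sketch below splits the documents of a term-document matrix according to a retaining ratio, computes a rank-k SVD on the retained part, and folds in the remaining documents with the standard LSI folding-in projection (each new document vector d is mapped to d^T U_k Σ_k^{-1}). This illustrates plain SVD folding-in only, not the SVDC updating process evaluated in Figure 1; the function name and the choice of taking the first columns as the retained documents are assumptions.

```python
import numpy as np


def split_and_fold_in(A, retaining_ratio, k):
    """Fold held-out documents into a rank-k SVD of the retained ones.

    A : terms x documents matrix; the first round(ratio * n) columns
        are treated as the retained documents (an illustrative choice).
    Returns document coordinates (documents x k): retained rows first,
    folded-in rows after them.
    """
    A = np.asarray(A, dtype=float)
    n_docs = A.shape[1]
    n_keep = max(1, int(round(retaining_ratio * n_docs)))
    A_init, A_new = A[:, :n_keep], A[:, n_keep:]

    U, s, Vt = np.linalg.svd(A_init, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

    # Standard folding-in: project each new document onto the k-dim space.
    folded = (A_new.T @ U_k) / s_k
    return np.vstack([V_k, folded])
```

Folding in a document that is already part of the retained set reproduces its row of V_k, which is a quick sanity check for the projection.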

In folding in new terms, SVDC with k-Means is superior to SVD as well. However, the performances of both methods drop dramatically in the initial phase and increase after a critical value. This phenomenon can be explained as follows: when the retaining ratio is large, removing more and more index terms from the term-document matrix hurts the latent structure of the document space. However, when the retaining ratio drops to a small value (the critical value), the latent structure of the document space is decided principally by the appended terms, which outnumber the remaining terms. For this reason, document similarities in the corpus are determined by the appended index terms. Furthermore, we observe that the critical value on the Chinese corpus is larger than that on the English corpus. This can be explained by the fact that the number of Chinese index terms (21475) is much larger than that of English index terms (3269), while the number of Chinese documents (1200) is smaller than that of English documents (2402). Thus, the structure of the Chinese latent space is much more robust than that of the English latent space, which is very sensitive to the number of index terms.

5. Concluding Remarks

This paper proposes SVD on clusters as a new indexing method for Latent Semantic Indexing. Based on a review of current trends in linear algebraic methods for LSI, we claim that the state of the art of LSI roughly follows two lines: SVD based LSI methods and non-SVD based LSI methods. Then, with the specification of its motivation, SVD on clusters is proposed. We describe the algorithm of SVD on clusters with two different clustering algorithms: k-Means and SOMs clustering. The computation complexity of SVD on clusters, its theoretical analysis, and its updating process for folding in new documents and terms are presented in this paper. SVD on clusters differs from existing SVD based LSI methods in the way it eliminates noise from the term-document matrix. It neither changes the weights of the singular values, as done in SVR and ADE, nor revises the directions of the singular vectors, as done in IRR. Instead, it adapts the structure of the original term-document matrix based on document clusters. Finally, two document collections, a Chinese corpus and an English corpus, are used to evaluate the proposed methods using similarity measure in comparison with other SVD based LSI methods. Experimental results demonstrate that in most cases SVD on clusters outperforms other SVD based LSI methods. Moreover, the performance of the clustering technique used in SVD on clusters plays an important role in its performance.

Possible applications of SVD on clusters include the automatic categorization of large amounts of Web documents, where LSI is an alternative for document indexing but has huge computation complexity, and the refinement of document clustering, where the interdocument similarity measure is decisive for performance. We admit that this paper covers merely linear algebra methods for latent semantic indexing. In the future, we will compare SVD on clusters with topic based methods for Latent Semantic Indexing on interdocument similarity measure, such as Probabilistic Latent Semantic Indexing [29] and Latent Dirichlet Allocation [30].

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research was supported in part by National Natural Science Foundation of China under Grants nos. 71101138, 61379046, 91218301, 91318302, and 61432001; Beijing Natural Science Fund under Grant no. 4122087; the Fundamental Research Funds for the Central Universities (buctrc201504).

References

  1. C. White, “Consolidating, accessing and analyzing unstructured data,” http://www.b-eye-network.com/view/2098.
  2. R. Rahimi, A. Shakery, and I. King, “Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework,” Information Processing & Management, vol. 52, no. 2, pp. 299–318, 2016.
  3. M. W. Berry, S. T. Dumais, and G. W. O'Brien, “Using linear algebra for intelligent information retrieval,” SIAM Review, vol. 37, no. 4, pp. 573–595, 1995.
  4. M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, “CDIM: document clustering by discrimination information maximization,” Information Sciences, vol. 316, no. 20, pp. 87–106, 2015.
  5. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
  6. C. Laclau and M. Nadif, “Hard and fuzzy diagonal co-clustering for document-term partitioning,” Neurocomputing, vol. 193, pp. 133–147, 2016.
  7. G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd edition, 1996.
  8. L. Yue, W. Zuo, T. Peng, Y. Wang, and X. Han, “A fuzzy document clustering approach based on domain-specified ontology,” Data and Knowledge Engineering, vol. 100, pp. 148–166, 2015.
  9. R. K. Ando, “Latent semantic space: iterative scaling improves precision of inter-document similarity measurement,” in Proceedings of the 23rd ACM International SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00), pp. 216–223, Athens, Greece, July 2000.
  10. H. Yan, W. I. Grosky, and F. Fotouhi, “Augmenting the power of LSI in text retrieval: singular value rescaling,” Data and Knowledge Engineering, vol. 65, no. 1, pp. 108–125, 2008.
  11. F. Jiang and M. L. Littman, “Approximate dimension equalization in vector-based information retrieval,” in Proceedings of the 17th International Conference on Machine Learning (ICML '00), pp. 423–430, Stanford, Calif, USA, 2000.
  12. T. G. Kolda and D. P. O'Leary, “A semidiscrete matrix decomposition for latent semantic indexing in information retrieval,” ACM Transactions on Information Systems, vol. 16, no. 4, pp. 322–346, 1998.
  13. X. He, D. Cai, H. Liu, and W. Y. Ma, “Locality preserving indexing for document representation,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 218–225, 2004.
  14. E. P. Jiang and M. W. Berry, “Information filtering using the Riemannian SVD (R-SVD),” in Solving Irregularly Structured Problems in Parallel: 5th International Symposium, IRREGULAR'98 Berkeley, California, USA, August 9–11, 1998 Proceedings, vol. 1457 of Lecture Notes in Computer Science, pp. 386–395, 2005.
  15. M. Welling, Fisher Linear Discriminant Analysis, http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf.
  16. J. Gao and J. Zhang, “Clustered SVD strategies in latent semantic indexing,” Information Processing and Management, vol. 41, no. 5, pp. 1051–1063, 2005.
  17. V. Castelli, A. Thomasian, and C.-S. Li, “CSVD: clustering and singular value decomposition for approximate similarity search in high-dimensional spaces,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 671–685, 2003.
  18. M. W. Berry, “Large scale singular value computations,” International Journal of Supercomputer Applications, vol. 6, pp. 13–49, 1992.
  19. C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, The MIT Press, 4th edition, 2001.
  20. G. Salton, A. Wang, and C. S. Yang, “A vector space model for information retrieval,” Journal of American Society for Information Science, vol. 18, no. 11, pp. 613–620, 1975.
  21. L. Jiang, C. Li, S. Wang, and L. Zhang, “Deep feature weighting for naive Bayes and its application to text classification,” Engineering Applications of Artificial Intelligence, vol. 52, pp. 26–39, 2016.
  22. T. Van Phan and M. Nakagawa, “Combination of global and local contexts for text/non-text classification in heterogeneous online handwritten documents,” Pattern Recognition, vol. 51, pp. 112–124, 2016.
  23. H. Zha, O. Marques, and H. D. Simon, “Large scale SVD and subspace-based methods for information retrieval,” in Proceedings of the 5th International Symposium on Solving Irregularly Structured Problems in Parallel (IRREGULAR '98), pp. 29–42, Berkeley, Calif, USA, August 1998.
  24. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Boston, Mass, USA, 2nd edition, 2006.
  25. T. Kohonen, Self-Organization and Associative Memory, vol. 8 of Springer Series in Information Sciences, Springer, New York, NY, USA, 2nd edition, 1988.
  26. M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” in Proceedings of the KDD Workshop on Text Mining, pp. 109–110, 2000.
  27. Y. M. Yang and X. Liu, “A re-examination of text categorization methods,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 42–49, Berkeley, Calif, USA, August 1999.
  28. R. F. Corrêa and T. B. Ludermir, “Improving self-organization of document collections by semantic mapping,” Neurocomputing, vol. 70, no. 1–3, pp. 62–69, 2006.
  29. T. Hofmann, “Learning the similarity of documents: an information-geometric approach to document retrieval and categorization,” in Advances in Neural Information Processing Systems 12, pp. 914–920, The MIT Press, 2000.
  30. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.