Abstract

The latent semantic analysis (LSA) is a mathematical/statistical method for discovering hidden concepts between terms and documents, or within a document collection (i.e., a large corpus of text). Each document and term of the corpus is expressed as a vector whose elements correspond to these concepts, and together they form a term-document matrix. The LSA then uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, to extract the more important relations, and to reduce the computational time. The irrelevant information is called “noise” and does not have a noteworthy effect on the meaning of the document collection. This is an essential step in the LSA. The singular value decomposition (SVD) has been the main tool for obtaining the low-rank approximation in the LSA. Since the document collection is dynamic (i.e., the term-document matrix is subject to repeated updates), the approximation must be renewed, either by recomputing the SVD or by updating it. However, the computational cost of recomputing or updating the SVD of the term-document matrix is very high when new terms and/or documents are added to a preexisting document collection. This issue has opened the door to using other matrix decompositions for the LSA, such as ULV- and URV-based decompositions. This study shows that the truncated ULV decomposition (TULVD) is a good alternative to the SVD in LSA modeling.

1. Introduction

The latent semantic analysis (LSA) is a mathematical/statistical method used for discovering the latent relationships between terms and documents, or within a collection of documents (i.e., a large corpus of text) [1]. Although the LSA works especially well on textual data, it has recently become very popular in the academic community because of its wide variety of applications in content information [2], sociological discourse analysis [3], image retrieval systems [4], human cognition, and human learning [5]. The LSA can be applied to any collection of documents that has been cleaned of its syntactical and grammatical structure. If the collection contains $m$ terms and $n$ documents, it is represented by a matrix $A$ of dimension $m \times n$ called the term-document matrix.

The LSA uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, to extract the more important relations, and to reduce the computational time. The irrelevant information is called “noise” and does not have a noteworthy effect on the meaning of the document collection [6].

The rank-$k$ approximation of the term-document matrix $A$ is, for a positive integer $k$, the matrix $A_k$ that satisfies

$\|A - A_k\| = \min_{\operatorname{rank}(B) \leq k} \|A - B\|,$  (1)

where $\|\cdot\|$ represents either the two-norm or the Frobenius norm. The existence of such a matrix follows from the singular value decomposition (SVD) of $A$. Moreover, with no doubt, the truncated singular value decomposition is the main tool for solving the minimization problem given by (1). However, in the LSA, where document collections are dynamic over time, i.e., the term-document matrix is subject to repeated updates, the SVD becomes prohibitive due to its high computational expense. Thus, alternative decompositions have been proposed for these applications, such as the low-rank ULV/URV decompositions [7] and the truncated ULV decomposition (TULVD) [8]. Recall that the initial computing cost of the low-rank ULV/URV decompositions and the TULVD is lower than that of the SVD [7].
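To make (1) concrete, the following MATLAB-style sketch (illustrative only; the random matrix is merely a stand-in for a term-document matrix, and the sizes and rank are arbitrary) forms the rank-$k$ approximation from the $k$ leading singular triplets and checks that, as the Eckart–Young theorem asserts, the two-norm error equals the $(k+1)$st singular value.

% illustrative sketch only: rank-k approximation of a random stand-in matrix
m = 200; n = 120; k = 10;                 % arbitrary sizes and target rank
A = randn(m, n);
[W, Sigma, Y] = svd(A, 'econ');           % economy-size SVD
A_k = W(:, 1:k) * Sigma(1:k, 1:k) * Y(:, 1:k)';   % rank-k approximation, cf. equation (5)
sv  = diag(Sigma);
% by the Eckart-Young theorem, the two-norm error equals the (k+1)st singular value
fprintf('||A - A_k||_2 = %.4e, sigma_(k+1) = %.4e\n', norm(A - A_k, 2), sv(k+1));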

This manuscript demonstrates that the TULVD is a good substitute for the SVD in LSA modeling.

The rest of the manuscript is organized as follows. In Section 2, we introduce some notations and cover critical background materials in numerical linear algebra. In Section 3, we give the main steps of our LSA modeling. Then in Section 4, we test our model using some commonly used test collections and present some simulation results. In Section 5, we comment on simulation results.

2. Notations and Background

2.1. Notations

Throughout the paper, uppercase letters such as $A$ denote matrices. The identity matrix is denoted by $I$. Moreover, $\|\cdot\|_2$ denotes the spectral norm, and $\|\cdot\|_F$ denotes the Frobenius norm. The notation $\mathbb{R}^{m \times n}$ represents the set of $m \times n$ real matrices. An $m \times n$ dimensional matrix $A$ is represented as $A = [a_{ij}]$, where $a_{ij}$ is the entry of $A$ at row $i$ and column $j$ with $1 \leq i \leq m$ and $1 \leq j \leq n$.

2.2. Orthogonal Matrix Decompositions

Definition 1 (the singular value decomposition). For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, the singular value decomposition (SVD) is

$A = W \Sigma Y^{T},$  (2)

where the left and right singular matrices $W \in \mathbb{R}^{m \times m}$ and $Y \in \mathbb{R}^{n \times n}$ are orthogonal matrices and where $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n) \in \mathbb{R}^{m \times n}$ is a diagonal matrix with the following order:

$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n \geq 0.$  (3)

The diagonal entries of $\Sigma$ are called the singular values of $A$.
For a given positive integer $k < n$, we block partition the SVD in (2) as

$A = \begin{bmatrix} W_k & W_0 \end{bmatrix} \begin{bmatrix} \Sigma_k & 0 \\ 0 & \Sigma_0 \end{bmatrix} \begin{bmatrix} Y_k & Y_0 \end{bmatrix}^{T},$  (4)

where $\Sigma_k$ and $\Sigma_0$ are diagonal matrices containing the $k$ largest and the $n-k$ smallest singular values of $A$, respectively (Figure 1). The matrix $A_k$ defined by

$A_k = W_k \Sigma_k Y_k^{T}$  (5)

is called the rank-$k$ matrix approximation to $A$. For some tolerance $\epsilon$ proportional to the machine unit, if the singular values satisfy

$\sigma_k > \epsilon \geq \sigma_{k+1},$  (6)

then the value $k$ is called the numerical rank of the matrix $A$. However, we are aware that the determination of the numerical rank is a sensitive computation, especially when there is no well-defined gap between the singular values [9, 10]. Moreover, in some situations, like the example in Section 5.4.1 of [10], the tolerance is chosen slightly larger. The time complexity of obtaining $A_k$ via the SVD is $O(mn^2)$ [7].
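A minimal sketch of how the numerical rank in (6) can be estimated in practice is given below; the tolerance shown is MATLAB's usual default for rank and is only one possible choice proportional to the machine unit.

% sketch: numerical rank of a given matrix A from its singular values
sv  = svd(A);                        % singular values in nonincreasing order
tol = max(size(A)) * eps(sv(1));     % one common tolerance proportional to the machine unit
k   = sum(sv > tol);                 % numerical rank, cf. condition (6)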
Even though the SVD provides accurate subspaces, as mentioned above, it is not suitable for dynamic problems where the data change (i.e., updates and/or downdates), due to its high computational demand for both dense [11] and sparse [12] matrices. Herein, we consider the ULV-based TULVD for approximating the matrix subspaces.

Definition 2 (the truncated ULV decomposition). For a matrix $A \in \mathbb{R}^{m \times n}$ with numerical rank $k$, the TULVD is

$A = U L V^{T} + E,$  (7)

where $L \in \mathbb{R}^{k \times k}$ is a nonsingular lower triangular matrix, $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ are left orthogonal matrices (i.e., $U^{T}U = V^{T}V = I_k$), and $E \in \mathbb{R}^{m \times n}$ is an error matrix.
The matrices $L$ and $E$ satisfy

$\sigma_{\min}(L) = O(\sigma_k(A)),$  (8)

$\|E\|_F = O(\sigma_{k+1}(A)).$  (9)

To meet these conditions, instead of minimizing $\|E\|_F$, the algorithm keeps it as small as feasible and always enforces the bound (9) on $\|E\|_F$, thereby biasing the algorithm to favor a small approximation error over ideal conditioning of $L$.
The TULVD provides both the rank and good approximate subspaces of the matrix [8, 13] and differs from the ULV decompositions in [14, 15] in two significant respects:
(1) The matrix $E$ is not stored; instead, $\|E\|_F$ is maintained. However, by equations (10) and (11), we are able to compute the projections involving $E$ using $A$ and $U$; in addition, computational tools for computing them are provided in Section 2 of [8].
(2) When $A$ is either sparse or structured, the matrix-vector multiplications $Ax$ or $A^{T}y$ require fewer than $O(mn)$ operations. For example, when $A$ is sparse, the computational complexity of the matrix-vector product is $O(\operatorname{nnz}(A))$, where $\operatorname{nnz}(A)$ is the number of nonzero entries of $A$.

Proposition 1. Let $A = U L V^{T} + E$ be a TULVD of the matrix $A \in \mathbb{R}^{m \times n}$ with rank $k$. Then

$E = (I_m - U U^{\dagger}) A,$  (10)

where $U^{\dagger}$ is the pseudoinverse of $U$.

Proof. See [8].
Then, it follows that

$U L V^{T} = U U^{\dagger} A.$  (11)

The computation of the TULVD of the matrix $A$ is dominated by the average work required to estimate the principal singular triplets of $A$, which reduces to matrix-vector products with $A$ and is therefore inexpensive when $A$ is sparse or structured [7].

3. LSA Modeling

The LSA relies on some existing latent structure in word usage in the corpus. It uses statistically derived conceptual indices instead of individual words for retrieval. Thus, it overcomes the problems of synonymy and polysemy in lexical matching retrieval methods [16]. Note that the LSA is an unsupervised learning method.

The main steps of the LSA modeling are outlined in Algorithm 1. The LSA modeling algorithm takes as input a corpus of two or more monolingual textual documents. The documents may be of different types, such as medical, educational, computer science, and social science texts. Moreover, the corpus may be in any language.

% input:
 % monolingual textual corpus
% output:
 % graph of terms and documents
% the reduced dimension k, the weighting scheme weight, and the user query q
% are provided as inputs
% read the corpus, parse it, and execute the morphological step,
% weight the terms and obtain the term-document matrix
doc_file = 'files/document.txt';
stopword_file = 'files/stopword_en.txt';
A = doc_term_mat(doc_file, stopword_file, weight);
% apply the SVD
[W, Sigma, Y] = svd(A);
% obtain the components of A_k by the SVD
W_k = W(:, 1:k);
Sigma_k = Sigma(1:k, 1:k);
Y_k = Y(:, 1:k);
% apply the TULVD
[U, L, V] = TULV(A);
% obtain the components of A_k by the TULVD
U_k = U; L_k = L; V_k = V;
% find term and document vectors in k-space
term_vec = U_k * L_k;
doc_vec = L_k * transpose(V_k);
% represent the query q in k-space
query_vec = query(q, U_k, L_k);
% find semantic relationships in the corpus using cosine similarity
semantic_sim = cosine_sim(query_vec, doc_vec);

In the LSA modeling, a vector space representation of the document collection, the so-called “semantic space,” is typically computed, and then the inner product or cosine between the user query vector and/or the document vectors is used as a measure of similarity between the documents. We note that the similarity estimates derived by the LSA are not simple contiguity frequencies, co-occurrence counts, or correlations in usage but depend on a powerful mathematical/statistical analysis that is capable of correctly inferring much deeper relations [17].

In the following, the main steps of LSA modeling in Algorithm 1 are explained.

In the step of obtaining the term-document matrix, each document in the corpus is first cleaned of its syntactical and grammatical structure in order to improve the LSA performance both effectively and efficiently. In the term-document matrix, each row stands for a unique word and each column stands for a document in the corpus. Each entry expresses both the word’s importance in the particular document and the degree to which the word carries information in the corpus in general. Mathematically, the value of the entry at the $i$th row and $j$th column of the term-document matrix $A$ is given by

$a_{ij} = l_{ij} \times g_i,$  (12)

where $l_{ij}$ represents the local weight of word $i$ in document $j$ and $g_i$ represents the global weight of word $i$. There are different local and global weighting methods defined in the literature; in this study, term frequency (tf) for local weighting and inverse document frequency (idf) for global weighting are used to calculate the elements of the term-document matrix $A$. Other weighting schemes [18, 19] can be applied to increase/decrease the importance of terms within and/or among documents. Here we have to note that equation (12) does not take the word order into account.

The main step in LSA modeling is to obtain the low-rank approximation of the term-document matrix $A$. It is computed via both the SVD and the truncated ULV decomposition. In the former case, the rank-$k$ low-dimensional approximation is the matrix $A_k$ given in equation (5), whereas in the latter case, it is derived from equation (7). To be more precise, the singular values of $L$ are close to the $k$ largest singular values of $A$, while the columns of the matrices $U$ and $V$ are the corresponding approximate left and right singular vectors of $A$, respectively [8]. Recall that the LSA attempts to retrieve a small number of concepts that are important for representing the corpus. The matrix $L$ captures these concepts as well as $\Sigma_k$ does, especially when $L$ is diagonally dominant [20]. Moreover, in LSA applications, the number of important concepts is much smaller than both the number of terms and the number of documents, i.e., $k \ll \min(m, n)$.
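As a quick sanity check of this claim, the singular values of $L$ can be compared with the $k$ largest singular values of $A$; the sketch below assumes that the TULV routine named in Algorithm 1 returns the factors of equation (7).

% sketch: the singular values of L should track the k largest singular values of A
[U, L, V] = TULV(A);                 % TULVD factors as in Algorithm 1 and equation (7)
sv_L = svd(L);                       % singular values of the small k-by-k triangular factor
sv_A = svds(A, size(L, 1));          % k largest singular values of A
disp([sv_L, sv_A]);                  % the two columns should be close to each other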

Note that, in the reduced $k$-dimensional semantic space, the rows of the matrices $W_k \Sigma_k$ and $U_k L_k$ represent the terms, while the columns of the matrices $\Sigma_k Y_k^{T}$ and $L_k V_k^{T}$ represent the documents in the corpus.

3.1. Query

A query consists of words; it is considered as a document and is represented in the vector space. In other words, the query composed of the words entered by the user is translated into an $m$-dimensional vector $q$ by using the same weighting process used to construct the term-document matrix. Then, the query vector $q$ is represented in the SVD-based vector space given by equation (5) as

$\hat{q} = \Sigma_k^{-1} W_k^{T} q,$  (13)

and in the TULVD-based vector space given by equation (7) as

$\hat{q} = L_k^{-1} U_k^{T} q.$  (14)
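A small sketch of this folding-in step is given below, using the factors computed in Algorithm 1; it assumes that the query helper of Algorithm 1 is equivalent to equations (13) and (14), and that q is the m-by-1 weighted query vector.

% sketch of query folding-in, using the factors computed in Algorithm 1
% q is the m-by-1 query vector weighted like a column of A
q_svd  = Sigma_k \ (W_k' * q);   % SVD-based k-space representation, cf. equation (13)
q_tulv = L_k \ (U_k' * q);       % TULVD-based k-space representation, cf. equation (14)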

By representing the query vector in the corresponding vector space, all document vectors existing in the vector space can be compared with the query vector and sorted by the similarity rank.

The choice of the similarity measure is very important for the classification of documents and the performance of information retrieval [21]. In the literature, different similarity measures are defined, such as the Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient, and averaged Kullback–Leibler divergence [22]. In this study, we prefer cosine similarity.
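A minimal sketch of what the cosine_sim step of Algorithm 1 may compute is shown below, assuming query_vec is a k-by-1 vector in the semantic space and doc_vec stores one document per column.

% sketch of the cosine-similarity step: rank documents against the query
% query_vec is k-by-1, doc_vec is k-by-n with one document per column
doc_norms = sqrt(sum(doc_vec.^2, 1));                 % 1-by-n column norms
sims = (query_vec' * doc_vec) ./ (norm(query_vec) * doc_norms);
[sorted_sims, order] = sort(sims, 'descend');         % documents ranked by similarity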

4. Application

To test our LSA model, we make use of three commonly used information retrieval test collections. These collections are the American Documentation Institute Reports (ADI), a collection of articles published in Time magazine (TIME), and a collection of MEDLINE articles (MED). Each of these test collections contains a set of short articles, a set of queries, and a list indicating which documents are relevant to which queries. The performance of the model is evaluated on the basis of this list.

For each collection, stopwords in the documents are cleaned and stemming is applied before the term-document matrix of the collection is created. Table 1 presents some statistics before and after the preprocessing along with the number of queries.

Thus, the term-document matrix $A$ has the size given in Table 1 for each of the ADI, TIME, and MED collections, and its entries are obtained by using equation (12). The local weight of the $i$th word in the $j$th document is its term frequency, and the global weight of the $i$th word is its inverse document frequency, that is,

$l_{ij} = \mathrm{tf}_{ij},$  (15)

where $\mathrm{tf}_{ij}$ is the frequency of word $i$ within document $j$, and

$g_i = \mathrm{idf}_i = \log\left(\frac{n}{\mathrm{df}_i}\right),$  (16)

where $\mathrm{df}_i$ is the number of documents in which word $i$ appears at least once.
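A small sketch of this weighting is given below; it assumes that a raw term-frequency count matrix F (m terms by n documents) has already been produced by the parsing step, and it uses the logarithmic idf variant stated in (16).

% sketch of the tf-idf weighting in equation (12), starting from a raw
% term-frequency count matrix F (m terms by n documents)
[m, n] = size(F);
df  = sum(F > 0, 2);                  % document frequency of each word
idf = log(n ./ max(df, 1));           % global weights g_i, cf. equation (16)
A   = sparse(F .* repmat(idf, 1, n)); % a_ij = tf_ij * idf_i, cf. equation (12)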

The term-document matrix for the MED collection is given in Table 2. Note that as mentioned above, this matrix is a sparse matrix.

The distribution of terms in the data set using the SVD is given in Figure 2 and using the TULVD in Figure 3. Similarly, the distribution of documents using the SVD is given in Figure 4 and using the TULVD in Figure 5.

When the distributions of terms and documents for the two algorithms are visually examined, it is seen that those produced by the TULVD are spread over a wider area. However, when analyzed from an angular perspective, the distributions given by the two algorithms, while not identical, are similar.

Now, we quantitatively compare the performance of the SVD-based and the TULVD-based information retrieval. The standard metrics are “recall” and “precision.” Before recalling their mathematical definitions, we define some sets. We let $A$ be the set of documents returned as a result of the query and relevant to the query. Moreover, let $B$ represent all the documents returned in the query result, and let $C$ represent all the documents related to the query in the corpus. Then, the recall is defined as

$\mathrm{recall} = \frac{|A|}{|C|},$  (17)

and the precision is defined as

$\mathrm{precision} = \frac{|A|}{|B|}.$  (18)
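A sketch of how these metrics can be computed for a single query is given below; it assumes that retrieved holds the indices of the returned documents (the set B) and relevant holds the indices given in the collection's relevance list (the set C).

% sketch: recall and precision for a single query
% retrieved: indices of the documents returned for the query (the set B)
% relevant : indices of the relevant documents from the collection's list (the set C)
hits      = intersect(retrieved, relevant);    % returned and relevant documents (the set A)
recall    = numel(hits) / numel(relevant);     % equation (17)
precision = numel(hits) / numel(retrieved);    % equation (18)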

A detailed explanation of these quantitative metrics can be found in [23–25].

Tables 3, 4, and 5 show the results of the SVD and TULVD methods for the ADI, the MED, and the TIME collections, respectively, for different values of k. In the tables, instead of taking all the returned documents after the query, we use only the top 10% and 50% to compute the quantitative metrics. Precision shows the average success of the indexed documents over all queries in these slices. In addition, the Min Cosine Similarity Value shows the average minimum cosine similarity value of the documents listed in the query results.

Table 3 shows the performance on the ADI collection as a function of the rank k. For small values of k, the performance of both methods is poor. The indexing accuracy increases until k is between 50 and 60; however, when k is greater than 70, the indexing accuracy begins to decrease. Similarly, Table 4 shows the performance on the MED collection: the performance of both methods is poor for small k, increases for k between 20 and 150, and decreases thereafter. Finally, the performance analysis of the TIME collection is given in Table 5. The performance of both methods is poor for small k, increases when k is between 100 and 200, and decreases when k is greater than 200.

Tables 6, 7, and 8 show the success of the SVD and TULVD methods applied to the ADI, the MED, and the TIME collections, respectively, according to the similarity threshold value. The results in the tables are obtained by averaging the achievement of the listed documents over all the test queries in the collections. All of the returned documents are taken into account while calculating the results.

In these tables, performance values corresponding to a given cosine threshold value are shown for both methods. As the cosine threshold increases, in general, the recall decreases but the precision increases. Conversely, decreasing the cosine threshold value increases the number of documents returned as a query result; however, the proportion of relevant documents among them decreases.
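The effect of the cosine threshold can be reproduced with a small sketch that reuses sims from the cosine-similarity sketch above and relevant from the recall/precision sketch; the threshold value below is purely illustrative.

% sketch: keep only the documents whose cosine similarity reaches a threshold
cos_threshold = 0.5;                              % illustrative value only
retrieved = find(sims >= cos_threshold);          % indices of documents above the threshold
hits      = intersect(retrieved, relevant);
recall    = numel(hits) / numel(relevant);
precision = numel(hits) / max(numel(retrieved), 1);   % guard against an empty result set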

The coordinates of the documents in the vector space obtained in the LSA process are used to list those that are similar to the queries. Figure 6 shows the indexing success of the documents in the vector spaces created using the SVD and the TULVD for the ADI, the MED, and the TIME collections. All documents returned for each query are listed in decreasing order of similarity, and the accuracy of these documents is calculated in percentile slices. Precision decreases as the percentile slice rate increases, but the recall of related documents increases. As with the other results, Figure 6 also reflects the average of the performance metrics of the documents listed over all of the queries in each collection.

Figure 7 compares the SVD and TULVD methods on the three test collections in terms of the average minimum similarity value of the documents listed at the end of the indexing process for different rank values. The results for the MED and the TIME collections are similar. In the ADI collection, the similarity change rate is almost the same, although it takes different values as the rank increases. This difference can be attributed to the fact that the numbers of documents in the MED and the TIME collections are much higher than in the ADI collection.

Figure 7 shows the change of the minimum cosine similarity value for the ADI, the MED, and the TIME collections with respect to k. Note that, as k increases, the rate of change of the minimum cosine similarity between successive steps decreases differently for the three collections. This makes it difficult to determine the similarity threshold used in document indexing and prevents successful indexing. For this reason, it is recommended to choose a value of k for which the rate of change is high and the document indexing success is good, in order to increase the performance of the process and to retrieve the correct documents. In this case, k should be 50 for the ADI collection, 150 for the MED collection, and 150 for the TIME collection.

On the other hand, Figures 8, 9, and 10 illustrate the indexing success of the semantic vector space generated by both methods for the ADI, the MED, and the TIME collections, respectively, for different values of k.

5. Conclusion

According to the visual inspection of the simulations presented in Figures 2, 3, 4, and 5, as well as the quantitative measurements of recall and precision given in Tables 3, 4, and 5, the TULVD can be a good substitute for the SVD for finding the rank-k approximation in LSA modeling. In addition, when we examine Tables 6, 7, and 8, we observe that the TULVD-based LSA model produces results similar to those of the SVD-based LSA model for different cosine threshold values on these three collections. Moreover, when Figures 8, 9, and 10 are examined, the best k value for the SVD and TULVD applications is 50 for the ADI collection, 100 for the MED collection, and 200 for the TIME collection.

As a result, the TULVD is as good as the SVD for retrieving the semantic structure of textual documents in LSA modeling. The main advantage of the TULVD over the SVD is the efficient computation of the initial low-rank approximation, as well as the efficient computation of the low-rank approximation when adding a new document and/or term to the existing LSA-generated database, i.e., “updating.”

Based on our experience in this study, we think that the TULVD can be used as an alternative to the SVD in many areas where the SVD is used, such as data compression, missing data completion, image processing, sound processing, noisy data cleaning, and especially signal processing. In addition, this study can be extended to cover fields such as text summarization, text similarity, keyword extraction, author detection, and text classification.

Data Availability

The ADI, TIME, and MEDLINE collections, which are well-known datasets, used to support the findings of this study were obtained from the Glasgow Repository (http://ir.dcs.gla.ac.uk/resources/test_collections/).

Disclosure

Initial results of this study were presented at the 21st International Conference on Mathematical Modeling and Analysis.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by Kırıkkale University Scientific Research Projects (BAP) under project 2016/150.