Research Article | Open Access
Alternate Low-Rank Matrix Approximation in Latent Semantic Analysis
Latent semantic analysis (LSA) is a mathematical/statistical way of discovering hidden concepts between terms and documents or within a document collection (i.e., a large corpus of text). Each document and term of the corpus is expressed as a vector with elements corresponding to these concepts, forming a term-document matrix. The LSA then uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, to extract the more important relations, and to reduce the computational time. The irrelevant information is called “noise” and does not have a noteworthy effect on the meaning of the document collection. This is an essential step in the LSA. The singular value decomposition (SVD) has been the main tool for obtaining the low-rank approximation in the LSA. Since the document collection is dynamic (i.e., the term-document matrix is subject to repeated updates), the approximation must be renewed. This can be done by recomputing the SVD or by updating the SVD. However, the computational cost of recomputing or updating the SVD of the term-document matrix is very high when new terms and/or documents are added to a preexisting document collection. This issue has opened the door to using other matrix decompositions in the LSA, such as ULV- and URV-based decompositions. This study shows that the truncated ULV decomposition (TULVD) is a good alternative to the SVD in LSA modeling.
Latent semantic analysis (LSA) is a mathematical/statistical method for discovering the latent relationships that exist between terms and documents, or within a collection of documents (i.e., a large corpus of text). Although the LSA works especially well on textual data, it has recently become very popular in the academic community because of its wide variety of applications in content information, sociological discourse analysis, image retrieval systems, human cognition, and human learning. The LSA can be applied to any collection of documents that has been cleaned of syntactical and grammatical structure. If the collection of documents contains m terms and n documents, it is represented by a matrix A of dimension $m \times n$, called the term-document matrix.
The LSA uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, to extract the more important relations, and to reduce the computational time. The irrelevant information is called “noise” and does not have a noteworthy effect on the meaning of the document collection.
The low-rank approximation of the term-document matrix A is, for a positive constant $k$, the matrix $A_k$ that satisfies
$$A_k = \arg\min_{\operatorname{rank}(B) \le k} \|A - B\|, \tag{1}$$
where $\|\cdot\|$ represents either the two-norm or the Frobenius norm. The existence of such a matrix follows from the singular value decomposition (SVD) of A. Moreover, the truncated singular value decomposition is without doubt the main tool for solving the minimization problem given by (1). However, in the LSA, where document collections are dynamic over time, i.e., the term-document matrix is subject to repeated updates, the SVD becomes prohibitive due to its high computational expense. Thus, alternative decompositions have been proposed for these applications, such as the low-rank ULV/URV decompositions and the truncated ULV decomposition (TULVD). Recall that the initial computing cost of the low-rank ULV/URV decompositions and the TULVD is lower than that of the SVD.
The manuscript demonstrates that the TULVD is a good substitute for the SVD in the LSA modeling.
The rest of the manuscript is organized as follows. In Section 2, we introduce some notations and cover critical background materials in numerical linear algebra. In Section 3, we give the main steps of our LSA modeling. Then in Section 4, we test our model using some commonly used test collections and present some simulation results. In Section 5, we comment on simulation results.
2. Notations and Background
Throughout the paper, uppercase letters such as A denote matrices. The identity matrix is denoted by $I$. Moreover, $\|\cdot\|_2$ denotes the spectral norm and $\|\cdot\|_F$ denotes the Frobenius norm. The notation $\mathbb{R}^{m \times n}$ represents the set of $m \times n$ real matrices. An $m \times n$ dimensional matrix A is written $A = (a_{ij})$, where $a_{ij}$ is the entry of A in row i and column j with $1 \le i \le m$ and $1 \le j \le n$.
2.2. Orthogonal Matrix Decompositions
Definition 1 (the singular value decomposition). For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, the singular value decomposition (SVD) is
$$A = W \Sigma Y^{T}, \tag{2}$$
where the left and right singular matrices W and Y are orthogonal matrices and $\Sigma$ is a diagonal matrix whose entries obey the ordering
$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0. \tag{3}$$
The diagonal entries of $\Sigma$ are called the singular values of A.
For a given positive integer $k \le n$, we block partition the SVD in (2) as
$$A = \begin{pmatrix} W_k & W_0 \end{pmatrix} \begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_0 \end{pmatrix} \begin{pmatrix} Y_k & Y_0 \end{pmatrix}^{T}, \tag{4}$$
where $\Sigma_k$ and $\Sigma_0$ are diagonal matrices containing the $k$ largest and the $n-k$ smallest singular values of A, respectively (Figure 1). The matrix $A_k$ defined by
$$A_k = W_k \Sigma_k Y_k^{T} \tag{5}$$
is called the rank-$k$ matrix approximation to A. For some tolerance $\epsilon$ proportional to the machine unit, if the singular values satisfy
$$\sigma_k > \epsilon \ge \sigma_{k+1}, \tag{6}$$
then the value $k$ is called the numerical rank of the matrix A. However, we are aware that determining the numerical rank is a sensitive computation, especially when there is no well-defined gap between the singular values [9, 10]. Moreover, in some situations, like the example in Section 5.4.1 of , the tolerance is chosen slightly larger. The time complexity of obtaining $A_k$ is $O(mn^2)$.
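As an illustration, the rank-$k$ approximation $A_k$ of equation (5) can be computed directly from the SVD. The sketch below (using NumPy; the function name is illustrative) also checks the Eckart–Young property that the Frobenius error of the best rank-$k$ approximation equals the norm of the discarded singular values.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Rank-k approximation A_k = W_k Sigma_k Y_k^T obtained from the SVD of A."""
    # numpy returns the singular values in decreasing order
    W, s, Yt = np.linalg.svd(A, full_matrices=False)
    return (W[:, :k] * s[:k]) @ Yt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
A2 = rank_k_approximation(A, 2)

# Eckart-Young: the Frobenius error of the best rank-2 approximation equals
# the norm of the discarded singular values sigma_3, ..., sigma_n.
s = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A2, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[2:] ** 2)))
```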
Even though the SVD provides accurate subspaces, as mentioned above, it is not suitable for dynamic problems where the data changes (i.e., updates and/or downdates) due to its high computational demand for both dense and sparse matrices. Herein, we consider the ULV-based TULVD for approximating the matrix subspaces.
Definition 2 (the truncated ULV decomposition). For a matrix $A \in \mathbb{R}^{m \times n}$ with numerical rank $k$, the TULVD is
$$A = U L V^{T} + E, \tag{7}$$
where $L \in \mathbb{R}^{k \times k}$ is a nonsingular lower triangular matrix, $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ are left orthogonal matrices (i.e., $U^{T}U = V^{T}V = I_k$), and $E \in \mathbb{R}^{m \times n}$ is an error matrix.
The matrices L and E satisfy
$$\sigma_{\min}(L) \approx \sigma_k(A) \tag{8}$$
and
$$\|E\|_F \approx \sigma_{k+1}(A). \tag{9}$$
To meet these conditions, instead of minimizing the condition number of L, we keep it as small as feasible and always enforce the constraint on $\|E\|_F$, thereby biasing the algorithm to favor a small approximation error over ideal conditioning of L.
The TULVD provides both the rank and good approximate subspaces of the matrix [8, 13] and differs from the ULV decompositions in [14, 15] in two significant respects: (1) The matrix E is not stored; instead, $\|E\|_F$ is maintained. However, by the equations in (10) and (11), we are able to compute the projections involving E using A and U; in addition, computational tools for computing them are provided in Section 2 of . (2) When A is either sparse or structured, the matrix-vector multiplications $Ax$ or $A^{T}y$ require fewer than $O(mn)$ operations. For example, when A is sparse, the computational complexity of the matrix-vector product is $O(\mathrm{nnz}(A))$.
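To make the sparse case concrete, here is a minimal sketch (plain NumPy; the helper name and storage layout are illustrative) of a matrix-vector product for a matrix stored in compressed sparse row (CSR) form. The inner loop touches each stored nonzero exactly once, so the cost is proportional to $\mathrm{nnz}(A)$ rather than $mn$.

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for a sparse A stored in compressed sparse row (CSR) form.
    Each stored nonzero is visited exactly once, so the cost is O(nnz(A))
    instead of the O(mn) cost of a dense matrix-vector product."""
    m = len(indptr) - 1
    y = np.zeros(m)
    for i in range(m):
        for p in range(indptr[i], indptr[i + 1]):
            y[i] += data[p] * x[indices[p]]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]] stored in CSR form (3 nonzeros).
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 3])
y = csr_matvec(data, indices, indptr, np.array([1.0, 1.0, 1.0]))
# y is [3.0, 3.0]
```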
Proposition 1. Let $A = U L V^{T} + E$ be a TULVD of the matrix A with numerical rank k. Then
$$E = (I - U U^{\dagger}) A, \tag{10}$$
where $U^{\dagger}$ is the pseudoinverse of U.
Proof. See .
Then, it follows that
$$L V^{T} = U^{T} A. \tag{11}$$
The cost of computing the TULVD of the matrix A is dominated by the average work required to estimate the principal singular triplets of A .
3. LSA Modeling
The LSA relies on some existing latent structure in word usage in the corpus. It uses statistically derived conceptual indices instead of individual words for retrieval. Thus, it overcomes the problems of synonymy and polysemy in lexical matching retrieval methods . Note that the LSA is an unsupervised learning method.
The main steps of the LSA modeling are outlined in Algorithm 1. The LSA modeling algorithm takes as input a corpus of two or more monolingual textual documents. The documents may be of different types, such as medical, educational, computer science, and social science. Moreover, the corpus may be in any language.
In the LSA modeling, a vector space, so-called “semantic space,” representation of the document collection is typically computed and then the inner product or cosine between the user query vector and/or document vectors is used as a measure of similarity between the documents. We note that the similarity estimates derived by the LSA are not simple contiguity frequencies, co-occurrence counts, or correlations in usage but depend on a powerful mathematical/statistical analysis that is capable of correctly inferring much deeper relations .
In the following, the main steps of LSA modeling in Algorithm 1 are explained.
In the step of obtaining the term-document matrix, each document in the corpus is first cleaned of syntactical and grammatical structure in order to improve the effectiveness and efficiency of the LSA. In the term-document matrix, each row stands for a unique word and each column stands for a document in the corpus. Each entry expresses both the word’s importance in the particular document and the degree to which the word carries information in the corpus in general. Mathematically, the value of the entry at the ith row and jth column of the term-document matrix A is given by
$$a_{ij} = l_{ij} \times g_i, \tag{12}$$
where $l_{ij}$ represents the local weight of word i in document j and $g_i$ represents the global weight of word i. Different local and global weighting methods are defined in the literature; in this study, term frequency (tf) is used for local weighting and inverse document frequency (idf) for global weighting to calculate the entries of the term-document matrix A. Other weighting schemes [18, 19] can be applied to increase/decrease the importance of terms within and/or among documents. Note that equation (12) does not take word order into account.
The main step in LSA modeling is obtaining the low-rank approximation of the term-document matrix A. Here it is computed by both the SVD and the truncated ULV decomposition. In the former case, the rank-k low-dimensional approximation is the matrix $A_k$ given in equation (5), whereas in the latter case it is derived from equation (7). To be more precise, the singular values of L are close to the k largest singular values of A, while the columns of the matrices U and V are the corresponding approximate left and right singular vectors of A, respectively. Recall that the LSA attempts to retrieve a small number of concepts that are important for representing the corpus. The matrix L captures these concepts as well as $\Sigma_k$ does, especially when L is diagonally dominant. Note that in LSA applications, the number of important concepts is much smaller than both the number of terms and the number of documents, i.e., $k \ll \min(m, n)$.
Note that in the reduced k-dimensional semantic space, the rows of the matrix $W_k \Sigma_k$ and of $U L$ represent the terms, while the columns of the matrix $\Sigma_k Y_k^{T}$ and of $L V^{T}$ represent the documents in the corpus.
A query consists of words and is treated as a document represented in the vector space. In other words, the query composed of the words entered by the user is translated into an m-dimensional vector q by using the same weighting process used to construct the term-document matrix. Then, the query vector q is represented in the SVD-based vector space given by equation (5) as
$$\hat{q} = \Sigma_k^{-1} W_k^{T} q, \tag{13}$$
and in the TULVD-based vector space given by equation (7) as
$$\hat{q} = L^{-1} U^{T} q. \tag{14}$$
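A minimal sketch of the two query projections, assuming NumPy and the standard folding-in formulas $\hat{q} = \Sigma_k^{-1} W_k^{T} q$ and $\hat{q} = L^{-1} U^{T} q$; the function names are illustrative.

```python
import numpy as np

def fold_in_query_svd(q, Wk, sk):
    """q_hat = Sigma_k^{-1} W_k^T q (sk holds the k largest singular values)."""
    return (Wk.T @ q) / sk

def fold_in_query_tulv(q, U, L):
    """q_hat = L^{-1} U^T q; L is lower triangular, so we solve a small
    k x k system instead of forming the inverse explicitly."""
    return np.linalg.solve(L, U.T @ q)

# Sanity check: folding in column j of A recovers that document's coordinates.
rng = np.random.default_rng(1)
A = rng.random((6, 4))
W, s, Yt = np.linalg.svd(A, full_matrices=False)
q_hat = fold_in_query_svd(A[:, 1], W, s)
assert np.allclose(q_hat, Yt[:, 1])

# TULVD-style projection with a left orthogonal U and lower triangular L.
U, _ = np.linalg.qr(rng.random((6, 3)))
L = np.tril(rng.random((3, 3))) + np.eye(3)
q = rng.random(6)
assert np.allclose(L @ fold_in_query_tulv(q, U, L), U.T @ q)
```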
By representing the query vector in the corresponding vector space, all document vectors existing in the vector space can be compared with the query vector and sorted by the similarity rank.
The choice of the similarity measure is very important for the classification of documents and the performance of information retrieval. In the literature, different similarity measures are defined, such as Euclidean distance, cosine similarity, the Jaccard coefficient, the Pearson correlation coefficient, and averaged Kullback–Leibler divergence. In this study, we prefer cosine similarity.
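For completeness, a small sketch of cosine-similarity ranking between a projected query and the document vectors (NumPy assumed; names are illustrative):

```python
import numpy as np

def cosine_similarities(q_hat, doc_vectors):
    """Cosine similarity between a projected query and each document vector
    (documents stored as rows of doc_vectors)."""
    q = q_hat / np.linalg.norm(q_hat)
    D = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return D @ q

docs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([1.0, 0.2])
sims = cosine_similarities(query, docs)
ranking = np.argsort(-sims)  # documents sorted by decreasing similarity
```

Documents closer in direction to the query rank first, regardless of vector length, which is why cosine similarity is preferred over the raw inner product here.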
To test our LSA model, we make use of three commonly used information retrieval test collections: American Documentation Institute reports (ADI), a collection of articles published in Time magazine (TIME), and a collection of MEDLINE articles (MED). Each of these test collections contains a set of short articles, a set of queries, and a list indicating which documents are relevant to which queries. The performance of the model is evaluated on the basis of this list.
For each collection, stopwords in the documents are cleaned and stemming is applied before the term-document matrix of the collection is created. Table 1 presents some statistics before and after the preprocessing along with the number of queries.
Thus, for each collection, the term-document matrix A has one row for every term and one column for every document remaining after preprocessing (Table 1), and its entries are obtained by using equation (12). The local weight of the ith word in the jth document is obtained by tf and the global weight of the ith word by idf, that is,
$$l_{ij} = \mathrm{tf}_{ij} = f_{ij},$$
where $f_{ij}$ is the frequency of word i within document j, and
$$g_i = \mathrm{idf}_i = \log\left(\frac{n}{n_i}\right),$$
where $n_i$ is the number of documents in which word i appears at least once.
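The tf–idf weighting above can be sketched as follows. Note that the exact idf convention, $\log(n/n_i)$, is one common choice and an assumption of this sketch, as is the whitespace tokenization; NumPy is assumed.

```python
import numpy as np

def tfidf_matrix(docs, vocabulary):
    """m x n term-document matrix with a_ij = tf_ij * idf_i, where
    tf_ij is the raw count of term i in document j and
    idf_i = log(n / n_i), n_i being the number of documents containing term i."""
    m, n = len(vocabulary), len(docs)
    index = {term: i for i, term in enumerate(vocabulary)}
    A = np.zeros((m, n))
    for j, doc in enumerate(docs):
        for token in doc.split():
            if token in index:
                A[index[token], j] += 1.0  # local weight: raw term frequency
    n_i = np.count_nonzero(A, axis=1)      # document frequency of each term
    idf = np.log(n / np.maximum(n_i, 1))   # guard against terms never seen
    return A * idf[:, None]

docs = ["svd low rank svd", "rank approximation", "latent semantic analysis"]
vocab = ["svd", "rank", "latent"]
A = tfidf_matrix(docs, vocab)
```

A term that occurs in every document gets idf zero, which is exactly the intended behavior: such a term carries no discriminating information in the corpus.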
The term-document matrix for the MED collection is given in Table 2. Note that as mentioned above, this matrix is a sparse matrix.
The distribution of terms in the data set using the SVD is given in Figure 2 and using the TULVD in Figure 3. Correspondingly, the distribution of documents using the SVD is given in Figure 4 and using the TULVD in Figure 5.
When the distributions of terms and documents for the two algorithms are examined visually, those produced by the TULVD are spread over a wider area. From an angular perspective, however, the distributions of the two algorithms, while not identical, are similar.
Now we quantitatively compare the performance of SVD-based and TULVD-based information retrieval. The standard metrics are “recall” and “precision.” Before recalling their mathematical definitions, we define some sets. Let A be the set of documents that are returned as a result of the query and are relevant to the query, let B be the set of all documents returned in the query result, and let C be the set of all documents in the corpus that are relevant to the query. Then, the recall is defined as
$$\mathrm{recall} = \frac{|A|}{|C|},$$
and the precision is defined as
$$\mathrm{precision} = \frac{|A|}{|B|}.$$
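These two metrics can be sketched directly from the set definitions (the function name is illustrative):

```python
def recall_precision(returned, relevant):
    """Recall = |A| / |C| and precision = |A| / |B|, where A is the set of
    returned documents that are relevant, B the returned set, and C the
    relevant set, matching the definitions above."""
    B, C = set(returned), set(relevant)
    A = B & C
    recall = len(A) / len(C) if C else 0.0
    precision = len(A) / len(B) if B else 0.0
    return recall, precision

# Two of the three relevant documents were retrieved, and half of the
# four retrieved documents are relevant.
r, p = recall_precision(returned=[1, 2, 3, 4], relevant=[2, 4, 5])
```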
Tables 3, 4, and 5 show the results of the SVD and TULVD methods for the ADI, the MED, and the TIME collections, respectively, according to different k values. In the tables, instead of taking all the returned documents after the query, we only use 10% and 50% to obtain quantitative metrics. Precision shows the average success of indexed documents for all queries in these slices. In addition, Min Cosine Similarity Value shows the average minimum cosine similarity value of the documents listed in the query result.
Table 3 shows the performance on the ADI collection as a function of the rank k. For small k, the performance of both methods is poor. Indexing accuracy increases until k is between 50 and 60; however, when k is greater than 70, the indexing accuracy begins to decrease. Similarly, Table 4 shows the performance on the MED collection: the performance of both methods is poor for small k, increases for k between 20 and 150, and decreases thereafter. Finally, the performance analysis for the TIME collection is given in Table 5. The performance of both methods is poor for small k, increases when k is between 100 and 200, and decreases when k is greater than 200.
Tables 6, 7, and 8 show the success of the SVD and TULVD methods applied to the ADI, the MED, and the TIME collections, respectively, as a function of the similarity threshold value. The results in the tables are obtained by averaging the performance over all the test queries in each collection. All of the returned documents are taken into account when calculating the results.
In these tables, the performance values corresponding to certain cosine threshold values are shown for both methods. As the cosine threshold increases, in general, the recall decreases but the precision increases. Conversely, decreasing the cosine threshold increases the number of documents returned as a query result; however, the proportion of relevant documents among them decreases.
The coordinates of the documents in the vector space obtained in the LSA process are used to list those similar to the queries. Figure 6 shows the indexing success of documents in the vector spaces created using the SVD and the TULVD for the ADI, the MED, and the TIME collections. All documents returned for each query are listed in decreasing order of similarity, and their accuracy is calculated by percentile. Precision decreases as the percentile slice grows, but recall of the related documents increases. As with the other results, Figure 6 reflects the average of the performance metrics over all of the queries in each collection.
Figure 7 compares the SVD and TULVD methods for the three collections used in the testing process, using the average minimum similarity values of the documents listed at the end of the indexing process for different rank values. The results for the MED and the TIME collections appear similar. In the ADI collection, the rate of change of similarity is almost the same, although its values differ with increasing rank. This difference can be attributed to the fact that the MED and the TIME collections contain far more documents than the ADI collection.
Figure 7 shows the change in the minimum cosine similarity value for the ADI, the MED, and the TIME collections with respect to k. Note that as k increases, the rate of change of the minimum cosine similarity between successive steps decreases differently for the three collections. This makes it difficult to determine the similarity threshold used in document indexing and hinders successful indexing. For this reason, to increase the performance of the process and retrieve the correct documents, it is recommended to choose values of k for which the rate of change is high and the document indexing success is good. In this case, k should be 50 for the ADI collection, 150 for the MED collection, and 150 for the TIME collection.
In addition, Figures 8, 9, and 10 illustrate the indexing success of the semantic vector spaces generated by both methods for the ADI, the MED, and the TIME collections, respectively, for different k values.
According to the visual observation of the simulations presented in Figures 2, 3, 4, and 5, as well as the quantitative recall and precision measurements given in Tables 3, 4, and 5, the TULVD can be a good substitute for the SVD for finding the rank-k approximation in LSA modeling. In further support of this claim, Tables 6, 7, and 8 show that the TULVD-based LSA model produces results similar to the SVD-based LSA model for different cosine threshold values in these three collections. Moreover, when Figures 8, 9, and 10 are examined, the best k value for the SVD and TULVD is 50 for the ADI collection, 100 for the MED collection, and 200 for the TIME collection.
In conclusion, the TULVD is as good as the SVD for retrieving the semantic structure of textual documents in LSA modeling. The main advantage of the TULVD over the SVD is the efficient computation of the initial low-rank approximation, as well as the efficient recomputation of the low-rank approximation when a new document and/or term is added to the existing LSA-generated database, i.e., “updating.”
Based on our experience in this study, we believe the TULVD can be used as an alternative in many areas where the SVD is applied, such as data compression, missing-data completion, image processing, sound processing, noisy-data cleaning, and especially signal processing. In addition, this study can be extended to cover fields such as text summarization, text similarity, keyword extraction, author detection, and text classification.
The ADI, TIME, and MEDLINE collections, which are well-known datasets, used to support the findings of this study were obtained from the Glasgow Repository (http://ir.dcs.gla.ac.uk/resources/test_collections/).
Initial results of this study were presented at the 21st International Conference on Mathematical Modelling and Analysis.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This study was supported with project 2016/150 by Kırıkkale University Scientific Research Projects (BAP).
- S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
- N. Seco, T. Veale, and J. Hayes, “An intrinsic information content metric for semantic similarity in wordnet,” in Proceedings of the 16th European Conference on Artificial Intelligence, pp. 1089-1090, IOS Press, Valencia, Spain, August 2004.
- J. R. Ruiz, “Sociological discourse analysis: methods and logic,” Forum Qualitative Sozialforschung/Forum: Qualitative Social Research, vol. 10, no. 2, pp. 1–22, 2009.
- M. Hanselman, M. Kirchner, B. Renard et al., “Concise representation of mass spectrometry images by probabilistic latent semantic analysis,” Analytical Chemistry, vol. 80, no. 24, pp. 9649–9658, 2008.
- T. K. Landauer, P. W. Foltz, and D. Laham, “An introduction to latent semantic analysis,” Discourse Processes, vol. 25, no. 2-3, pp. 259–284, 1998.
- W. Song, J. Z. Liang, X. L. He, and P. Chen, “Taking advantage of improved resource allocating network and latent semantic feature selection approach for automated text categorization,” Applied Soft Computing, vol. 21, pp. 210–220, 2014.
- R. D. Fierro and P. C. Hansen, “Low-rank revealing UTV decompositions,” Numerical Algorithms, vol. 15, no. 1, pp. 37–55, 1997.
- J. L. Barlow and H. Erbay, “Modifiable low-rank approximation to a matrix,” Numerical Linear Algebra with Applications, vol. 16, no. 10, pp. 833–860, 2009.
- D. Watkins, Fundamentals of Matrix Computations, John Wiley and Sons, Hoboken, NJ, USA, 2002.
- G. Golub and C. V. Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, MD, USA, 2013.
- J. R. Bunch and C. P. Nielsen, “Updating the singular value decomposition,” Numerische Mathematik, vol. 31, no. 2, pp. 111–129, 1978.
- M. W. Berry, S. T. Dumais, and G. W. O’Brien, “The computational complexity of alternative updating approaches for an svd-encoded indexing scheme,” in Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pp. 39–44, San Francisco, CA, USA, February 1995.
- H. Erbay, J. L. Barlow, and Z. Zhang, “A modified Gram-Schmidt-based downdating technique for ULV decompositions with applications to recursive TLS problems,” Computational Statistics & Data Analysis, vol. 41, no. 1, pp. 195–209, 2002.
- G. W. Stewart, “An updating algorithm for subspace tracking,” IEEE Transactions on Signal Processing, vol. 40, no. 6, pp. 1535–1541, 1992.
- J. L. Barlow, “Modification and maintenance of ULV decompositions,” in Applied Mathematics and Scientific Computing, pp. 31–62, Springer, Berlin, Germany, 2002.
- J. E. Tougas and R. J. Spiteri, “Updating the partial singular value decomposition in latent semantic indexing,” Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 174–183, 2007.
- T. A. Letsche and M. W. Berry, “Large-scale information retrieval with latent semantic indexing,” Information Sciences, vol. 100, no. 1-4, pp. 105–137, 1997.
- S. T. Dumais, “Improving the retrieval of information from external sources,” Behavior Research Methods, Instruments, & Computers, vol. 23, no. 2, pp. 229–236, 1991.
- M. G. Ozsoy, I. Cicekli, and F. N. Alpaslan, “Text summarization of Turkish texts using latent semantic analysis,” in Proceedings of the 23rd International Conference on Computational Linguistics, pp. 869–876, Association for Computational Linguistics, Beijing, China, August 2010.
- M. W. Berry and R. D. Fierro, “Low-rank orthogonal decompositions for information retrieval applications,” Numerical Linear Algebra with Applications, vol. 3, no. 4, pp. 301–327, 1996.
- M. W. Berry, S. T. Dumais, and G. W. O’Brien, “Using linear algebra for intelligent information retrieval,” SIAM Review, vol. 37, no. 4, pp. 573–595, 1995.
- A. Huang, “Similarity measures for text document clustering,” in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC’2008), pp. 49–56, Christchurch, New Zealand, April 2008.
- E. Jessup and J. Martin, “Taking a new look at the latent semantic analysis approach to information retrieval,” in Computational Information Retrieval, pp. 121–144, SIAM, Philadelphia, PA, USA, 2001.
- B. Kang, D. Kim, and S. Lee, “Exploiting concept clusters for content-based information retrieval,” Information Sciences, vol. 170, no. 2, pp. 443–462, 2005.
- M. Sokolova, N. Japkowicz, and S. Szpakowicz, “Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation, Lecture Notes in Computer Science,” in Proceedings of Australasian Joint Conference on Artificial Intelligence, pp. 1015–1021, Springer, Hobart, Australia, December 2006.
Copyright © 2019 Fahrettin Horasan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.