Scientific Programming

Volume 2019, Article ID 1095643, 12 pages

https://doi.org/10.1155/2019/1095643

## Alternate Low-Rank Matrix Approximation in Latent Semantic Analysis

Computer Engineering Department, Engineering Faculty, Kırıkkale University, Yahşihan, 71450 Kırıkkale, Turkey

Correspondence should be addressed to Fatih Varçın; fatihvarcin@kku.edu.tr

Received 18 June 2018; Accepted 29 November 2018; Published 3 February 2019

Academic Editor: Danilo Pianini

Copyright © 2019 Fahrettin Horasan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Latent semantic analysis (LSA) is a mathematical/statistical way of discovering hidden concepts between terms and documents or within a document collection (i.e., a large corpus of text). Each document and term of the corpus is expressed as a vector with elements corresponding to these concepts, forming a term-document matrix. The LSA then uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, to extract the more important relations, and to reduce the computational time. The irrelevant information is called “noise” and does not have a noteworthy effect on the meaning of the document collection. This is an essential step in the LSA. The singular value decomposition (SVD) has been the main tool for obtaining the low-rank approximation in the LSA. Since the document collection is dynamic (i.e., the term-document matrix is subject to repeated updates), the approximation must be renewed, either by recomputing the SVD or by updating it. However, the computational cost of recomputing or updating the SVD of the term-document matrix is very high when new terms and/or documents are added to a preexisting document collection. This issue has therefore opened the door to using other matrix decompositions for the LSA, such as ULV- and URV-based decompositions. This study shows that the truncated ULV decomposition (TULVD) is a good alternative to the SVD in LSA modeling.

#### 1. Introduction

Latent semantic analysis (LSA) is a mathematical/statistical method used for discovering the latent relationships that exist between terms and documents or within a collection of documents (i.e., a large corpus of text) [1]. Although the LSA works especially well on textual data, it has recently become very popular in the academic community because of its wide variety of applications in content information [2], sociological discourse analysis [3], image retrieval systems [4], human cognition, and human learning [5]. The LSA can be applied to any collection of documents that has been stripped of its syntactic and grammatical structure. If the collection of documents contains *m* terms and *n* documents, it is represented by a matrix *A* of dimension $m \times n$, called the term-document matrix.
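As a minimal illustration (not taken from the paper), the following numpy sketch builds a term-document matrix for a toy three-document collection using simple raw term counts; the corpus and weighting scheme are assumptions for demonstration only:

```python
import numpy as np

# Toy corpus of n = 3 documents, already stripped of grammatical structure.
docs = [
    "latent semantic analysis of text",
    "singular value decomposition of a matrix",
    "low rank matrix approximation",
]

# Vocabulary: the m distinct terms appearing in the collection.
terms = sorted({word for doc in docs for word in doc.split()})

# A[i, j] = raw count of term i in document j (simple term-frequency weighting).
A = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[terms.index(word), j] += 1

print(A.shape)  # (m, n) = (13, 3): 13 distinct terms, 3 documents
```

In practice the raw counts are usually replaced by a weighting such as tf-idf before the low-rank approximation is computed.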

The LSA uses a low-rank approximation to the term-document matrix in order to remove irrelevant information, to extract the more important relations, and to reduce the computational time. The irrelevant information is called “noise” and does not have a noteworthy effect on the meaning of the document collection [6].

The low-rank approximation of the term-document matrix *A* is, for a positive integer $k < \operatorname{rank}(A)$, the matrix $A_k$ that satisfies

$$\|A - A_k\| = \min_{\operatorname{rank}(B) \le k} \|A - B\|, \qquad (1)$$

where $\|\cdot\|$ represents either the two-norm or the Frobenius norm. The existence of such a matrix follows from the singular value decomposition (SVD) of *A*. Moreover, without doubt, the truncated singular value decomposition is the main tool for solving the minimization problem given by (1). However, in the LSA, where document collections are dynamic over time, i.e., the term-document matrix is subject to repeated updates, the SVD becomes prohibitive due to its high computational expense. Thus, alternative decompositions have been proposed for these applications, such as the low-rank ULV/URV decompositions [7] and the truncated ULV decomposition (TULVD) [8]. Recall that the initial computing cost of the low-rank ULV/URV decompositions and the TULVD is lower than that of the SVD [7].
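The minimization problem (1) can be sketched in a few lines of numpy (an illustration, not the paper's implementation): truncating the SVD to the *k* leading singular triplets yields the best rank-*k* approximation, and by the Eckart–Young theorem its two-norm error equals the (*k*+1)-th singular value:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))  # stand-in for a term-document matrix

# Thin SVD: W is 8x5, s holds the 5 singular values, Yt is 5x5.
W, s, Yt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = W[:, :k] @ np.diag(s[:k]) @ Yt[:k, :]  # truncated SVD, rank k

# Eckart-Young: the two-norm error of the best rank-k approximation
# is exactly the (k+1)-th singular value, s[k] in 0-based indexing.
err = np.linalg.norm(A - A_k, 2)
assert np.isclose(err, s[k])
```

Recomputing this factorization from scratch after every corpus update is exactly the cost that motivates the ULV-based alternatives discussed next.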

This manuscript demonstrates that the TULVD is a good substitute for the SVD in LSA modeling.

The rest of the manuscript is organized as follows. In Section 2, we introduce some notation and cover critical background material in numerical linear algebra. In Section 3, we give the main steps of our LSA modeling. In Section 4, we test our model using some commonly used test collections and present simulation results. In Section 5, we discuss the simulation results.

#### 2. Notations and Background

##### 2.1. Notations

Throughout the paper, uppercase letters such as *A* denote matrices. The identity matrix is denoted by $I$. Moreover, $\|\cdot\|_2$ denotes the spectral norm, and $\|\cdot\|_F$ denotes the Frobenius norm. The notation $\mathbb{R}^{m \times n}$ represents the set of $m \times n$ real matrices. An $m \times n$ matrix *A* is represented as $A = [a_{ij}]$, where $a_{ij}$ is the entry of *A* in row *i* and column *j*, with $1 \le i \le m$ and $1 \le j \le n$.

##### 2.2. Orthogonal Matrix Decompositions

*Definition 1 (the singular value decomposition).* For a matrix $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, the singular value decomposition (SVD) is

$$A = W \Sigma Y^{T}, \qquad (2)$$

where the left and right singular matrices $W \in \mathbb{R}^{m \times m}$ and $Y \in \mathbb{R}^{n \times n}$ are orthogonal and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix with the following order:

$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0. \qquad (3)$$

The diagonal entries $\sigma_i$ of $\Sigma$ are called the singular values of *A*.
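The defining properties of the decomposition in Definition 1 can be checked numerically; the following numpy sketch (an illustration with an arbitrary test matrix, not part of the paper) verifies the orthogonality of *W* and *Y*, the ordering of the singular values, and the factorization itself:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))  # m = 6 >= n = 4

# Full SVD: W is 6x6, Yt is the 4x4 transpose of Y, s holds sigma_1..sigma_4.
W, s, Yt = np.linalg.svd(A)

# Singular values are nonnegative and sorted in decreasing order.
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)

# W and Y are orthogonal: W^T W = I_m and Y^T Y = I_n.
assert np.allclose(W.T @ W, np.eye(6))
assert np.allclose(Yt @ Yt.T, np.eye(4))

# A = W * Sigma * Y^T, with Sigma the 6x4 diagonal matrix of singular values.
Sigma = np.zeros((6, 4))
np.fill_diagonal(Sigma, s)
assert np.allclose(A, W @ Sigma @ Yt)
```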

For a given positive integer $k$, we block partition the SVD in (2) as

$$A = \begin{bmatrix} W_k & W_0 \end{bmatrix} \begin{bmatrix} \Sigma_k & 0 \\ 0 & \Sigma_0 \end{bmatrix} \begin{bmatrix} Y_k & Y_0 \end{bmatrix}^{T}, \qquad (4)$$

where $\Sigma_k \in \mathbb{R}^{k \times k}$ and $\Sigma_0$ are diagonal matrices containing the *k* largest and the $n-k$ smallest singular values of *A*, respectively (Figure 1). The matrix defined by

$$A_k = W_k \Sigma_k Y_k^{T} \qquad (5)$$

is called the rank-*k* matrix approximation to *A*. For some tolerance $\epsilon$ that is proportional to the machine unit, if the singular values satisfy

$$\sigma_k > \epsilon \ge \sigma_{k+1}, \qquad (6)$$

then the value *k* is called the numerical rank of the matrix *A*. However, we are aware that the determination of the numerical rank is a sensitive computation, especially when there is no well-defined gap between the singular values [9, 10]. Moreover, in some situations, like the example in Section 5.4.1 of [10], the tolerance is chosen slightly larger. The time complexity of obtaining $A_k$ is $O(mn^2)$ [7].
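The numerical-rank criterion above amounts to counting the singular values that exceed the tolerance. The following numpy sketch (illustrative only; the tolerance value and test matrix are assumptions) constructs a matrix with a well-defined gap after the second singular value and recovers its numerical rank:

```python
import numpy as np

# Construct A with known singular values: two dominant ones and two at
# roundoff level, so the gap between sigma_2 and sigma_3 is well defined.
rng = np.random.default_rng(2)
W, _ = np.linalg.qr(rng.standard_normal((6, 4)))  # 6x4 with orthonormal columns
Y, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # 4x4 orthogonal
s_true = np.array([5.0, 3.0, 1e-10, 1e-12])
A = W @ np.diag(s_true) @ Y.T

s = np.linalg.svd(A, compute_uv=False)  # singular values, descending

# Numerical rank: the number of singular values above the tolerance eps.
eps = 1e-8
k = int(np.sum(s > eps))
print(k)  # 2
```

When the gap is not well defined, small changes in `eps` change `k`, which is exactly the sensitivity noted above.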

Even though the SVD provides accurate subspaces, as mentioned above, it is not suitable for dynamic problems where the data changes (i.e., updates and/or downdates) due to its high computational demand for both dense [11] and sparse [12] matrices. Herein, we consider the ULV-based TULVD for approximating the matrix subspaces.