Scientific Programming

Volume 2017, Article ID 8131390, 19 pages

https://doi.org/10.1155/2017/8131390

## A Heterogeneous System Based on Latent Semantic Analysis Using GPU and Multi-CPU

^{1}Universidad Politécnica Salesiana, Cuenca, Ecuador
^{2}Universidad de Guadalajara, Guadalajara, JAL, Mexico
^{3}Technical and Industrial Teaching Center, Guadalajara, JAL, Mexico

Correspondence should be addressed to Gabriel A. León-Paredes; gleon@ups.edu.ec

Received 15 June 2017; Accepted 26 September 2017; Published 5 November 2017

Academic Editor: José María Álvarez-Rodríguez

Copyright © 2017 Gabriel A. León-Paredes et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing the term-by-document matrix using the Singular Value Decomposition (SVD) technique. However, LSA has a high computational cost when analyzing large amounts of information. The goals of this work are (i) to improve the execution time of the semantic space construction, dimensionality reduction, and information retrieval stages of LSA based on heterogeneous systems and (ii) to evaluate the accuracy and recall of the information retrieval stage. We present a heterogeneous Latent Semantic Analysis (hLSA) system, which has been developed on two architectures: General-Purpose computing on Graphics Processing Units (GPGPU), which can solve large numeric problems faster through the thousands of concurrent threads on the multiple CUDA cores of GPUs, and multi-CPU, which can solve large text problems faster through a multiprocessing environment. We execute the hLSA system with documents from the PubMed Central (PMC) database. The results of the experiments show that the acceleration reached by the hLSA system for large matrices with around 150 billion values is roughly eight times that of the standard LSA version, with an accuracy of 88% and a recall of 100%.

#### 1. Introduction

Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing a term-by-document matrix using term weighting schemes such as Log Entropy or Term Frequency-Inverse Document Frequency (TF-IDF) and using the Singular Value Decomposition (SVD) technique. LSA addresses one of the main problems of information retrieval techniques, namely handling polysemous words, by assuming that there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice [1]. LSA uses statistical techniques to estimate this latent structure and get rid of the obscuring “noise.” LSA has also been considered a new general theory of the acquisition of similarities and knowledge representation, which is helpful in simulating the learning of vocabulary and other psycholinguistic phenomena [2].

Latent Semantic Analysis, from its beginnings to the present, has been applied to several research topics, for example, in applications to predict a reader’s interest in a selection of news articles based on their reported interest in other articles [3]; in the self-diagnosis of diseases through the description of medical imaging [4]; in applications to detect cyberbullying among teens and young adults [5]; in the field of computer vision, by improving techniques for tracking moving people [6]; and in applications for the classification of less popular websites [7].

LSA has a computational complexity of $O(n^2 k)$, where $n$ is the smaller value between the number of documents and the number of terms and $k$ is the number of singular values [8]. LSA takes a considerable amount of time to index and to compute the semantic space when it is applied to large-scale datasets [9–11].

An early parallel LSA implementation based on a GPU achieved an acceleration of five to seven times for large matrices with dimensions divisible by 16, and a two-fold acceleration for matrices of other sizes. The GPU is used for the tridiagonalization of matrices, while the routines that compute the eigenvalues and eigenvectors of matrices are still implemented on the CPU. The results show that both accuracy and speed needed further research in order to produce an effective, fully implementable LSA algorithm [12].

A technique called *index interpolation* has been presented for the rapid computation of the term-by-document matrix for large document collections; the associated symmetric eigenvector problem is then solved by distributing its computation among any number of computational units without increasing the overall number of multiplications. The experiments took 42.5 hours to compute 300,000 terms on 16 CPUs [13].

We present a fully *heterogeneous system based on Latent Semantic Analysis* (hLSA), which utilizes the resources of both GPU and CPU architectures to accelerate execution time. Our aim is to compute, reduce, and retrieve information faster than standard LSA versions and to evaluate the accuracy and recall of the information retrieval procedure in the hLSA system. The performance of hLSA has been evaluated, and the results show that accelerations as high as eight times can be achieved, with an accuracy of 88% and a recall of 100%. An early version of the hLSA system was presented as a poster at the GPU Technology Conference [14].

The rest of the paper is organized as follows. Section 2 introduces the related background of LSA. In Section 3, we present our heterogeneous Latent Semantic Analysis system. Section 4 gives a description of the design of experiments. Section 5 presents the results of the experiments. Finally, Section 6 concludes the work.

#### 2. Background

LSA takes a term-by-document matrix (*M*) and constructs a semantic space wherein terms and documents that are closely associated are placed near one another. Normally, the constructed semantic space has as many dimensions as there are unique terms. Additionally, instead of working with raw count data, the entries of matrix *M* are weighted with a representation of the occurrence of a word token within a document. Hence, LSA uses a normalized matrix, which can be large and rather sparse. For this research, two weighting schemes are used.

(i) The first is a logarithmic local and global entropy weighting, known as the Log Entropy scheme. That is, if $f_{ij}$ denotes the number of times (frequency) that word $i$ appears in document $j$ and $n$ is the total number of documents in the dataset, then

$$a_{ij} = \log_2\left(f_{ij} + 1\right)\left(1 + \frac{\sum_{j=1}^{n} p_{ij} \log_2 p_{ij}}{\log_2 n}\right),$$

where $p_{ij}$ is the fraction of the occurrences of term $i$ that fall in document $j$; for example,

$$p_{ij} = \frac{f_{ij}}{\sum_{j=1}^{n} f_{ij}}.$$

This particular term weighting scheme has been very successful for many LSA studies [15], but other functions are possible.

(ii) The second is a term frequency and inverse document frequency weighting, known as the TF-IDF scheme, which assigns to word $i$ a weight in document $j$, where $n$ is the total number of documents in the dataset and $df_i$ is the number of documents that contain term $i$; for example,

$$a_{ij} = f_{ij} \log_2\left(\frac{n}{df_i}\right).$$

This particular term weighting scheme has also been very successful for many LSA studies [8].
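The two weighting schemes above can be sketched on the CPU with NumPy. This is an illustrative host-side version, not the hLSA implementation itself, and the function names `log_entropy` and `tf_idf` are ours:

```python
import numpy as np

def log_entropy(F):
    """Log Entropy weighting of a term-by-document count matrix F (t x d).

    p_ij = f_ij / gf_i, where gf_i is the global frequency of term i;
    the global weight is g_i = 1 + sum_j p_ij * log2(p_ij) / log2(n).
    """
    n = F.shape[1]                       # number of documents
    gf = F.sum(axis=1, keepdims=True)    # global frequency of each term
    p = np.divide(F, gf, out=np.zeros_like(F, dtype=float), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)
    g = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log2(n)
    return np.log2(F + 1.0) * g          # local weight times global weight

def tf_idf(F):
    """TF-IDF weighting: a_ij = f_ij * log2(n / df_i)."""
    n = F.shape[1]
    df = (F > 0).sum(axis=1, keepdims=True)  # documents containing term i
    return F * np.log2(n / df)
```

Both functions take the raw count matrix $F$ and return the weighted matrix $A$ that is fed to the SVD stage; in practice $F$ would be stored sparsely, since most entries are zero.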

To reflect the major associative patterns in matrix *A* and ignore the smaller, less important influences, a reduced-rank approximation of matrix *A* is computed using the truncated Singular Value Decomposition [16]. Note that the SVD of the original weighted matrix can be written as

$$A = U \Sigma V^{T},$$

where $A$ is the $t \times d$ words-by-documents matrix; $U$ is a $t \times r$ orthogonal matrix whose columns are the left singular vectors of $A$; $V$ is a $d \times r$ orthogonal matrix whose columns are the right singular vectors of $A$; and $\Sigma$ is an $r \times r$ diagonal matrix which contains the singular values of $A$ in descending order. Note that $r$ is the smaller value between the total number of words $t$ and the total number of documents $d$.

To obtain the truncated SVD, denoted by $A_k$, the SVD matrices are restricted to their first $k < \min(t, d)$ dimensions:

$$A_k = U_k \Sigma_k V_k^{T}.$$

Choosing the appropriate number of dimensions $k$ is an open research problem. It has been proposed that the optimum value of $k$ lies in a range from 50 to 500 dimensions, depending on the size of the dataset. As described in [17], if the number of dimensions is too small, significant semantic content will remain uncaptured, and if it is too large, random noise in word usage will be remodeled. Note that the truncated SVD represents both terms and documents as vectors in a $k$-dimensional space.
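The truncation step can be sketched with NumPy's dense SVD routine. This is a CPU reference sketch only; the hLSA system performs this stage on the GPU, and the function name `truncated_svd` is ours:

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k truncation A_k = U_k S_k V_k^T of a t x d weighted matrix A.

    numpy.linalg.svd returns singular values in descending order, so the
    truncation keeps the k largest ones.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
    return U[:, :k], s[:k], Vt[:k, :]
```

The rank-$k$ approximation is then recovered as `U_k @ np.diag(s_k) @ Vt_k`; the rows of $U_k$ and the columns of $V_k^T$ are the term and document vectors of the $k$-dimensional semantic space.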

Finally, for information retrieval purposes, the $k$-dimensional semantic space is used. The terms of a user query are folded into the $k$-dimensional semantic space to identify a point in the space. This can be accomplished by parsing the query into a vector, denoted by $q$, whose nonzero values correspond to the term weights of all unique valid words of the user query. The query folding process, denoted by $\hat{q}$, can then be represented as

$$\hat{q} = q^{T} U_k \Sigma_k^{-1}.$$

This vector can then be compared with any or all document/term vectors of the $k$-dimensional semantic space. To compare vectors, the dot product or cosine between points is used; for example,

$$\operatorname{sim}\left(\hat{q}, d_j\right) = \frac{\hat{q} \cdot d_j}{\left\lVert \hat{q} \right\rVert \left\lVert d_j \right\rVert},$$

where $d_j$ is the vector representation of document $j$ in the $k$-dimensional space. LSA proposes the retrieval of information in two ways: by establishing a minimum value of similarity (for example, all similarities greater than 0.90) or by obtaining the top-ranked values of similarity (for example, the top 10 or the top 5).
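The folding and comparison steps can be sketched as follows; this is a host-side illustration of the two formulas above, with the helper names `fold_in_query` and `cosine_sim` being ours:

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """Fold a query weight vector q (length t) into the k-dim space:
    q_hat = q^T U_k S_k^{-1} (dividing by s_k applies the inverse of the
    diagonal matrix S_k)."""
    return q @ U_k / s_k

def cosine_sim(q_hat, D):
    """Cosine similarity between q_hat (k,) and document vectors stored
    as the rows of D (d x k)."""
    return (D @ q_hat) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q_hat))
```

With `U_k, s_k, Vt_k` from the truncation stage, the document vectors are `D = Vt_k.T`, and the top-ranked documents are `np.argsort(-cosine_sim(q_hat, D))`.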

#### 3. Heterogeneous Latent Semantic Analysis System

The hLSA system faces several technical challenges: how to construct the *semantic space* using the multi-CPU architecture to speed up text processing; how to *reduce the dimensionality* of the term-by-document matrix using the GPU architecture to accelerate matrix processing; and how to *retrieve information* from the semantic space using GPU mechanisms to speed up the matrix and text computations. Figure 1 presents the proposed hLSA system for constructing, reducing, and retrieving relevant documents using heterogeneous architectures.