Abstract

Machine learning techniques have been widely used in almost every area of the arts, science, and technology over the last two decades. Document analysis and query expansion also make broad use of machine learning techniques for information retrieval tasks. State-of-the-art models such as the Bo1, Bo2, KL divergence, and chi-square models are probabilistic and work with DFR-based (divergence from randomness) retrieval models. These models focus primarily on term frequency and do not consider the semantic relationships among terms. The proposed model applies a semantic method to find the semantic similarity among terms in order to expand the query. The proposed method uses the relevance feedback method, which selects user-assisted relevant documents from the top "k" initially retrieved documents, and then applies a deep neural network technique to select the most informative terms related to the original query terms. The results are evaluated on the FIRE 2011 ad hoc English test collection. The mean average precision (MAP) of the proposed method is 0.3568. The proposed method is also compared with state-of-the-art models and shows improvements of 19.77% and 8.05% on the MAP parameter with respect to the original query and the Bo1 model, respectively.

1. Introduction

The most tedious task of a retrieval system is to retrieve exactly those documents that are relevant to the user query. The user starts searching for documents by submitting keywords in the form of a query. However, queries generally contain very few terms to express the user's information need, which makes it nearly impossible to retrieve all relevant documents from a large document collection. Query expansion is a technique that selects the most informative terms related to a user query and uses them to expand the original query. Global and local methods are the two major classes of query expansion. The global method selects expansion terms from external datasets. WordNet [1] and Word2Vec [2] are the two most well-known external resources commonly used to select semantically related terms for the expansion task. On the other hand, the local method selects expansion terms from the initially retrieved documents. The local method is mainly classified into two categories: the pseudorelevance feedback method and the relevance feedback method. The pseudorelevance feedback method selects expansion terms from the top "k" initially retrieved documents. The relevance feedback method, in contrast, selects expansion terms from the subset of the top "k" ranked documents that the user judges relevant to the query.

Machine learning [3] techniques are broadly used to extract useful information by training models on large collections of data. Machine learning has been applied in [4, 5] to identify liver patients from a liver patient dataset and to classify positive and negative tweets in a social media dataset. The neural network technique is widely used in data mining and text mining for classification tasks. The general architecture of a neural network contains an input layer, an output layer, and a hidden layer with an activation function that works between the input and output layers. Weights are initialized and then updated repeatedly until a target value is reached. A deep neural network is an extended form of the neural network architecture that contains several hidden layers between the input and output layers. In a deep neural network architecture, weights are propagated from the input layer to the hidden layers and from the hidden layers to the output layer, and vice versa. For document retrieval, there are two neural network-based approaches: (1) the continuous bag of words (CBOW) model and (2) the skip-gram model.

The continuous bag of words model predicts the center word from its given context words. The skip-gram model does the opposite of the CBOW model; i.e., it predicts the context words for a given center word.
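To make the distinction concrete, the following minimal sketch trains both variants on a toy corpus using the gensim library (the corpus, parameters, and library choice are illustrative assumptions and are not part of the original experiment); the sg flag switches between CBOW and skip-gram.

# Illustrative only: CBOW vs. skip-gram with gensim on a toy corpus.
from gensim.models import Word2Vec

corpus = [
    ["tiger", "conservation", "india"],
    ["tiger", "population", "india", "forest"],
]

# sg=0 -> CBOW: predict the center word from its context words.
cbow = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# sg=1 -> skip-gram: predict the context words from the center word.
skipgram = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(skipgram.wv.most_similar("tiger", topn=3))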

The proposed model uses the skip-gram architecture to train the corpus words. The proposed model initially selects the user-assisted relevant documents from the top "k" ranked documents; here, k is set to 30. For each query term, the vocabulary terms within the user-assisted relevant documents are trained to extract its context terms, which are then merged. The merged terms are considered the most informative terms for the given query and are treated as expansion terms. At the input layer, query terms are represented by one-hot encodings. In the one-hot encoding representation, a vector of vocabulary size is initialized with all entries zero except at the index where the query term appears, which is set to 1. Then, the context words are predicted by successively updating the weight matrices between the input and hidden layers and between the hidden and output layers. The detailed discussion is presented in Section 3. The proposed model has also been compared with state-of-the-art models [6]. The proposed model architecture is shown in Figure 1.
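As a toy illustration of this one-hot representation (the vocabulary and query term below are invented for the example and are not taken from the test collection):

import numpy as np

# Hypothetical vocabulary built from the user-selected relevant documents.
vocab = ["tiger", "conservation", "india", "forest", "reserve"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(term, vocab_size, term_index):
    """Vector of vocabulary size: all zeros except a 1 at the term's index."""
    x = np.zeros((vocab_size, 1))
    x[term_index[term]] = 1.0
    return x

x_query = one_hot("tiger", len(vocab), index)  # shape (|V|, 1)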

2. Related Work

The vector space model [7] is one of the oldest models for the retrieval task. Every retrieval model suffers from a common shortcoming: the keywords submitted by a user are often not focused on the main topic, and the user may be unsure of what he/she is looking for [8]. This leads to the retrieval of irrelevant documents. Query expansion plays a key role in maximizing the retrieval of relevant documents. Maron and Kuhns [9] applied a query expansion technique to improve the performance of the retrieval system. Allan [10] used a relevance feedback model for selecting the most informative terms for query expansion.

A relevance-based probabilistic language model [11] over the term collection has been developed to select the most informative terms for query expansion. Bendersky et al. [12] used external resources and probabilistic language modeling to weight terms for the query expansion task. Miao et al. [13] proposed a proximity-based query expansion model that emphasizes the proximity of terms rather than their positional information. Lv and Zhai [14] presented a positional relevance model that focuses on the position of query terms in the relevance feedback documents. Metzler and Croft [15] used a Markov random field-based term dependency model for retrieving the most informative terms for query expansion. Dalton and Dietz [16] proposed a relevance feedback-based neighborhood relevance model that performs entity linking across the document and query collections to expand the query. The expectation-maximization algorithm [17] has been used to maximize the likelihood of relevant documents among the top-ranked documents and then to retrieve the most informative terms from these relevant documents for the query expansion task. Qiu and Frei [18] expanded the query using a query log by calculating the correlation between query terms and document terms. Baeza-Yates and Tiberi [19] proposed a model in which a search engine query log is represented by a bipartite graph, with edges connecting query nodes and URL nodes through click-through data; they observed an improvement of 10% in the mean average precision value. Deerwester et al. [20] proposed the singular value decomposition-based latent semantic indexing (LSI) technique to extract the most informative terms for query expansion. Singh and Sharan [21] proposed a fuzzy logic-based technique on the top "k" ranked documents for the query expansion task. A word embedding-based neural network technique using Word2Vec for the query expansion task was proposed in [22]; terms similar to the original query terms were extracted using the k-nearest neighbor method. Kuzi et al. and Mikolov et al. [23, 24] also used the word embedding technique for expanding the original query. Diaz et al. [25] used locally trained word embedding techniques such as Word2Vec or GloVe for query expansion tasks. Imani et al. [26] proposed a Word2Vec-based continuous bag of words model for selecting the most informative terms for query expansion. A thesaurus-based query expansion model using Wikipedia was proposed in [27]. Xu et al. [28] used Wikipedia to categorize queries into three categories: broader queries, ambiguous queries, and entity queries; they selected the expansion terms using term distribution and the structure of Wikipedia documents. Crouch and Yang [29] built a statistical thesaurus by clustering the whole document collection using a complete-link clustering algorithm. Shukla and Das [30] proposed a pseudorelevance feedback and deep neural network-based method to expand the query. A hybrid model-based method [31, 32] was proposed to expand the query and reduce the word mismatch between corpus words and query terms.

The rest of the article is organized as follows. Section 3 describes the mathematical notation and formulation of the proposed model. Section 4 elaborates on query expansion using the proposed model. Section 5 presents the experimental results of the proposed model. Section 6 discusses the overall performance of the proposed model. Section 7 concludes and outlines future work.

3. Mathematical Formulation of the Proposed Model

The proposed model is based on the deep neural network-based skip-gram model. The skip-gram model is used to predict the context words for a given center word positioned at index $c$. This model defines a fixed window size $m$ and treats the words at positions $c-m$ to $c+m$ (excluding the center word) as context words. For example, for the sentence "Abhishek is fond of learning Mathematics, Physics and Computer," after stop word removal the sentence becomes "Abhishek fond learning Mathematics Physics computer." If the center word is "learning," then with a window of size 2 the context words will be {Abhishek, fond, Mathematics, Physics}. In this architecture, we create a one-hot vector representation of vocabulary size $|V|$ for the given center word. In the one-hot vector representation, we set 1 at the position where the center word occurs and 0 at the rest of the positions. The sparse one-hot vector is then transformed into a lower-dimensional dense representation. The skip-gram model uses the following notation for the mathematical modeling.
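The window construction can be sketched as follows (a toy reimplementation of the example above, not the authors' code):

def context_words(tokens, center_pos, window=2):
    """Return the words at positions center_pos-window .. center_pos+window,
    excluding the center word itself."""
    left = max(0, center_pos - window)
    right = min(len(tokens), center_pos + window + 1)
    return [tokens[i] for i in range(left, right) if i != center_pos]

tokens = ["Abhishek", "fond", "learning", "Mathematics", "Physics", "computer"]
print(context_words(tokens, tokens.index("learning")))
# ['Abhishek', 'fond', 'Mathematics', 'Physics']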

Input layer: one-hot vector $x_c$ of size $(|V|, 1)$ with 1 at the index where the center word occurs and 0 at the rest of the indexes.

Target output layer: one-hot vector $y_c$ of size $(|V|, 1)$ with 1 at the index where the context word occurs and 0 at the rest of the indexes.

Hidden layer: hidden layer $h$ of size $(N, 1)$, where $N$ is the dimension of the hidden (embedding) layer.

Predicted output: probability vector $\hat{y}$ of size $(|V|, 1)$.

Random weight matrices: two random weight matrices, $W$ of size $(|V| \times N)$ between the input and hidden layers and $W'$ of size $(N \times |V|)$ between the hidden and predicted output layers.
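A minimal numpy sketch of these quantities (the vocabulary size, embedding dimension, and word indexes below are arbitrary placeholders):

import numpy as np

V = 5   # vocabulary size |V| (placeholder)
N = 3   # hidden (embedding) layer dimension (placeholder)

x_c = np.zeros((V, 1)); x_c[2] = 1.0   # one-hot input vector for the center word
y_c = np.zeros((V, 1)); y_c[4] = 1.0   # one-hot target vector for one context word

W  = np.random.rand(V, N)   # random weights between the input and hidden layers
Wp = np.random.rand(N, V)   # random weights between the hidden and output layers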

Now, using the above notation, the hidden layer is computed as
$$h = W^{T} x_c$$
where $W^{T}$ represents the transpose of the matrix $W$. The value obtained is then mapped through the weight matrix $W'$. Thus, we have
$$u = W'^{T} h$$

Now, the predicted output is computed using the following softmax relation:
$$\hat{y}_j = \frac{\exp(u_j)}{\sum_{j'=1}^{|V|} \exp(u_{j'})}, \qquad j = 1, 2, \ldots, |V|$$

The loss function is basically the sum of the cross-entropies between the predicted output values and the target values. Mathematically,
$$E = -\log p(w_1, w_2, \ldots, w_C \mid w_c) = -\sum_{c'=1}^{C} u_{j^{*}_{c'}} + C \log \sum_{j'=1}^{|V|} \exp(u_{j'})$$
where $w_1, w_2, \ldots, w_C$ are the context words for the center word $w_c$ and $j^{*}_{c'}$ is the vocabulary index of the $c'$-th context word. Now, using the gradient descent method,
$$\frac{\partial E}{\partial v'_{w_j}} = EI_j \, h$$
where $v'_{w_j}$ is the $j$-th column of the weight matrix $W'$ and $EI_j = \sum_{c'=1}^{C} (\hat{y}_j - t_{c',j})$ is the prediction error summed over the context words. Also,
$$\frac{\partial E}{\partial W} = x_c \, (EH)^{T}$$
where $EH_i = \sum_{j=1}^{|V|} EI_j \, w'_{ij}$.

Now, our objective is to minimize the loss function; therefore, the weights are updated as follows:
$$v'^{\,(new)}_{w_j} = v'^{\,(old)}_{w_j} - \eta \, EI_j \, h, \qquad j = 1, 2, \ldots, |V|$$
$$W^{(new)} = W^{(old)} - \eta \, x_c \, (EH)^{T}$$
where $\eta$ represents the learning rate of the deep neural network architecture.
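Putting these relations together, one forward/backward pass for a single (center word, context word) pair can be sketched in numpy as follows (a didactic reconstruction of the update rules above with placeholder shapes, not the authors' implementation):

import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))            # numerically stable softmax
    return e / e.sum()

def train_step(x, y, W, Wp, eta=0.05):
    """One skip-gram update for a single (center, context) one-hot pair.
    Shapes: x, y -> (|V|, 1); W -> (|V|, N); Wp -> (N, |V|)."""
    h = W.T @ x                          # hidden layer, shape (N, 1)
    u = Wp.T @ h                         # scores over the vocabulary, shape (|V|, 1)
    y_hat = softmax(u)                   # predicted probability vector
    loss = -float(np.log(y_hat[y.argmax()]))  # cross-entropy for this context word
    e = y_hat - y                        # prediction error (EI for one context word)
    Wp_new = Wp - eta * (h @ e.T)        # dE/dW' = h e^T
    W_new = W - eta * (x @ (Wp @ e).T)   # dE/dW  = x (W' e)^T = x (EH)^T
    return W_new, Wp_new, loss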

4. Query Expansion Using the Proposed Method

The proposed method comprises a user-assisted relevance feedback model to select the relevant documents for training the query terms using the skip-gram model. Rocchio [33] was the first to use the relevance feedback method for query expansion. The user initially selects the relevant documents from the top "k" retrieved documents. These relevant documents are then considered as the dataset from which the most informative terms are extracted using the skip-gram model. The unique terms contained in the relevant documents are considered the vocabulary terms. Query terms are encoded using one-hot encoding, which creates a hot vector for each query term. The size of the hot vector is the size of the vocabulary. Each entry of the hot vector is 0 except at the index of the query term, which is set to 1. In the skip-gram model, the n-gram refers to the number of words taken together. For example, for the query "tiger conservation in India," a 2-gram representation will be "tiger conservation," "conservation in," and "in India." These grams are trained to predict the context words. Skip refers to the number of times a query term is present throughout the relevant documents. The skip-gram architecture consists of three layers: the input layer, the hidden layer, and the output layer. At the input layer, a one-hot vector is supplied to predict the context words. At the output layer, n-gram terms are trained to predict the context words. The hidden layer is the dense representation of the hot vector. A random weight matrix is assigned between the input and hidden layers and between the hidden and output layers. The structure of the skip-gram model is shown in Figure 2. The hot vector is mapped to a lower-dimensional representation in the hidden layer by applying the dot product between the hot vector and the random weight matrix. If $x$ represents the one-hot vector for a query term and $W$ represents the random weight matrix between the input and hidden layers, the hidden layer $h$ can be represented by
$$h = W^{T} x$$

The hidden layer is propagated to the next layer by computing the dot product between the hidden layer and the random weight matrix $W'$, as
$$u = W'^{T} h$$

Weights are updated at each iteration by the following computations:
$$W'^{(new)} = W'^{(old)} - \eta \, h \, (\hat{y} - y)^{T}$$
$$W^{(new)} = W^{(old)} - \eta \, x \, \big(W' (\hat{y} - y)\big)^{T}$$

The most probable context words for each query term are predicted in an unsupervised manner by successively updating the weights and calculating the probabilities at the softmax layer; the predicted context words of all query terms are then merged. The merged terms are appended to the original query terms to expand the query. The algorithm of the proposed model is shown in Algorithm 1.

1. for each q in Q, do
  1.1. Retrieve top "k" documents dk from the initially retrieved documents.
2. for each d in dk, do
  2.1. Retrieve the user-assisted relevant documents du and store them in Du.
3. for each t in q, do
  3.1. Create a one-hot vector x for term t from the vocabulary of Du.
4. iter←epoch
5. Initialize weight matrices w and w' with random weights.
6. for i←1 to iter, do
  6.1. Compute h←wT·x
  6.2. Compute v←w'T·h
  6.3. Compute ŷ←softmax(v) and error e←ŷ − y
  6.4. Update w'new←w' − η·h·eT
  6.5. Update wnew←w − η·x·(w'·e)T
7. mer←[]
8. for each t in q, do
  8.1. Retrieve the top 15 context words wcons for wt such that ŷ is maximum.
  8.2. mer←mer + wcons
9. qexp←q + mer
10. return qexp
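For readers who want a runnable counterpart, the following sketch mirrors the spirit of Algorithm 1 using gensim's skip-gram implementation in place of the hand-rolled network (the helper name, tokenizer, query, and documents are hypothetical; the user-assisted relevance judgments are simply supplied as a list):

from gensim.models import Word2Vec

def expand_query(query_terms, relevant_docs, topn=15, epochs=70):
    """Train skip-gram on the user-selected relevant documents and append the
    top-n context words of each query term to the original query."""
    sentences = [doc.lower().split() for doc in relevant_docs]   # toy tokenizer
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=1, sg=1, epochs=epochs)
    merged = []
    for term in query_terms:
        if term in model.wv:
            merged += [w for w, _ in model.wv.most_similar(term, topn=topn)]
    return query_terms + merged            # expanded query q_exp

# Hypothetical usage with user-assisted relevant documents D_u.
D_u = ["tiger conservation project in india", "forest reserve set up for tiger protection"]
print(expand_query(["tiger", "conservation", "india"], D_u))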

5. Experimental Results

For the experimental analysis of the proposed method, we have used the FIRE 2011 [35] ad hoc English test collection. The Terrier 3.5 search engine has been used to evaluate the results of the proposed method on the underlying dataset. A stop word list and the Porter stemmer are used to remove stop words and to stem words to their roots, respectively. The experiment has been performed on 50 queries, ranging from Q126 to Q175. The documents are retrieved using the InL2 model. The mean average precision (MAP) value of the proposed method is observed to be 0.3568; to achieve this result, we performed 70 epochs. The proposed model is compared with other query expansion models, and improvements of 19.77% and 8.05% are observed with respect to the original query and the Bo1 model, respectively, on the MAP parameter. The performance of the original query, query expansion using the Bo1 model, the Bo2 model, the chi-square model, the KL divergence model, the proposed model, and the BM25 model is shown in Tables 1–7, respectively.
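For reference, the mean average precision reported above can be computed from ranked result lists and relevance judgments as sketched below (a generic illustration with made-up document identifiers, independent of the Terrier run used in the paper):

def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against its relevance judgments."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two queries and made-up document identifiers.
print(mean_average_precision([
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5"}),
]))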

To test whether the observed improvement is statistically significant, a test statistic is defined as
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
where (i) $\bar{x}$ is the sample mean, (ii) $\mu_0$ is the mean postulated in the null hypothesis, (iii) $n$ is the sample size, and (iv) $\sigma$ is the population standard deviation.

Now, substituting the sample mean, the population mean postulated in the null hypothesis, the population standard deviation, and the sample size into the above relation gives the observed value of the test statistic.

Since the observed value of the statistic does not exceed the critical value at the 0.05 significance level, the null hypothesis is not rejected. Therefore, there is no evidence to claim that the population mean is different from 0.3302 at the 0.05 significance level; hence, the improvement is statistically minor.
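The test above is a one-sample test with the population standard deviation treated as known; a small sketch of the computation is shown below (symbolic only, since the standard deviation used in the paper is not reproduced here).

import math

def z_statistic(sample_mean, mu0, sigma, n):
    """One-sample test statistic: (x_bar - mu0) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

def is_significant(z, critical=1.96):
    """Reject the null hypothesis at the 0.05 level (two-sided) only if |z| > 1.96."""
    return abs(z) > critical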

The MAP comparison of the proposed model with the other models is shown in Table 8. The query-by-query analysis of the proposed model versus the original query, query expansion using the Bo1 model, the Bo2 model, the chi-square model, the KL divergence model, and the BM25 model is shown in Figures 3–8, respectively. In these figures, the x-axis represents the query number and the y-axis represents the MAP value.

6. Discussion

Tables 9–13 show that the proposed model has a significant improvement over the original query, query expansion using the Bo1 model, the Bo2 model, the chi-square model, and the KL divergence model, respectively. The proposed model achieves improvements of 19.77%, 8.05%, 8.08%, 22.52%, and 7.56% in comparison to the original query, query expansion using the Bo1 model, the Bo2 model, the chi-square model, and the KL divergence model, respectively. From Figures 3 to 7, it is clear that, out of fifty queries, the proposed model performs better on 38, 34, 32, 40, and 34 queries with respect to the original query, query expansion using the Bo1 model, the Bo2 model, the chi-square model, and the KL divergence model, respectively. A sample query and its expansion terms are shown in Table 14.

7. Conclusion

A relevance feedback-based semantic model using the skip-gram architecture has been developed to improve the performance of the retrieval system. The proposed model has been compared with other state-of-the-art models, and the experimental results show that the proposed model has a significant improvement over them. The mean average precision of the proposed model is 0.3568, which is an improvement of 19.77%, 8.05%, 8.08%, 22.52%, and 7.56% compared to the original query, the Bo1 model, the Bo2 model, the chi-square model, and the KL divergence model, respectively. Out of fifty queries, the proposed model performs better on 38, 34, 32, 40, and 34 queries compared to the original query, the Bo1 model, the Bo2 model, the chi-square model, and the KL divergence model, respectively. The proposed model also retrieves 12 additional relevant documents compared to the original query. In the near future, we will try to further improve the results by using other neural network architectures.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

One of the authors is pursuing a full-time Ph.D. in the Department of Mathematics, Bio-informatics, and Computer Applications, Maulana Azad National Institute of Technology (MANIT), Bhopal (MP), India. He expresses sincere thanks to the Institute for providing him the opportunity to pursue his Ph.D. work. The author also thanks the Forum for Information Retrieval Evaluation (FIRE) for providing the dataset used in his experimental work.