Abstract

In the field of information retrieval, most pseudo-relevance feedback models select candidate terms from the top documents returned by the first-pass retrieval but cannot identify the reliability of these documents. This paper proposes a new approach to obtain feedback information more comprehensively by constructing four corresponding models. First, we incorporate topic-based relevance information into the relevance model RM3 and construct a topic-based relevance model, denoted as TopRM3, with two corresponding variants. TopRM3 estimates the reliability of a feedback document in the language modeling framework when executing pseudo-relevance feedback, from both the term and the topic perspectives. Second, we introduce topic-based relevance information into Rocchio's model and construct the corresponding model, denoted as TopRoc, with two corresponding variants. Experimental results on five TREC collections show that the proposed TopRM3 and TopRoc are effective and generally superior to state-of-the-art pseudo-relevance feedback models with optimal parameter settings in terms of mean average precision.

1. Introduction

Pseudo-relevance feedback (PRF) via query expansion (QE) is usually considered a very effective method for achieving good performance in information retrieval (IR) [17]. The PRF model uses an automatic local analysis approach that assumes that the top-ranked documents in the first round of retrieval are relevant and then uses them as feedback documents to redefine the representation of the original query by adding potentially relevant terms, thereby improving the performance of IR.

Although PRF models usually perform very well, they fail in some cases [8–12]. In classic PRF models such as Rocchio's model [8] or the relevance model RM3 [9], all the top feedback documents are assumed to be equally relevant to the queries, and the weights of candidate terms are based only on their salience in the collection. These models select the candidate documents without identifying their reliability. Generally, terms in different feedback documents with the same weights (e.g., the same TF-IDF (term frequency-inverse document frequency) score) are considered equally reliable for QE. Models with classic PRF strategies (e.g., Rocchio and RM3) do not perform well enough when some of the feedback documents cover varied topics, many of which are not related to the original queries [10–12]. In this case, many unrelated terms from these off-topic portions are also added into the new query representation, which hurts the retrieval performance of the second-pass retrieval.

Recently, researchers have begun to apply topic models [13–17] to PRF in order to obtain feedback terms from the most relevant topic(s). However, most of them still select candidate terms from the top documents returned by the first-pass retrieval without identifying the reliability of these documents. A major obstacle for this application is that the original query is generally short and its topics are fuzzy. To address this problem, Miao et al. [10] proposed a probabilistic framework, TopPRF, which integrates "topic space" (TS) information into Rocchio's model and introduces the reliability of the feedback documents by considering the relevance between the top-3 documents and the rest of the documents.

As the discussion above shows, most researchers select candidate terms without identifying the reliability of the top documents returned by the first-pass retrieval. In this work, we propose to estimate the reliability of a feedback document by introducing the topic-based relevance between feedback documents into the language modeling approach. Different from the work of Miao et al. [10], our method can be considered a generalized approach that can be incorporated into any other PRF model.

The remainder of this paper is organized as follows. In Section 2, we propose topic-based relevance models for pseudo-relevance feedback. In Section 3, we set up our experimental environment on five TREC [18] collections. In Section 4, we present and discuss the experimental results. Finally, we conclude our work briefly and present future research directions in Section 5.

2. Topic-Based Relevance Model

In this section, we propose the topic-based relevance models TopRM3 and TopRoc. We first briefly introduce the traditional relevance model RM3 and present the adaptation of the relevance model to topic-based relevance information. Then, we demonstrate in detail how to derive the topic-based relevance under the two investigated normalization schemes. In addition, we integrate our topic-based relevance into Rocchio's model.

2.1. Relevance Model

Relevance models are classic frameworks for implementing (pseudo-)relevance feedback via improving the query representation. They incorporate (pseudo-)relevance feedback information into the query language model. Relevance models do not explicitly model the relevant or pseudo-relevant documents; instead, they model a more generalized notion of relevance $R$. The weight of a candidate term $w$ in relevance model RM1 [9] is

$$P(w \mid \theta_R) = \sum_{D \in F} P(w \mid \theta_D)\, P(\theta_D \mid Q)$$

where $Q$ is a query, $D$ is a document in a feedback document set $F$, $\theta_D$ is a document language model, and $P(\theta_D \mid Q)$ is estimated from the query language model via the document's retrieval score. Many other relevance feedback techniques and algorithms [19, 20] have been successfully presented under the relevance model framework. In this paper, we take advantage of this framework but utilize a topic-based relevance $\mathrm{rel}(D)$ of a document instead of $P(\theta_D \mid Q)$ estimated from the document score; this is a relative relevance within the feedback documents rather than an absolute score. Our proposed topic-based relevance model is as follows:

$$P(w \mid \theta_R) = \sum_{D \in F} P(w \mid \theta_D)\, \mathrm{rel}(D)$$
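To make the term-weighting step concrete, here is a minimal sketch of the topic-based RM1 weighting above, assuming maximum-likelihood estimates for $P(w \mid \theta_D)$; the function name and data layout are our illustrative choices, not the authors' implementation.

```python
from collections import Counter

def topic_rm1_weights(feedback_docs, rel):
    """Score each candidate term w by sum_{D in F} P(w|theta_D) * rel(D).

    feedback_docs: list of token lists (the top-ranked documents F)
    rel: list of topic-based relevance scores, one per document
    """
    weights = Counter()
    for doc, r in zip(feedback_docs, rel):
        tf = Counter(doc)
        doc_len = sum(tf.values())
        for w, count in tf.items():
            # MLE estimate of P(w|theta_D), weighted by rel(D)
            weights[w] += (count / doc_len) * r
    return weights
```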

The topic-based relevance $\mathrm{rel}(D)$ of a document in the feedback document set $F$ will be discussed in detail in the next section. Similarly, following the RM3 [9] variant of the relevance model, we make a linear combination of the original query model $\theta_Q$ and the topic-based relevance model $\theta_R$ to estimate the feedback language model $\theta_{Q'}$. The corresponding formula is as follows:

$$P(w \mid \theta_{Q'}) = (1 - \lambda)\, P(w \mid \theta_Q) + \lambda\, P(w \mid \theta_R)$$

where $\lambda \in [0, 1]$ is the feedback interpolation coefficient.
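A small sketch of this interpolation step, assuming both models are represented as dictionaries mapping terms to probabilities (names are ours):

```python
def interpolate_rm3(query_model, feedback_model, lam):
    """RM3-style combination:
    P(w|theta_Q') = (1 - lam) * P(w|theta_Q) + lam * P(w|theta_R)."""
    vocab = set(query_model) | set(feedback_model)
    return {w: (1 - lam) * query_model.get(w, 0.0)
               + lam * feedback_model.get(w, 0.0)
            for w in vocab}
```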

Lv and Zhai [16] systematically compared five state-of-the-art approaches for estimating query language models in ad hoc retrieval, among which RM3 not only yields impressive performance in both precision and recall but also performs stably. In particular, the Dirichlet prior is used for smoothing document language models [21]. In this paper, we adopt the strategies mentioned above and present our model, denoted as TopRM3.
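For reference, Dirichlet-prior smoothing [21] estimates $P(w \mid \theta_D) = (c(w, D) + \mu P(w \mid C)) / (|D| + \mu)$, where $P(w \mid C)$ is the term's collection probability. A minimal sketch (function name and argument layout are ours):

```python
def dirichlet_smoothed_prob(term, doc_tf, doc_len, coll_prob, mu=1000.0):
    """Dirichlet-prior smoothed document language model:
    P(w|theta_D) = (c(w, D) + mu * P(w|C)) / (|D| + mu)."""
    return (doc_tf.get(term, 0) + mu * coll_prob) / (doc_len + mu)
```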

2.2. Topic-Based Relevance

Given a query $Q$, each document in the top feedback document set $F$ returned by the first-pass retrieval can be represented by its topic distribution $\phi_D$ in the topic space. However, directly measuring the topic-based relevance between a query and a document fails because the topic distribution of a short query is usually very sparse and coarse. Previous work [10] found that the top-ranked documents are most likely relevant to the query topics, while the relevant documents for a particular query sometimes cover several topics. To balance document relevance and topic diversity, we set the number of such top documents to three in our research and view the topics in these top documents as the query topics.
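The paper does not name a specific LDA toolkit; as a minimal illustration of mapping feedback documents into the topic space, the following sketch uses scikit-learn's LatentDirichletAllocation as a stand-in. All names here are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_distributions(feedback_texts, n_topics=10):
    """Represent each feedback document (a raw string) as a
    distribution over n_topics topics via LDA."""
    counts = CountVectorizer(stop_words="english").fit_transform(feedback_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(counts)  # shape: (|F|, n_topics)
    return theta / theta.sum(axis=1, keepdims=True)  # normalize rows
```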

First, we measure the similarity between the topic vectors of two documents via the cosine formula. Thus, the topic similarity of documents $D_i$ and $D_j$ is as follows:

$$\mathrm{Sim}(D_i, D_j) = \frac{\sum_{t=1}^{T} \phi_{D_i,t}\,\phi_{D_j,t}}{\sqrt{\sum_{t=1}^{T} \phi_{D_i,t}^{2}}\,\sqrt{\sum_{t=1}^{T} \phi_{D_j,t}^{2}}}$$

where $t$ indexes the $T$ topics and $\phi_{D_i,t}$ is the probability of topic $t$ in the topic distribution $\phi_{D_i}$.
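A direct sketch of this cosine similarity over two topic vectors (helper name is ours):

```python
import numpy as np

def topic_sim(theta_i, theta_j):
    """Cosine similarity between the topic vectors of two documents."""
    num = float(np.dot(theta_i, theta_j))
    den = np.linalg.norm(theta_i) * np.linalg.norm(theta_j)
    return num / den if den > 0 else 0.0
```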

Then, with the number of top documents set to 3, the topic similarity $\mathrm{TS}(D_j)$ of a feedback document $D_j$ is calculated as follows:

$$\mathrm{TS}(D_j) = \begin{cases} 1, & j \le 3 \\ \dfrac{1}{3}\displaystyle\sum_{i=1}^{3} \mathrm{Sim}(D_i, D_j), & j > 3 \end{cases}$$

where the scores of the top three documents are set to 1 because they are supposed to be relevant to the query.

Finally, we transform the topic similarity $\mathrm{TS}(D)$ into the topic-based relative relevance $\mathrm{rel}(D)$ of the feedback documents. Since $\mathrm{rel}(D)$ is a distribution over the feedback documents, we employ two normalization schemes, one of which is a linear method and the other the softmax method commonly used in machine learning:

$$\mathrm{rel}_{L}(D_j) = \frac{\mathrm{TS}(D_j)}{\sum_{D_i \in F} \mathrm{TS}(D_i)} \qquad (6)$$

$$\mathrm{rel}_{S}(D_j) = \frac{\exp(\mathrm{TS}(D_j))}{\sum_{D_i \in F} \exp(\mathrm{TS}(D_i))} \qquad (7)$$

where equation (6) represents the linear method and equation (7) represents the softmax method.
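The sketch below pulls Section 2.2 together: it computes $\mathrm{TS}(D)$ against the top-3 documents and then applies either normalization. It reuses the topic_sim helper from the earlier sketch; averaging the three similarities and the function name are our illustrative choices.

```python
import numpy as np

def topic_relevance(theta, scheme="linear"):
    """Compute TS for every feedback document against the top-3
    documents, then normalize into rel(D) per eq. (6) or (7).

    theta: (|F|, T) array of per-document topic distributions,
           rows ordered by first-pass retrieval rank."""
    n = theta.shape[0]
    ts = np.ones(n)  # top-3 documents are fixed at 1
    for j in range(3, n):
        ts[j] = np.mean([topic_sim(theta[i], theta[j]) for i in range(3)])
    if scheme == "linear":                    # equation (6)
        return ts / ts.sum()
    return np.exp(ts) / np.exp(ts).sum()      # equation (7), softmax
```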

Using these two variants, we present our topic-based relevance models, denoted as TopRM3-L and TopRM3-S, respectively.

2.3. Topic-Based Rocchio’s Model

We further integrate our topic-based relevance information into Rocchio's model, in a way that differs from the model in [10]. The topic-based Rocchio's model TopRoc can be described as follows (a minimal code sketch is given after this list):

(1) All documents are ranked for the given query using a particular IR model; we use BM25 in the first-pass retrieval. The $|F|$ highest-ranked documents are identified as the pseudo-relevance set $F$.

(2) Each term in the $|F|$ highest-ranked documents is assigned an expansion weight. In general, the expansion weight is the product of the weight provided by a weighting model and the topic-based relevance of the document. The TF-IDF weighting model [22] is used as the weighting model in this article.

(3) The vector of query term weights is a linear combination of the initial query term weights and the expansion weights:

$$\vec{q}^{\,1} = \alpha\,\vec{q}^{\,0} + \beta \sum_{D \in F} \mathrm{rel}(D)\,\vec{d}_D$$

where $\vec{q}^{\,0}$ and $\vec{q}^{\,1}$ represent the original and first-iteration query vectors, $\vec{d}_D$ is the TF-IDF weight vector for the feedback document $D$, $\mathrm{rel}(D)$ is the topic-based relevance of the feedback document $D$, $F$ is the feedback document set for PRF, and $\alpha$ and $\beta$ are tuning constants controlling how much we rely on the original query and the feedback information. In practice, we can always fix $\alpha$ at 1 and only tune $\beta$ to get better performance.
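As referenced above, here is a minimal sketch of the TopRoc update under these definitions, assuming the query and document vectors are aligned NumPy arrays over a common vocabulary; beta = 0.4 is only an illustrative default to be tuned.

```python
import numpy as np

def toproc_update(q0, doc_vectors, rel, alpha=1.0, beta=0.4):
    """TopRoc query update: q1 = alpha * q0 + beta * sum_D rel(D) * d_D,
    where d_D is the TF-IDF weight vector of feedback document D."""
    expansion = sum(r * d for r, d in zip(rel, doc_vectors))
    return alpha * np.asarray(q0) + beta * expansion
```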

If $\mathrm{rel}(D)$ follows a uniform distribution, the TopRoc model reduces to the original Rocchio's model. We can also use the two variants of $\mathrm{rel}(D)$ defined in Section 2.2; the corresponding models are denoted as TopRoc-L and TopRoc-S, respectively.

3. Experimental Settings

To verify the effectiveness of our proposed method, we test our proposed models on five public TREC collections with ad hoc queries: DISK1&2 with queries 51–150, DISK4&5 with queries 401–450, WT2G with queries 401–450, WT10G with queries 451–550, and GOV2 with queries 701–850. These collections differ in size and genre. In our experiments, we use only the title field of the TREC queries for retrieval, because search engine users typically type short queries to express their intent, and we remove queries without relevance judgments. In addition, we remove standard English stopwords and apply Porter's English stemmer to every term in all collections. We use the official TREC evaluation measure, mean average precision (MAP), to evaluate the effectiveness of our proposed models. All statistical tests are based on the Wilcoxon matched-pairs signed-rank test.

In our topic-based experiments, we compare our proposed methods with LM, BM25, and two state-of-the-art PRF models, RM3 and Rocchio's model. Although several other effective PRF methods exist, this paper studies the effect of the proposed topic-based relevance within RM3 and Rocchio's model, so we do not consider them. To make a fair comparison, we use the following parameter settings for both the baselines and our proposed topic-based models, which are popular in the IR domain for building strong baselines.

First, in LM, the Dirichlet smoothing parameter $\mu = 1000$ was shown in [23] to achieve the best MAP for most collections. In BM25, setting $k_1$, $k_3$, and $b$ to 1.2, 8.0, and 0.35, respectively, gave the best MAP for most collections in [24].

Second, for the parameters in the PRF models, the number of expansion terms is fixed to 30. We sweep over the number of top feedback documents $|F| \in \{10, 20, 30, 50\}$, the interpolation parameter $\lambda \in \{0.0, 0.1, \ldots, 1.0\}$ in the relevance models, and $\beta \in \{0.0, 0.1, \ldots, 1.0\}$ in Rocchio's models.

Third, in LDA [11], the number of topics is suggested to be 5, 10, or 20 [10]. All experimental results are evaluated through two-fold cross-validation: on each collection, the TREC queries are partitioned into two sets by the parity of their query numbers, and parameters trained on one set are applied to the other, and then vice versa.
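A tiny sketch of this parity-based split (the function name is ours):

```python
def parity_folds(query_ids):
    """Split TREC queries into two folds by the parity of their numbers;
    parameters tuned on one fold are evaluated on the other, and vice versa."""
    odd = [q for q in query_ids if int(q) % 2 == 1]
    even = [q for q in query_ids if int(q) % 2 == 0]
    return [(odd, even), (even, odd)]  # (train, test) pairs
```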

4. Experimental Results and Analysis

All experimental results are shown in Table 1. RM3 denotes the relevance model with LM as the first-pass retrieval model, and Rocchio denotes BM25+Rocchio.

RM3 outperforms LM on all collections, and Rocchio’s model outperforms BM25. This indicates that RM3 and Rocchio’s model are still very strong baselines for IR research work. Compared to Rocchio’s model, RM3 performs more steadily when the number of feedback documents changes.

In Table 1, our proposed TopRM3 models outperform RM3, and the TopRoc models outperform Rocchio's model, on all collections. In particular, TopRM3-L and TopRoc-L significantly improve over RM3 and Rocchio's model, respectively. TopRoc-L obtains the best MAP in most cases.

We also compare our proposed TopRoc with TS-COS, the most effective variant of TopPRF [10], which is considered one of the most effective state-of-the-art PRF models. To make a fair comparison, we compare their percentage improvements over the corresponding Rocchio's model; the results are presented in Table 2. From Table 2, our proposed TopRoc-L achieves the greater improvement in most cases, which indicates that our proposed topic-based relevance is more effective.

Furthermore, we study the sensitivity of our proposed TopRM3-L to both the number of feedback documents and the feedback interpolation coefficient $\lambda$ in terms of MAP. The results are presented in Figures 1 and 2. Figure 1 shows that the performance of TopRM3-L varies only slightly as the number of feedback documents changes, and that setting the number of feedback documents to 20 gives overall good performance. Figure 2 demonstrates that TopRM3-L behaves similarly on all five collections, and the feedback coefficient is suggested to be around 0.6.

5. Conclusion and Future Work

In this paper, we propose four topic-based relevance models for PRF by integrating topic-based relevance information from the feedback documents. Specifically, we present TopRM3-L, TopRM3-S, TopRoc-L, and TopRoc-S for enhancing PRF via topic modeling. Each model captures relevance information at both the term and topic levels. According to the experimental results, our models are effective and outperform the corresponding strong baselines on all collections in terms of MAP.

Meanwhile, our proposed models also outperform the state-of-the-art RM3 and Rocchio's models. Additionally, we analyze the influence of the feedback coefficient on our proposed models and their performance under different numbers of feedback documents.

The findings of this paper can be applied in other natural language processing and information system fields, such as information search and recommendation systems, as well as in medical information processing fields such as clinical and genomics information retrieval, to measure the relevance of documents more accurately.

In future work, we will integrate our proposed topic-based relevance with state-of-the-art proximity-based PRF models, such as PLM and PRoc. It would also be interesting to conduct an in-depth study combining traditional term-based and topic-based relevance information.

Data Availability

The data used to support the findings of this study are available from the following website: https://trec.nist.gov/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported in part by the general program of the National Natural Science Foundation of China (nos. 62076215 and 61671105) and the Yancheng Institute of Technology High-Level Talent Research Initiation Project (XJR2022001).