Abstract

Due to the ambiguity and imprecision of keyword queries over relational databases, research on keyword query expansion has attracted wide attention. Existing query expansion methods capture users’ query intention to a certain extent, but most of them cannot balance precision and recall. To address this problem, a novel two-step query expansion approach based on query recommendation and query interpretation is proposed. First, a probabilistic recommendation algorithm is put forward by constructing a term similarity matrix and a Viterbi model. Second, by using the translation algorithm of triples and the construction algorithm of query subgraphs, query keywords are translated into query subgraphs with structural and semantic information. Finally, experimental results on a real-world dataset demonstrate the effectiveness and rationality of the proposed method.

1. Introduction

In recent years, keyword query over relational databases has been widely adopted due to its simplicity and ease of use [1, 2]. Although this method does not require users to know the underlying structure of the database or a structured query language (e.g., SQL), its semantics are fuzzy and its expressive power is limited due to the lack of structure [3]. In addition, ordinary users are often unable to specify exact keywords that describe their query intention, which makes it harder to return adequate results through keyword query [4, 5]. For these reasons, the precision and recall of keyword query methods cannot be effectively guaranteed [6]. Therefore, query expansion has become an important research branch: it interprets a query more completely and precisely and thereby improves the recall and precision of query results [7–10].

Problem and Motivation. Query expansion provides additional descriptions of the information requirement in order to improve query performance. First, we formally define the problem of query expansion as follows.

Definition 1 (query expansion). Given an input keyword query q over a relational database, query expansion is to find the related queries q' with the largest sim(q, q'), where sim(q, q') evaluates the relevance between q and q'.
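Using generic notation (q for the input query and sim for the relevance function; these symbols are not taken from the original text), the top-k form of this definition can be written as:

    \[ Q'_{k} = \operatorname*{arg\,max}_{\{q'_1, \dots, q'_k\}} \; \sum_{i=1}^{k} \mathrm{sim}(q, q'_i) \]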

Then, we illustrate the research motivation of this paper with the following examples and analysis. Although numerous query expansion methods can be found in the literature, these studies suffer from two main limitations. First, existing approaches do not consider the relationship between keywords, which leads to low query precision. Second, most existing work neglects the similar words or related items of query keywords, leading to low query recall. On the one hand, Meng et al. propose a semantic approximate keyword query method based on keyword and query coupling relationships [11]. This method partly solves the problems of semantic fuzziness and limited expressiveness, but it analyzes the keyword and query coupling relationships through the query history. When the query history of the database is incomplete or even missing, the method cannot conduct semantic analysis normally. Additionally, the queries obtained by this method contain only content information related to the keywords rather than structure information between keywords, which affects the query precision. For example, suppose a user issues the keyword query “Machine learning Arthur Samuel” in DBLP to retrieve the paper “Machine learning” published by Arthur Samuel. Using the above expansion method, the expanded query is “Machine learning Arthur Samuel SIGMOD.” The expansion extends the query with content but lacks structure information. Consequently, all tuples containing “Machine learning,” “Arthur Samuel,” or “SIGMOD” will be returned. Obviously, such results are not precise enough and many of them may not be useful. The method proposed in this paper extends the initial query with both content and structure related to the keywords. An expanded query is shown in Figure 1.

It can express the potential semantic and structural information of the original keyword query: find the paper “Machine learning” published in SIGMOD and authored by Arthur Samuel, and simultaneously find the paper “Markov,” which cites the paper “Machine learning” and is authored by Arthur Samuel. In contrast to the query expansion in [11], the method proposed in this paper extends the original query to query subgraphs that reflect the underlying structure of the database and thus further improves the query precision. On the other hand, Ganti et al. [12] translate keyword queries into SQL based on materialized mappings, and Bergamaschi et al. [3] propose a Metadata method that translates keyword queries into SQL based on the Munkres algorithm. Though these methods describe users’ query intention to a degree, they do not take similar words and related items into account. Thus these methods achieve relatively high query precision, but their recall needs to be further improved. For example, assume that a user hopes to study methods of data analysis. Because of his limited knowledge, the user accesses the DBLP database and submits the keyword query “Machine learning Arthur Samuel,” and the expanded query obtained via the above methods is as follows: select R3.Title from Author R1, PaperAuthor R2, Paper R3 where R2.Pid=R3.Pid and R2.Aid=R1.Aid and match(R3.Title) against (“Machine learning” in Boolean mode) and match(R1.Name) against (“Arthur Samuel” in Boolean mode). However, the user may also intend to retrieve papers whose topics are similar or relevant to “Machine learning,” such as the paper “data mining” published by Micheline Kamber. Since the paper “data mining” cites the paper “Machine learning” and the two are closely related, the user is also interested in it. The results of the query expansion method proposed in this paper are shown in Figure 2. Compared with the previous result, these results not only capture the structural relationship between keywords but also contain queries related or similar to the query keywords.

The analysis and examples above illustrate that the key challenge is to develop an approach that balances query precision and recall. In this paper we focus on how to tackle the above two limitations and thereby improve the performance of keyword query expansion. We propose a novel query expansion method, ReInterpretQE, based on query recommendation and interpretation, which extends a keyword query to a list of query subgraphs. These subgraphs better capture users’ information needs and the possible semantics of the keyword query and then guide users to explore related tuples in the relational database. First, we construct a term similarity matrix based on the tuple information contained in the relational database. Second, we perform query recommendation using the similarity matrix and dynamic programming to obtain a query list, which consists of the top-k queries related to the initial query. Finally, we transform the keyword queries into query subgraphs. Through query recommendation and interpretation, the query expansion method improves the recall and precision of query results.

The main contributions of this paper are summarized as follows:
(1) We present a keyword query expansion paradigm, ReInterpretQE, for relational databases, which is based on query recommendation and interpretation.
(2) We design a probabilistic recommendation algorithm based on a similarity matrix and dynamic programming.
(3) We propose a keyword query interpretation method that uses statistical information and the schema graph of the database to translate a keyword query into query subgraphs.
(4) We conduct extensive experiments on the DBLP dataset, and the experimental results demonstrate the effectiveness and feasibility of the proposed method.

The paper is organized as follows: Section 1 introduces the research motivation, main contributions, and structure of this paper; Section 2 reviews the related work; Section 3 describes the architecture of ReInterpretQE and then provides details of the algorithms for query recommendation and query interpretation; Section 4 reports experiments on a real dataset and compares the experimental results; Section 5 concludes this paper and outlines future work.

2. Related Work

This section discusses related research work. It mainly includes the following two parts: keyword query in relational databases and keyword query expansion.

2.1. Keyword Query in Relational Databases

Recently, keyword query methods in relational databases have been studied extensively [2, 13–15]. According to the different ways of modeling the database, the existing query methods can be divided into two main categories: schema graph-based methods and data graph-based methods. In [16–21], the database is modeled as a schema graph, where nodes represent tables and edges represent primary-foreign-key relationships. These methods enumerate all possible CNs (Candidate Networks) based on the schema graph. Although they store abstract structural information and require little memory, the generation of CNs is very time-consuming. Correspondingly, in [22–27] the database is modeled as a data graph, where nodes represent tuples and edges represent primary-foreign-key relationships between tuples. These methods identify the minimal connection trees that contain the keywords on the data graph. A major challenge of these methods is the maintenance of the data graph: the whole data graph needs to be reconstructed when the data in the database changes, which is often time-consuming. Therefore, it is important to study the dynamic update of databases and design incremental query methods based on the dynamic construction of the data graph.

2.2. Keyword Query Expansion Paradigm

In the field of keyword query in relational databases, the majority of existing studies have focused on improving the efficiency of query algorithms, while effective query preprocessing has yet to be investigated. Since keyword queries lack structural information and users tend to select inappropriate keywords, semantic fuzziness becomes an urgent problem to solve. The method in [28] expands the original query by using query log mining techniques, but it is not applicable to relational databases. Reference [7] selects the related query expression using user feedback. The results obtained by this query expression are more consistent with users’ query intention; however, it requires human interaction, so its efficiency is low. In [3], the keyword query is transformed into an SQL statement based on the Munkres algorithm, which provides possible semantic descriptions. This method is useful for identifying users’ query intention. Nevertheless, it does not consider the multiple possible connections between keywords (i.e., there are various explanations for a keyword query). In addition, it does not take into account the similar words and related items that can better express users’ intention. An analysis model of coupling relationships is presented in [11]. It extracts semantic relations based on query history. However, this model also has great limitations: when the query logs are missing or users’ preferences change frequently, the model cannot be applied effectively. Therefore, this paper proposes a query expansion method, ReInterpretQE, which consists of two steps: query recommendation and query interpretation. First, we construct a similarity matrix based on the structure and content information in the database and put forward a probabilistic recommendation algorithm using dynamic programming. Second, we propose a keyword query interpretation method to transform the keywords into subgraphs based on the statistics and schema graph of the database. Experimental results show that both the recall and precision of query results are improved by the proposed query expansion method.

3. Query Expansion Approach

3.1. Overview of ReInterpretQE

To solve the two problems of semantic fuzziness and limited expressiveness, this paper proposes a novel two-step query expansion method, ReInterpretQE, based on query recommendation and query interpretation. The goal of query recommendation is to extend the initial query to a list of related keyword queries, so that the query results are more comprehensive and better meet users’ demands. Query interpretation translates the list of keyword queries into query subgraphs, which can locate the query results more precisely. Together they are designed to improve the recall and precision of query results. Figure 3 shows the architecture of ReInterpretQE, which is divided into two main phases: query recommendation and query interpretation.

Phase 1 (query recommendation). In the process of query recommendation, the intrasimilarity and intersimilarity between terms are calculated using structure information, content information, and word cooccurrence, and a similarity matrix is constructed from these two similarities. Then the idea of dynamic programming is used to build a Viterbi model, so that a keyword query can be extended to a keyword query list. The query list produced by the query recommendation process is semantically related to the original query.

Phase 2 (query interpretation). In the process of query interpretation, an algorithm is put forward for the translation from keywords to triples. Then query subgraphs are built for each query in the query list using the schema graph of the database. The implementation details of the algorithms are introduced in the subsequent sections.
First, Section 3.2 describes how to recommend a list of queries with the same or similar meaning according to the structure and content information in the database. This problem is solved by the construction of the term similarity matrix and the probabilistic recommendation algorithm. The method makes the query results contain more of the information users want to obtain, so the query recall can be further improved. Section 3.3 puts forward a two-step query interpretation method: the keyword query is translated into query subgraphs with potential structural information through Step 1, translation from keywords to triples, and Step 2, construction of query subgraphs. With this query interpretation process, the query precision is improved effectively.

3.2. Query Recommendation

It is difficult for a common user to specify proper keywords that express the query intention; as a result, much information related to the query cannot be returned in the results. This section presents a query recommendation method that extends the original query and derives a series of queries with semantics similar to the original query, thus improving the recall of the query results. Assume that a user submits a query. First, Section 3.2.1 constructs the term similarity matrix to find the top-k keywords related to each keyword in the original query. Second, Section 3.2.2 proposes a probabilistic recommendation algorithm that uses dynamic programming to build a Viterbi model for extending the original keyword query into a query list.

3.2.1. Construction of Term Similarity Matrix

In the construction phase of the term similarity matrix, this paper calculates the similarity between keywords from two aspects, intrasimilarity and intersimilarity, based on the structure and content information of the database.

(a) Intrasimilarity. In information retrieval, if two keywords often appear in the same documents, the keywords are regarded as semantically related. In relational databases, tuples are generally taken as virtual documents. Similarly, the higher the cooccurrence of two keywords in the same tuples, the higher the degree of similarity between the two keywords. The intrasimilarity between two keywords k_i and k_j is measured by the Jaccard similarity coefficient, as shown in formula (1), sim_intra(k_i, k_j) = |T(k_i) ∩ T(k_j)| / |T(k_i) ∪ T(k_j)|, where T(k_i) and T(k_j) are the tuple sets containing k_i and k_j, respectively.

We can obtain the intrasimilarity between any two keywords in the database by formula (1). Example 2 further illustrates the calculation process.

Example 2. To facilitate the explanation of the approach, we simplify the structure and content of the database as shown in Table 1. There is a data table with four tuples in the DBLP database, and we use seven keywords to represent database, query, structured information, statistical analysis, machine learning, probability, and data mining. Applying formula (1) to every pair of keywords, we can calculate the intrasimilarity between any two keywords and obtain the intrasimilarity matrix.

(b) Intersimilarity. The intrasimilarity reflects the direct semantic similarity according to the cooccurrence of keywords. In addition, there is also indirect semantic similarity between keywords. For example, even if two keywords do not appear in the same tuple, they may have indirect similarity via a third keyword that cooccurs with each of them in some tuple. Such a keyword is called a semantic associative term, and the two keywords are then said to have indirect semantic similarity. Next, we detail the computation of the intersimilarity.

Suppose keywords k_i and k_j belong to different tuples and their set of semantic associative terms is A. For any associative term k_m in A, the intersimilarity between k_i and k_j via k_m is defined in terms of the intrasimilarities sim_intra(k_i, k_m) and sim_intra(k_m, k_j).

The overall intersimilarity between k_i and k_j is then obtained by combining the contributions of all associative terms in A.
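One plausible formulation, assuming that the contribution of each associative term is the product of the two intrasimilarities along the path and that the overall intersimilarity averages these contributions over all associative terms (the exact aggregation used in the original formulas may differ), is:

    \[ \mathrm{sim}_{inter}(k_i, k_j \mid k_m) = \mathrm{sim}_{intra}(k_i, k_m) \cdot \mathrm{sim}_{intra}(k_m, k_j) \]
    \[ \mathrm{sim}_{inter}(k_i, k_j) = \frac{1}{|A|} \sum_{k_m \in A} \mathrm{sim}_{inter}(k_i, k_j \mid k_m) \]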

Example 3. We again take the DBLP database in Table 1 as an example. From Example 2 we know the intrasimilarity between any two keywords; two keywords that never cooccur in a tuple may still be related via one or more associative terms. Computing the contribution of each associative term and combining these contributions yields the intersimilarity between the two keywords, and repeating this calculation for all keyword pairs gives the intersimilarity matrix of the DBLP database in Table 1.

(c) Construction of Term Similarity Matrix. Formula (10) integrates the intrasimilarity and intersimilarity to calculate the similarity between any two keywords, where a balance factor adjusts the contribution of the two similarities to the final result. According to the parameter setting experiment in Section 4.2, the precision of the term similarity calculation reaches its maximum when the balance factor equals 0.5. With this setting we obtain the term similarity matrix of the DBLP database in Table 1.
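A plausible form of formula (10), writing the balance factor as α (the symbol is an assumption introduced here), is the convex combination:

    \[ \mathrm{sim}(k_i, k_j) = \alpha \, \mathrm{sim}_{intra}(k_i, k_j) + (1 - \alpha) \, \mathrm{sim}_{inter}(k_i, k_j) \]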

When a user submits a query, we can obtain, for each keyword in the original query, a list of related keywords according to the term similarity matrix.
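To make the construction concrete, the following Java sketch computes the intrasimilarity (formula (1)), the intersimilarity under the product-and-average assumption stated above, and their weighted combination, and then returns the top-k keywords related to a given term. The class name, data structures, and scoring details are illustrative assumptions rather than the paper's actual implementation.

    import java.util.*;

    public class TermSimilarity {
        // keyword -> identifiers of the tuples that contain it (the "virtual documents")
        private final Map<String, Set<Integer>> tupleSets;
        private final double alpha; // balance factor (0.5 in the experiments of Section 4.2)

        public TermSimilarity(Map<String, Set<Integer>> tupleSets, double alpha) {
            this.tupleSets = tupleSets;
            this.alpha = alpha;
        }

        // Intrasimilarity: Jaccard coefficient of the two tuple sets
        public double intra(String ki, String kj) {
            Set<Integer> a = tupleSets.getOrDefault(ki, Set.of());
            Set<Integer> b = tupleSets.getOrDefault(kj, Set.of());
            if (a.isEmpty() || b.isEmpty()) return 0.0;
            Set<Integer> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<Integer> union = new HashSet<>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }

        // Intersimilarity: average, over associative terms, of the product of the two intrasimilarities
        public double inter(String ki, String kj) {
            double sum = 0.0;
            int count = 0;
            for (String km : tupleSets.keySet()) {
                if (km.equals(ki) || km.equals(kj)) continue;
                double left = intra(ki, km), right = intra(km, kj);
                if (left > 0 && right > 0) { // km cooccurs with both keywords: associative term
                    sum += left * right;
                    count++;
                }
            }
            return count == 0 ? 0.0 : sum / count;
        }

        // Overall similarity: convex combination controlled by the balance factor
        public double sim(String ki, String kj) {
            return alpha * intra(ki, kj) + (1 - alpha) * inter(ki, kj);
        }

        // Top-k keywords most similar to the given query keyword
        public List<String> topK(String keyword, int k) {
            return tupleSets.keySet().stream()
                    .filter(t -> !t.equals(keyword))
                    .sorted(Comparator.comparingDouble((String t) -> sim(keyword, t)).reversed())
                    .limit(k)
                    .toList();
        }
    }

Calling topK for every keyword of the submitted query yields the lists of related keywords that the recommendation algorithm in Section 3.2.2 consumes.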

3.2.2. Probabilistic Recommendation Algorithm

This section presents a probabilistic recommendation algorithm. We build a Viterbi model using dynamic programming and generate the query list related to the input query, as shown in Algorithm 1.

Input: Keyword query , list of related keywords
Output: , where
Method:
Delete the redundant related keywords
     
Build Viterbi model
     
     
     
Initialize
     //Variable is the maximum probability in all paths whose state is at time .
     //Variable the th node of path with the maximum probability in state at time .
     
     
Loop
     For
     
     
End
     Set
     Set
Paths backtracking
     For
     
Return keyword queries
     
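As a concrete illustration of the dynamic-programming idea behind Algorithm 1, the Java sketch below runs a Viterbi-style recursion in which the candidates at each position are the corresponding query keyword and its related keywords, the emission score of a candidate is its similarity to the original keyword, and the transition score between consecutive candidates is their term similarity. The scoring choices, class names, and interfaces are assumptions for illustration, not the paper's exact formulation.

    import java.util.*;

    public class ViterbiRecommender {

        /** Minimal similarity interface; in ReInterpretQE this would be backed by the term similarity matrix. */
        public interface SimilarityFunction {
            double score(String a, String b);
        }

        /** Returns the highest-scoring keyword sequence over the candidate lists. */
        public static List<String> recommend(List<String> query,
                                             List<List<String>> candidates, // candidates.get(t): keyword t plus its related keywords
                                             SimilarityFunction sim) {
            int n = query.size();
            if (n == 0) return List.of();
            double[][] delta = new double[n][]; // delta[t][i]: best score of any path ending in candidate i at position t
            int[][] psi = new int[n][];         // psi[t][i]: index of the best predecessor of that path

            // Initialization: score each candidate of the first keyword by its similarity to that keyword
            int m0 = candidates.get(0).size();
            delta[0] = new double[m0];
            psi[0] = new int[m0];
            for (int i = 0; i < m0; i++) {
                delta[0][i] = sim.score(query.get(0), candidates.get(0).get(i));
            }

            // Recursion: extend the best path position by position
            for (int t = 1; t < n; t++) {
                int m = candidates.get(t).size();
                delta[t] = new double[m];
                psi[t] = new int[m];
                for (int i = 0; i < m; i++) {
                    String cand = candidates.get(t).get(i);
                    double emit = sim.score(query.get(t), cand);
                    double best = Double.NEGATIVE_INFINITY;
                    int bestPrev = 0;
                    for (int j = 0; j < delta[t - 1].length; j++) {
                        double s = delta[t - 1][j] + sim.score(candidates.get(t - 1).get(j), cand) + emit;
                        if (s > best) { best = s; bestPrev = j; }
                    }
                    delta[t][i] = best;
                    psi[t][i] = bestPrev;
                }
            }

            // Termination and path backtracking
            int last = 0;
            for (int i = 1; i < delta[n - 1].length; i++) {
                if (delta[n - 1][i] > delta[n - 1][last]) last = i;
            }
            LinkedList<String> path = new LinkedList<>();
            for (int t = n - 1; t >= 0; t--) {
                path.addFirst(candidates.get(t).get(last));
                last = psi[t][last];
            }
            return path;
        }
    }

Running this recursion with the top-k candidate lists from Section 3.2.1 yields one recommended keyword query; keeping the k best paths at each position in the same recursion would yield the full recommended query list.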
3.3. Query Interpretation

As a fuzzy query method, keyword query cannot reflect the query intention accurately. This section translates the keyword query into a set of query subgraphs, each of which is a subgraph of the database schema graph, by means of the translation algorithm of triples and the construction algorithm of query subgraphs. Compared with the keyword query, a query subgraph not only contains the content information but also carries structural and semantic information. Thus it can reflect the users’ query intention more accurately.

3.3.1. Translation from Keywords to Triples

In order to identify users’ query intention, we should know exactly the role of a keyword in the database, that is, whether it is metadata or content data. Thus each keyword should be extended to include the table name, attribute names, and attribute values of the table where the keyword is located. This paper proposes a triple structure to describe this information. Before introducing the translation algorithm of triples, we first define the following three statistics tables: the statistics table of table names, the statistics table of attribute names, and the statistics table of attribute values. They record statistics on the table names, attribute names, and attribute values, respectively. These three statistics tables have similar structures, as shown in Tables 2, 3, and 4.
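As a minimal sketch, the triple structure can be represented as a Java record; the record name and the sample values below (drawn from the DBLP example used in Section 1) are illustrative assumptions.

    /** (table name, attribute name, attribute value); a field is null when the keyword does not determine it. */
    public record Triple(String tableName, String attributeName, String attributeValue) { }

    // For the query "Machine learning Arthur Samuel", two plausible triples are:
    //   new Triple("Paper",  "Title", "Machine learning")   // keyword matched a value of Paper.Title
    //   new Triple("Author", "Name",  "Arthur Samuel")      // keyword matched a value of Author.Name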

By analyzing the statistics tables, it is obvious that every keyword in the query may correspond to several different semantic interpretations; namely, a keyword can be translated into a set of triples, and the query can therefore be translated into several sets of triples. In most relational databases, the numbers of table names and attribute names are far smaller than the number of attribute values, so most ambiguities occur during the translation of attribute values. Therefore, when we conduct the translation, we first try to match the keywords to table names and attribute names, and then match the remaining keywords to attribute values to obtain the final sets of triples. Algorithm 2 shows the details of this translation process.

Input:
Output: ;
Method:
    if ( is not Null) then
     for to do
       scan Table ;
       if (matched() == TRUE) then
         .add To();
         .remove();
       end if
     end for
    end if
    if ( is not Null) then
     for to do
      scan Table ;
      IsAN = FALSE;
      foreach AN i in AN do:
       if (matched() == TRUE) then
        .add To();
        IsAN = TRUE;
       end if
      end foreach;
      if (IsAN == TRUE) then
       .remove();
      end if
     end for
    end if
    if ( is not Null) then
       for to do
       scan Table AV;
        IsAV = FALSE;
       foreach AV i in AV do:
         if (matched() == TRUE) then
           ().add To(AVS j);
            IsAV = TRUE;
         end if
       end foreach;
        if (IsAV == TRUE) then
          .remove();
       end if
       end for
    end if
    ;
    for to do
        Lth = Length();
        for to
         
              ;
               is the th
        end for
    end for
    for to do
        Lth = Length();
        for to
         
              ;
                                 is the th
        end for
    end for
    Result: ;

The first three blocks of Algorithm 2 match the keywords in the query against the statistics of table names, the statistics of attribute names, and the statistics of attribute values, respectively, so that the triples corresponding to each keyword in the query can be obtained. As each keyword may correspond to more than one attribute name or attribute value, the final two loops handle these cases, translating the different attribute names or attribute values that correspond to the same keyword into multiple sets of triples. All possible sets of triples are generated by the algorithm.
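The following Java sketch mirrors the matching order of Algorithm 2: keywords are matched first against table names, then attribute names, and finally attribute values, and the remaining ambiguity is expanded into all possible sets of triples. The statistics tables are modeled as simple in-memory structures, and all names are illustrative assumptions.

    import java.util.*;

    public class TripleTranslator {

        public record Triple(String table, String attribute, String value) { }

        private final Set<String> tableNames;                       // statistics table of table names
        private final Map<String, List<String>> attributeNames;     // attribute name -> tables that contain it
        private final Map<String, List<String[]>> attributeValues;  // value -> list of {table, attribute} pairs

        public TripleTranslator(Set<String> tableNames,
                                Map<String, List<String>> attributeNames,
                                Map<String, List<String[]>> attributeValues) {
            this.tableNames = tableNames;
            this.attributeNames = attributeNames;
            this.attributeValues = attributeValues;
        }

        /** Returns every possible set of triples (one interpretation per set) for the query keywords. */
        public List<List<Triple>> translate(List<String> keywords) {
            List<List<Triple>> perKeyword = new ArrayList<>();
            for (String kw : keywords) {
                List<Triple> options = new ArrayList<>();
                if (tableNames.contains(kw)) {                       // 1) keyword is a table name
                    options.add(new Triple(kw, null, null));
                } else if (attributeNames.containsKey(kw)) {         // 2) keyword is an attribute name
                    for (String table : attributeNames.get(kw)) {
                        options.add(new Triple(table, kw, null));
                    }
                } else {                                             // 3) keyword is an attribute value
                    for (String[] ta : attributeValues.getOrDefault(kw, List.of())) {
                        options.add(new Triple(ta[0], ta[1], kw));
                    }
                }
                if (!options.isEmpty()) perKeyword.add(options);     // unmatched keywords are skipped here
            }
            return cartesianProduct(perKeyword);
        }

        /** Expands per-keyword ambiguity into all combinations, i.e. several sets of triples. */
        private static List<List<Triple>> cartesianProduct(List<List<Triple>> lists) {
            List<List<Triple>> result = new ArrayList<>();
            result.add(new ArrayList<>());
            for (List<Triple> options : lists) {
                List<List<Triple>> next = new ArrayList<>();
                for (List<Triple> prefix : result) {
                    for (Triple option : options) {
                        List<Triple> extended = new ArrayList<>(prefix);
                        extended.add(option);
                        next.add(extended);
                    }
                }
                result = next;
            }
            return result;
        }
    }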

3.3.2. Query Subgraphs Construction

Before the construction of the query subgraphs, we should combine the triples in each set. This paper defines the following merger rules.

Assume that a set of triples corresponds to the query. A triple will be added to a new group if it satisfies one of the following three cases.

Case 1. The table name of the triple is different from all the table names of the existing triples.

Case 2. The table name of the triple is the same as that of an existing triple, but the attribute name and attribute value are null.

Case 3. The table name and attribute name of the triple are the same as those of an existing triple, but the attribute is single-valued.

These merger rules are intended to place the triples corresponding to different entities into different groups, so that the interpretation of the triples is further refined. After the translation from keywords to triples and the merging of triples, this paper translates the merged triple groups into query subgraphs, where each group contains all the triples belonging to the same entity. Algorithm 3 describes the construction process of query subgraphs in detail.

Input: Merged triples , Schema
graph
Output: Set of Query Subgraphs
Method:
   ; ;
   Let be a query subgraph;
   for to do
       Create a node for group of ;
       Let corresponds to in ;
        ;
   Insert into ;
        ;
       foreach intermediate node in do
        Create a node and insert it into ;
       foreach edge in do
        Create an edge in ;
   Add into ;
   Return ;

Algorithm 3 first initializes its variables. It then creates a node for each triple group in the merged triples and adds it to the query subgraph, traverses the schema graph to match the minimal subgraph connecting these nodes, adds the intermediate nodes and edges of that subgraph to the query subgraph, and finally returns the set of query subgraphs.
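A simplified Java sketch of this construction is given below. The schema graph is modeled as an undirected adjacency map over table names, and the minimal connecting subgraph is approximated by taking the union of shortest paths from one matched table to each of the others (a true minimal subgraph would require a Steiner-tree computation); all names and structures are illustrative assumptions.

    import java.util.*;

    public class QuerySubgraphBuilder {

        /** Approximates the query subgraph connecting the tables mentioned by the triple groups. */
        public static Map<String, Set<String>> build(Set<String> matchedTables,
                                                     Map<String, Set<String>> schemaGraph) {
            Map<String, Set<String>> subgraph = new HashMap<>();
            matchedTables.forEach(t -> subgraph.put(t, new HashSet<>()));
            if (matchedTables.size() < 2) return subgraph;

            Iterator<String> it = matchedTables.iterator();
            String root = it.next();
            while (it.hasNext()) {
                List<String> path = shortestPath(root, it.next(), schemaGraph);
                for (int i = 0; i + 1 < path.size(); i++) {          // add intermediate nodes and edges
                    addEdge(subgraph, path.get(i), path.get(i + 1));
                }
            }
            return subgraph;
        }

        private static List<String> shortestPath(String from, String to, Map<String, Set<String>> g) {
            Map<String, String> parent = new HashMap<>();
            Deque<String> queue = new ArrayDeque<>(List.of(from));
            parent.put(from, from);
            while (!queue.isEmpty()) {                                // breadth-first search
                String node = queue.poll();
                if (node.equals(to)) break;
                for (String next : g.getOrDefault(node, Set.of())) {
                    if (!parent.containsKey(next)) {
                        parent.put(next, node);
                        queue.add(next);
                    }
                }
            }
            LinkedList<String> path = new LinkedList<>();
            if (!parent.containsKey(to)) return path;                 // disconnected: no path found
            for (String n = to; !n.equals(from); n = parent.get(n)) path.addFirst(n);
            path.addFirst(from);
            return path;
        }

        private static void addEdge(Map<String, Set<String>> g, String a, String b) {
            g.computeIfAbsent(a, k -> new HashSet<>()).add(b);
            g.computeIfAbsent(b, k -> new HashSet<>()).add(a);
        }
    }

For the DBLP schema of Figure 4, for instance, matched tables Paper and Author would be connected through the relationship table Write, which becomes an intermediate node of the query subgraph.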

4. Results and Discussion

Our experiments are conducted on the real dataset DBLP [29]. The experiments mainly deal with the selection of the balance factor in the term similarity calculation and the performance evaluation of our query expansion method ReInterpretQE. Section 4.1 introduces the dataset, query sets, and experimental environment. Section 4.2 compares the precision of the term similarity calculation under different parameter values to choose the optimal value of the balance factor. Section 4.3 presents contrast experiments using Metadata [3] and -coupling [11] as baselines, and the performance of these algorithms is evaluated using three metrics: precision, recall, and F-score. These experiments verify the performance of the method ReInterpretQE.

4.1. Experimental Setup
4.1.1. Dataset

This paper uses the DBLP dataset released in March 2015 [29], which can be downloaded from http://dblp.uni-trier.de/. The main statistics of the dataset are shown in Table 5. DBLP is a computer bibliography dataset widely used for query expansion in relational databases; it records information about papers published by scholars. Its original form is XML, and we use the Java SAX API to parse the XML file. We thereby obtain five data tables, where tables Author, Paper, and Conference contain information about scholars, papers, and conferences, respectively. Tables Cite and Write are relationship tables: the former specifies the reference relationships between papers and the latter the writing relationships between scholars and papers. Figure 4 shows a sample of the five tables from the DBLP dataset. The DBLP database contains a large number of tuples, which have semantic relevance in content and primary-foreign-key relationships in structure, so the dataset is very appropriate for testing the performance of our query expansion method.

4.1.2. Query Sets

In the experiment, we invite researchers to choose keywords from the DBLP dataset and then build the keyword queries they want to perform. By this method, 6 sets of queries with lengths ranging from 1 to 6 are obtained to form the query sets, and each set contains 10 queries. In addition, the researchers continually submit queries over an extensive scope, and we collect 600 of these queries from the researchers as the query history.

4.1.3. Experimental Environment

Our experiments are performed on a computer running Windows 10 with Intel(R) Core(TM) i5-4570 CPU @ 3.20 GHz, 4 GB of RAM, and 1 TB Disk. All the algorithms are implemented in Java.

4.2. Parameter Setting

The experiment in this section evaluates the impact of the balance factor on the precision of the calculation results and provides guidelines for choosing a good value. First, we randomly select 8 keywords from the query sets in Section 4.1. For each keyword, we obtain the corresponding top-6 related keywords by formula (10); by adjusting the balance factor from 0 to 1 we obtain 11 different sets of results. Second, we integrate the related keywords in these results to obtain a candidate set (of size at most 66). Finally, we calculate the cooccurrence rate between each candidate keyword and the selected keyword and mark the top-10 ranked keywords as the real related set. Based on this real set, the precision under each parameter value is the ratio between the number of recommended keywords that belong to the real set and the total number of recommended keywords (formula (13)). The precisions of the 8 keywords are averaged to obtain the final precision. Figure 5 illustrates how the precision varies with the balance factor.
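Writing the set of keywords recommended for keyword k under a given balance-factor value as R(k) and the real related set as Real(k) (both names are assumptions introduced here), the precision measure described above can be expressed as:

    \[ \mathrm{precision} = \frac{1}{8} \sum_{k} \frac{|R(k) \cap \mathrm{Real}(k)|}{|R(k)|} \]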

As shown in Figure 5, the precision of the term similarity calculation reaches its maximum, 0.87, when the balance factor equals 0.5. We therefore set the balance factor to 0.5 in the following experiments.

4.3. Performance Study

In this subsection, we report the performance of the query expansion method ReInterpretQE in comparison with the state-of-the-art approaches Metadata [3] and -coupling [11]. The performance is measured by three evaluation metrics: precision, recall, and F-score. A corresponding SQL statement is generated for each query in the query set; the results obtained through the SQL statements are regarded as the real query results and added to the test set Test. Given the result set Result and the real result set Test, the precision, recall, and F-score can be calculated as follows.
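With the standard set-based definitions (and assuming the F-score here is the balanced F1 measure), these metrics are:

    \[ \mathrm{precision} = \frac{|Result \cap Test|}{|Result|}, \qquad \mathrm{recall} = \frac{|Result \cap Test|}{|Test|}, \qquad F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]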

Figures 6, 7, and 8 show the comparisons in terms of precision, recall, and F-score. Each data point on the x-axis of Figures 6, 7, and 8 corresponds to the number of keywords, and the y-axis presents the corresponding precision, recall, and F-score.

As we can see in Figure 6, the query expansion method ReInterpretQE proposed in this paper achieves significantly higher precision than Metadata and -coupling. For example, the average precision of ReInterpretQE reaches 0.81, which is 20.9% and 26.6% higher than Metadata and -coupling, respectively. Specifically, when the number of keywords is 4, the precision of ReInterpretQE is 0.76, while the precision of Metadata and -coupling is 0.64 and 0.62; ReInterpretQE thus increases the precision by 18.8% and 22.6%, respectively. Overall, the precision of the Metadata method is slightly higher than that of -coupling, and the precision of ReInterpretQE is clearly improved compared with both. This comparison shows the significance of the proposed query interpretation, which helps describe the semantics of keyword queries and thus significantly improves the query precision. More specifically, the reason for the poorer performance of -coupling, compared to Metadata and ReInterpretQE, is that -coupling focuses on identifying a set of keyword queries related to the given keyword query; the expanded queries it obtains are still keyword queries without structure information between keywords, and the inherent ambiguity of keyword queries directly affects the query precision. Metadata and ReInterpretQE transform the initial keyword query into SQL and query subgraphs, respectively, and the expansions of both methods contain structural information, which helps improve the query precision. The reason for the more effective expansion of ReInterpretQE over Metadata is that Metadata does not consider the multiple possible connections between keywords and the various explanations for a keyword query, whereas ReInterpretQE translates the keyword query into a set of query subgraphs with structural and semantic information through the translation algorithm of triples and the construction algorithm of subgraphs. These subgraphs can locate the query results more precisely.

To further evaluate the performance of our method, we vary the number of input keywords and compare the corresponding recalls. Figure 7 plots the average recall of the different query expansion methods. The evaluation results show that ReInterpretQE generally produces expansions with higher recall than the other two methods, suggesting that query recommendation is crucial for good performance. More precisely, when the number of keywords is 4, the recall of Metadata and -coupling is 0.63 and 0.73, respectively, while that of ReInterpretQE is 0.80. As expected, the recall of ReInterpretQE is approximately 9.6% higher than that of -coupling, and the gap is even larger compared with Metadata. We observe that -coupling and ReInterpretQE beat Metadata significantly. This is because the Metadata method does not take similar words and related items into account, while -coupling and ReInterpretQE handle this problem accordingly, which helps make query expansion progressive and efficient and thus leads to higher query recall. ReInterpretQE always generates better results than -coupling. For example, ReInterpretQE achieves an average recall of 0.82, about 5.1% higher than -coupling. The reason is that -coupling only uses the keyword coupling relationship matrix to analyze the original query, whereas ReInterpretQE expands every keyword with similar words and related items and extends the query to a query list using the probabilistic recommendation algorithm. ReInterpretQE builds a Viterbi model using dynamic programming and generates the query list related to the input query. After this operation, the query results include more complete and comprehensive information, and the recall of the results is further improved.

Figure 8 further illustrates that the query expansion method ReInterpretQE outperforms the baseline methods through a comparison of the methods’ F-scores. The overall trend is clear: for all keyword numbers we evaluated, the results produced by ReInterpretQE are significantly better than those of the other two methods. For example, when the number of keywords is 4, the F-score of Metadata and -coupling is 0.63 and 0.67, respectively, while the F-score of ReInterpretQE is 0.78; ReInterpretQE thus increases the F-score by 23.8% and 16.4%, respectively. We investigated the reasons why ReInterpretQE performs better than the baseline methods. On the one hand, ReInterpretQE calculates the term similarity considering both intrasimilarity and intersimilarity, so the similarity calculation between terms is more reasonable; the probabilistic recommendation algorithm (Algorithm 1) then builds the Viterbi model using dynamic programming on top of this similarity, which improves the query recall. On the other hand, ReInterpretQE uses the translation algorithm of triples (Algorithm 2) to translate keywords into triples and the query subgraph construction algorithm (Algorithm 3) to transform the triples into query subgraphs. In the construction of the query subgraphs, ReInterpretQE considers not only the expansion of the structure and content of the keyword query but also the various explanations of a keyword query, so the query precision is further improved.

Summary. Based on these observations, we conclude that ReInterpretQE indeed boosts the performance of query expansion and has a clear positive effect on the quality of query results. Thus ReInterpretQE can be considered an effective and practicable algorithm for query expansion.

5. Conclusions

Aiming at the problems of semantic fuzziness and limited expressiveness, this paper proposes a novel two-step query expansion method, ReInterpretQE. The method translates the keyword query into query subgraphs with potential structural and semantic information. Compared with traditional methods, it completes the query expansion and analysis relying only on the structure and content information of the database, without requiring query logs. In addition, the method uses query recommendation and query interpretation to balance the precision and recall of query results. Finally, experimental results on the DBLP dataset verify the effectiveness of the proposed method. Many questions remain open in the research of query expansion in relational databases, and we will investigate them further in future work; for instance, we will take into account the influence of user feedback on the performance of query expansion.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Yingqi Wang and Nianbin Wang conceived and designed the experiments; Lianke Zhou performed the experiments; Lianke Zhou analyzed the data; Yingqi Wang wrote the paper.

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China under Grant nos. 61272185 and 61502037, the Fundamental Research Funds for the Central Universities (no. HEUCF160602), the Natural Science Foundation of Heilongjiang Province of China under Grant nos. F201340 and F201238, and the Basic Research Project (nos. JCKY2016206B001 and JCKY2015206C002).