Abstract

XML documents are now widely used for modelling and storing structured documents. Their structure is very rich and carries important information about contents and their relationships, for example, in e-commerce. Data-centric XML collections require queries that allow users to specify constraints on the document structure; mapping structural queries and assigning structure weights are therefore important for determining the set of possibly relevant documents with respect to structural conditions. In this paper, we present an extension to the MEXIR search system, which we call the Exponentiation function, that supports the combination of structural and content conditions in the form of content-and-structure queries. Our experiments show that structural information improves the effectiveness of the search system by up to 52.60% over the baseline BM25 in terms of MAP.

1. Introduction

Nowadays, XML (http://www.w3.org/TR/xml11/) research deals with an increasing number of documents that have a well-defined structure [1]. Exploiting this structure is a significant part of improving retrieval effectiveness, and existing approaches can be divided into two categories: those that use the document structure and those that use the structure of user queries. Several retrieval models based on the document's structure have been developed, such as the BM25F [2] ranking function, which combines several document fields with potentially different degrees of importance; PRM-S [3], which is based on the probabilistic retrieval model; and FRM [4], a relevance feedback function based on the language model. Broschart and Schenkel used proximity weighting to improve the search system [5]. Other approaches are based on user queries, such as QRX [6], which uses a tree matching model that does not require knowing the exact structure of the data and applies the similarity measure of the vector space model. Unfortunately, this method has a drawback in terms of efficiency. Weights have also been derived from the depth of the path and the location in the document's logical structure and then used as probabilities in a language model [7]; element length has been used for normalization, incorporated through a prior probability in the ranking function [8]. The structure weight in the TopX (http://topx.sourceforge.net/) search engine is highlighted in [9, 10]. It assigns a small, constant, and tunable score to every navigational condition that is matched by the query, using the frequency of the tag name. Weights have also been calculated from the distribution of tag names, in a way similar to the binary independence retrieval model, by investigating the presence of tags in relevant and nonrelevant elements to estimate the tag weights [11].
In [12], it is shown that structure does not improve the effectiveness of the retrieval system much, because users are very bad at giving structural hints with respect to the INEX-IEEE collection, and that this issue requires further investigation. In this paper, we investigate retrieval techniques and related issues over a strongly structured collection of XML documents, the Initiative for the Evaluation of XML Retrieval (INEX) (https://inex.mmci.uni-saarland.de/) collections, based on user queries. With richly structured XML data, we show that structural information, weighted using the Exponentiation function, can be utilized to improve the effectiveness of search systems.

This paper is organized as follows. Section 2 reviews the data model and notions, including state-of-the-art approaches. Section 3 explains our method. Section 4 presents the experimental results and discussion; conclusions and further work are given in Section 5.

2. Data Model and Notions

In this section, we provide some historical perspectives on areas of XML research that have influenced this article as follows.

2.1. XML Indexing Methods

The basic XML data model is a labeled, ordered tree. Figure 1 shows the data tree of an XML document based on the node-labeled model.

Classical retrieval models have been adapted to XML retrieval. Several indexing strategies have been developed in XML retrieval as shown in Figure 2.

Element-Based indexing [8] allows each element to be indexed on the basis of both its direct text and the text of its descendants. This strategy has a major drawback in that it is highly redundant: text occurring at the nth level of the XML logical structure is indexed n times and thus requires more index space. This strategy is illustrated in Figure 2(a), where all elements are indexed.

Leaf-Only indexing [13] indexes only the leaf elements, that is, the elements directly related to text. This strategy addresses the redundancy issue noted above. However, it requires an efficient propagation algorithm for the retrieval of nonleaf elements. This strategy is illustrated in Figure 2(b), where the leaf elements are indexed.

Aggregation-Based indexing [14] uses the concatenated text of an element to estimate term statistics; that is, term statistics are aggregated directly on the basis of the element's text and that of its descendants. This is illustrated in Figure 2(b), where the leaf elements are indexed.

Selective indexing [13, 15] involves eliminating small elements and elements of a selected type; this strategy is illustrated in Figure 2(c), where only semantic elements are indexed.

Distributed indexing [15] creates a separate index for each type of element, in conjunction with the selective indexing strategy, as shown in Figure 2(c). The ranking model runs on each index separately and retrieves ranked lists of elements. These lists are merged to provide a single ranking across all element types. To merge the lists, normalization is performed to take into account the variation in element size across the different indices, so that scores across indices are comparable.
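The redundancy of element-based indexing compared with leaf-only indexing can be illustrated with a minimal Python sketch. The sample document, the function names, and the path notation below are hypothetical and chosen only for illustration; they are not the indexer used in this paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; tag names and text are illustrative only.
SAMPLE = (
    "<article><title>XML retrieval</title>"
    "<section><title>Indexing</title><p>leaf only indexing</p></section>"
    "</article>"
)


def walk(node, path):
    """Yield (absolute path, element) pairs in document order."""
    yield path, node
    seen = Counter()
    for child in node:
        seen[child.tag] += 1
        yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")


def element_based_index(root):
    """Index every element on its own text plus the text of its descendants."""
    return {
        path: " ".join(node.itertext()).lower().split()
        for path, node in walk(root, f"/{root.tag}[1]")
    }


def leaf_only_index(root):
    """Index only elements that directly contain text (tail text is ignored)."""
    return {
        path: node.text.lower().split()
        for path, node in walk(root, f"/{root.tag}[1]")
        if node.text and node.text.strip()
    }


root = ET.fromstring(SAMPLE)
element_index = element_based_index(root)
leaf_index = leaf_only_index(root)

# Total number of indexed term occurrences under each strategy.
element_postings = sum(len(terms) for terms in element_index.values())
leaf_postings = sum(len(terms) for terms in leaf_index.values())
print(element_postings, leaf_postings)
```

In this sketch, a term inside the paragraph at the third level of the tree is stored three times by the element-based index (once for the paragraph and once for each ancestor) but only once by the leaf-only index.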

2.2. XML Query Languages

Querying in structured documents must be with respect to content and structure. INEX identified two types of queries [23, 24]; they are content only (CO) and content and structure (CAS) as follows.

2.2.1. Content Only Queries

These queries are formed by ignoring the document structure, in the same way as the traditional queries used in IR collections. However, they pose a challenge to XML retrieval in that the retrieval results in returning document components, that is, XML elements instead of whole documents in response to a user query. Queries can be elements of various complexities, that is, at different levels of the XML document's structure. This is suitable for XML retrieval where users do not know or are not concerned about the structure, that is, with the logical organization of the document, when expressing their information needs. For example, the best answer for a query “XML retrieval” applied to Figure 1 may be a “section” and not “title” or “p” elements.

2.2.2. Content-and-Structure Queries

These queries contain conditions of both content and structure. These conditions may refer to the content of specific elements and specify the type of requested answer elements. However, the complexity and the expressiveness of content-and-structure query languages are difficult for the end users because they have to know the logical organization of the document when expressing their information needs. Trotman and Lalmas [12] showed that the structure did not improve the effectiveness of the retrieval system very much because users were normally not capable of giving useful structural hints with respect to INEX-IEEE collection. However, the content-and-structure query can be very useful for expert users in specialized scenarios.

2.2.3. The Narrowed Extended XPath I

The Narrowed Extended XPath I (NEXI) query language was developed at INEX [25] as a simple query language for content-oriented XML retrieval evaluation. The enhancement comes from the introduction of a new function named “about()”. The “contains()” function of XPath, which requires an element (its text) to contain the given string content, was replaced by the “about()” function, which requires an element to be about the content. The NEXI query provides support for the descendant axis as follows. //A[B] is the simple form and returns elements with paths matching A and contents about B. //A[B]//C returns C elements which are descendants of the element A, where the element A contains B. //A[B]//C[D] returns C elements which are descendants of the element A, where the element A contains B and the element C contains D.
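To make the structure of such queries concrete, the following Python sketch splits a simplified NEXI query into its target path and its content filters. It is only an illustration under stated assumptions: the function name `parse_nexi` and the regular expression are hypothetical, and the sketch handles a single “about()” clause per step, without Boolean operators or term modifiers.

```python
import re

# One path step sequence such as //article//sec, optionally followed by a
# single [about(<relative path>, <keywords>)] filter. Simplified on purpose.
STEP = re.compile(
    r"((?://[\w*]+)+)(?:\[about\(\s*(\.[^,]*?)\s*,\s*([^)]*?)\s*\)\])?"
)


def parse_nexi(query):
    """Return (target path, [(context path, keywords), ...]) for a query."""
    target = ""
    filters = []
    for match in STEP.finditer(query):
        target += match.group(1)
        if match.group(2) is not None:
            # The relative path starts with "."; resolve it against the
            # path collected so far to obtain the context of the keywords.
            context = target + match.group(2)[1:]
            filters.append((context, match.group(3).split()))
    return target, filters


# Hypothetical example query: sections whose paragraphs are about the terms.
print(parse_nexi("//section[about(.//p, xml retrieval)]"))
```

The target path identifies the elements to return, while each filter pairs a context path with the keywords that must be matched in that context.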

2.3. Structure Weight IR

Schlieder and Meuss presented QRX [6], which is based on tree matching without knowing the exact structure of the data and uses the similarity measure of the vector space model; an element score is computed as follows:

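Stephen et al. [2] and Robertson and Zaragoza [26] present BM25F as an extension of the baseline BM25 [27] scoring function that is adapted to score documents with fields. Using the BM25F scheme presented in [28], an element score is computed as follows:

$$\mathrm{BM25F}(e, q) = \sum_{t \in q} \frac{\bar{x}_{e,t}}{K + \bar{x}_{e,t}}\, W_t,$$

where $\mathrm{BM25F}(e, q)$ measures the relevance of element $e$ to query $q$, $\bar{x}_{e,t}$ is a weighted normalized term frequency, $K$ is a common tuning parameter for the BM25, and $W_t$ is the inverse document frequency weight of term $t$.

The weighted normalized term frequency is obtained by first performing length normalization on the term frequency $x_{e,f,t}$ of term $t$ in field $f$ of element $e$ as follows:

$$\bar{x}_{e,f,t} = \frac{x_{e,f,t}}{1 + B_f\left(\frac{l_{e,f}}{l_f} - 1\right)},$$

where $B_f$ is a smoothing parameter, $l_{e,f}$ is the length of field $f$ in element $e$, and $l_f$ is the average length of field $f$ over the elements in the entire collection, and then multiplying the normalized term frequency by the field weight $W_f$ and summing over the fields:

$$\bar{x}_{e,t} = \sum_{f} W_f \cdot \bar{x}_{e,f,t}.$$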
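The two-stage BM25F computation (per-field length normalization, then a weighted sum over fields that is saturated once per term) can be sketched in Python as follows. This is a minimal sketch of the commonly published BM25F form, not code from the cited systems; the function name, the argument layout, and the default value of `K` are assumptions made for illustration.

```python
def bm25f_score(query_terms, field_tf, field_len, avg_field_len,
                field_weight, field_b, idf, K=1.2):
    """Score one element with a BM25F-style function.

    field_tf:      {field: {term: raw term frequency in that field}}
    field_len:     {field: length of the field in this element}
    avg_field_len: {field: average length of the field in the collection}
    field_weight:  {field: weight of the field}
    field_b:       {field: length-normalization (smoothing) parameter}
    idf:           {term: inverse document frequency weight}
    K:             saturation parameter (assumed default, tune as needed)
    """
    score = 0.0
    for term in query_terms:
        weighted_tf = 0.0
        for field, tf in field_tf.items():
            # Length-normalize the raw frequency of the term in this field.
            norm = 1.0 + field_b[field] * (
                field_len[field] / avg_field_len[field] - 1.0
            )
            # Weight the normalized frequency by the field weight.
            weighted_tf += field_weight[field] * tf.get(term, 0) / norm
        # Saturate once per term, after the fields have been combined.
        score += weighted_tf / (K + weighted_tf) * idf.get(term, 0.0)
    return score


example = bm25f_score(
    ["xml"],
    field_tf={"title": {"xml": 2}},
    field_len={"title": 5},
    avg_field_len={"title": 5.0},
    field_weight={"title": 1.0},
    field_b={"title": 0.5},
    idf={"xml": 1.0},
    K=1.0,
)
print(example)
```

Combining the fields before applying the saturation is the point of the design: a term that occurs in several fields is not rewarded as if each field were an independent document.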

Kim and Croft [4] recently introduced the Field Relevance Model (FRM). FRM employs the notion of field relevance and a corresponding retrieval model between query terms and document fields, which are calculated by Field Relevance given a query = , and field relevance   is the distribution of per-term relevance over document fields. Field Relevance Model is based on field relevance estimates ; the Field Relevance Model combines field-level scores for each document using field relevance instead of weights as follows:

Broschart and Schenkel [5] presented the use of proximity-aware scoring functions that lead to significant effectiveness improvements for XML retrieval. This method introduces modified proximity scores that take the document structure as follows:

To compute the proximity part of the score for each term , at first compute an accumulated score that depends on the distance of this term's occurrences in the element to other terms, adjacent query term occurrences using for each adjacent occurrence of a term at distance to an occurrence of , the grows by . The proximity score is computed as follows: where measures the relevance of element to a query ,  is calculated by .

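The approach of Ogilvie and Callan [7] is based on language models and employs element-based indexing. Given a query $q = (t_1, \ldots, t_m)$ and, for each element $e$, its corresponding element language model $M_e$, the element is ranked as follows:

$$P(e \mid q) \propto P(e)\, P(q \mid M_e),$$

where $P(e)$ is the probability of relevance for element $e$ and $P(q \mid M_e)$ is the probability of the query being generated by the language model $M_e$. For instance,

$$P(q \mid M_e) = \prod_{i=1}^{m} \left( \lambda\, P(t_i \mid e) + (1 - \lambda)\, P(t_i \mid C) \right),$$

where $P(t_i \mid e)$ is the estimate of term $t_i$ in element $e$, $P(t_i \mid C)$ is the probability of term $t_i$ in collection $C$, and $\lambda$ is the smoothing parameter.

To account for the length of an element $e$, and in particular for the heavily biased distribution of small elements in XML documents, the element length can be used to set the prior $P(e)$ as follows [8]:

$$P(e) = \frac{|e|}{\sum_{e' \in C} |e'|},$$

where $|e|$ is the length of element $e$ and $|e'|$ is the length of an element $e'$ occurring in collection $C$.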
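A minimal Python sketch of language-model ranking with a length prior is given below. It assumes linear (Jelinek-Mercer) interpolation between the element model and the collection model and a prior proportional to element length; the function names and the default value of `lam` are hypothetical and not taken from the cited work.

```python
import math


def query_log_likelihood(query_terms, elem_tf, elem_len,
                         coll_tf, coll_len, lam=0.5):
    """Log-probability of the query under a smoothed element language model.

    Assumes P(t|M_e) = lam * P(t|e) + (1 - lam) * P(t|C).
    """
    log_p = 0.0
    for term in query_terms:
        p_elem = elem_tf.get(term, 0) / elem_len if elem_len else 0.0
        p_coll = coll_tf.get(term, 0) / coll_len
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            # The term is unseen in both the element and the collection.
            return float("-inf")
        log_p += math.log(p)
    return log_p


def log_length_prior(elem_len, total_len):
    """Log of a prior proportional to element length."""
    return math.log(elem_len / total_len)


def rank_score(query_terms, elem_tf, elem_len, coll_tf, coll_len, lam=0.5):
    """Log of prior times query likelihood; higher means more relevant."""
    return log_length_prior(elem_len, coll_len) + query_log_likelihood(
        query_terms, elem_tf, elem_len, coll_tf, coll_len, lam
    )


print(rank_score(["xml"], {"xml": 2}, 4, {"xml": 10}, 100))
```

Working in log space avoids numerical underflow for long queries, and the length prior counteracts the tendency of very small elements to obtain high likelihoods.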

Theobald et al. [10] present the extended BM25 function in the TOPX, which is known as the Compactness of the baseline BM25 as follows: where  is the length of element with tag ,  is the average length of elements in the entire collection with tag , , and is a common tuning parameter for the BM25.

The modified function provides a dampened influence of the with tag . However, this strategy is limited in that each tag name must be the same to implement automatic grouping and weight calculation.

The idea is to associate a weight with a structural constraint to reflect its significance. These weights are then used in the scoring function that estimates an element's relevance. With the increased availability of data-centric XML documents, the need to query both the structure and the content of XML documents has become explicit. As a result, a richer information source is available, which allows us to improve the performance of search systems. Our approach uses a structure weight method, as discussed in Section 3.

3. Method

In this section, we describe our approach in more detail. The search results are refined at every step, and the refinement ultimately narrows down the set of potentially interesting documents.

3.1. Step  1: Elements Score

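First, we define $\mathrm{Score}(e, t)$ as the relevance score of a term $t$ for an element $e$, and we use the baseline BM25 formula [27] as implemented in Sphinx (http://sphinxsearch.com/) [29] to score the element nodes according to the query terms contained in the content conditions as follows:

$$\mathrm{Score}(e, t) = W_t \cdot \frac{(k_1 + 1)\, tf(t, e)}{k_1\left((1 - b) + b\,\frac{len(e)}{avel}\right) + tf(t, e)}, \qquad (12)$$

where $\mathrm{Score}(e, t)$ measures the relevance of element $e$ to query term $t$, $tf(t, e)$ is the frequency of term $t$ occurring in element $e$, $len(e)$ is the length of element $e$, $avel$ is the average length of elements in the entire collection, and $k_1$ and $b$ are used to balance the weight of term frequency and element length.

We then compute the inverse element frequency as follows:

$$W_t = \log \frac{N - ef(t) + 0.5}{ef(t) + 0.5}, \qquad (13)$$

where $W_t$ is the inverse element frequency weight of term $t$, $N$ is the total number of elements in the entire collection, and $ef(t)$ is the total number of elements in which term $t$ occurs.

For an “about()” function in NEXI with multiple terms $t_1, \ldots, t_m$ that appear in an element $e$, the aggregated score of $e$ is simply computed as the sum of the element's scores for each term condition as follows:

$$\mathrm{Score}(e, \mathrm{about}(t_1, \ldots, t_m)) = \sum_{i=1}^{m} \mathrm{Score}(e, t_i). \qquad (14)$$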
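The element scoring step can be sketched in Python as below. This is an illustration under stated assumptions rather than the Sphinx implementation: it uses the textbook BM25 term weight and a Robertson/Sparck Jones style inverse element frequency, and the defaults `k1=1.2` and `b=0.75` are conventional values, not the settings used in our experiments.

```python
import math


def inverse_element_frequency(n_elements, elem_freq):
    """Weight of a term given how many elements contain it (assumed form)."""
    return math.log((n_elements - elem_freq + 0.5) / (elem_freq + 0.5))


def bm25_term_weight(tf, elem_len, avg_len, k1=1.2, b=0.75):
    """Saturating, length-normalized term frequency component of BM25."""
    return tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * elem_len / avg_len))


def about_score(terms, elem_tf, elem_len, avg_len, n_elements, elem_freq,
                k1=1.2, b=0.75):
    """Sum of per-term scores for all terms of one about() condition."""
    total = 0.0
    for term in terms:
        tf = elem_tf.get(term, 0)
        if tf == 0:
            continue  # a term that is absent from the element adds nothing
        ief = inverse_element_frequency(n_elements, elem_freq.get(term, 0))
        total += ief * bm25_term_weight(tf, elem_len, avg_len, k1, b)
    return total


score = about_score(
    ["xml", "retrieval"],
    elem_tf={"xml": 1, "retrieval": 1},
    elem_len=10,
    avg_len=10.0,
    n_elements=100,
    elem_freq={"xml": 10, "retrieval": 10},
)
print(score)
```

Note that with this assumed form the inverse element frequency becomes negative for terms that occur in more than half of all elements, so very common terms may need to be filtered or clamped.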

3.2. Step  2: Score Sharing Function

In the second step of our approach [30], we take the scores from (14) for all elements in the collection that contain query terms and consider the score of each element by accounting for its relevant descendants. The scores of retrieved elements are shared between the leaf nodes and their parents in the XML document tree according to the following scheme:

$$\mathrm{Score}(p, Q) \leftarrow \mathrm{Score}(p, Q) + \beta^{n} \cdot \mathrm{Score}(c, Q),$$

where $p$ is the current parent node, $c$ is a relevant child of element $p$, $\beta$ is a tuning parameter, and $n$ is the distance between the current parent node and the leaf node. If $0 < \beta < 1$, preference is given to the leaf node over the parents; otherwise, preference is given to the parents.
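A minimal Python sketch of score sharing is shown below. It assumes that each ancestor receives the leaf score damped by the tuning parameter raised to the distance between ancestor and leaf; this damping form, the function name `share_scores`, and the absolute-path notation are assumptions made for illustration.

```python
def share_scores(leaf_scores, beta=0.7):
    """Propagate leaf scores to all ancestors.

    leaf_scores maps absolute paths such as "/article[1]/section[1]/p[1]"
    to scores. Each ancestor at distance n from a leaf receives
    score * beta ** n (assumed sharing rule).
    """
    scores = dict(leaf_scores)
    for path, score in leaf_scores.items():
        steps = path.strip("/").split("/")
        for distance in range(1, len(steps)):
            ancestor = "/" + "/".join(steps[:-distance])
            scores[ancestor] = scores.get(ancestor, 0.0) + score * beta ** distance
    return scores


# Illustrative values only: two paragraphs with score 10 and beta = 0.7.
shared = share_scores(
    {
        "/article[1]/section[1]/p[1]": 10.0,
        "/article[1]/section[1]/p[2]": 10.0,
    },
    beta=0.7,
)
for path in sorted(shared):
    print(path, round(shared[path], 2))
```

With a parameter between 0 and 1 the contribution shrinks with every level, so a leaf always scores at least as high as the share it passes to any single ancestor; an ancestor overtakes its leaves only by accumulating shares from several relevant descendants.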

3.3. Step  3: Exponentiation Weight Function

The third step of our approach is the structure score evaluation. To improve the search results for richly structured data, we assume that a query is composed of content (keyword) and structure constraints. The document-query similarity is evaluated by considering content and structure separately, and the two scores are then combined for the set of possibly relevant elements. Our structural scoring model essentially counts the number of navigational (i.e., element name-only) query conditions that are satisfied by a result candidate, in addition to the content conditions matched for the user query. It assigns a weight to every navigational condition that matches the element name (i.e., an absolute path in the document structure). We analysed the structure of each INEX topic, as shown in Table 1, with respect to the INEX content-and-structure queries; each topic includes only a few structural indications. We therefore propose a novel structural score, the Exponentiation weight, which is applied when the structural constraints of the user query match the document tree.

In order to evaluate the sensitivity of the Exponentiation function, we varied its base, including base 10, base e, base 2, and base 1/2, as shown in Figure 3. According to the graph, the trend for base 2 is smoother than for the other values; moreover, powers of 2 are important in computer science because there are $2^n$ possible values for an n-bit binary variable. Thus, our algorithm simply uses base 2. After that, we recompute the element score with the Exponentiation weight $2^{s}$, where $s$ is the frequency of navigational conditions that are matched.
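The effect of the base can be explored with a short Python sketch. Two points are assumptions made for illustration and should not be read as the exact MEXIR implementation: navigational conditions are counted as the tag names of the query path that occur, in order, along the element's absolute path, and the weight is combined with the element score by multiplication.

```python
import math
import re

# Bases considered in the sensitivity analysis.
BASES = {"10": 10.0, "e": math.e, "2": 2.0, "1/2": 0.5}


def tags(path):
    """Split a path into tag names, dropping positional predicates."""
    return [re.sub(r"\[\d+\]$", "", step) for step in path.split("/") if step]


def matched_conditions(query_path, element_path):
    """Count query tags found in order along the element path."""
    remaining = iter(tags(element_path))
    count = 0
    for tag in tags(query_path):
        # any() consumes the iterator up to and including the first match.
        if any(tag == candidate for candidate in remaining):
            count += 1
        else:
            break
    return count


def exponentiation_score(score, matches, base=2.0):
    """Boost an element score by base ** matches (assumed combination)."""
    return score * base ** matches


matches = matched_conditions("//section//p", "/article[1]/section[1]/p[1]")
for name, base in BASES.items():
    print(name, exponentiation_score(10.0, matches, base))
```

In this sketch an element that matches no navigational condition keeps its content score unchanged, since any base raised to the power zero equals one, while a base below one would penalize structural matches instead of rewarding them.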

In the following, we define as the set of all elements in that match the target element of the query. In document mode, every document inherits the aggregated score among all target elements , and these document scores determine the output ranking among documents as follows:

To see how users use structure in their queries, for instance, the user query needs “retrieve document sections with the paragraph contains xml retrieval” as follows:

The first filter looks for occurrences of the term “xml” and “retrieval” in elements whose context matches the path “//section//p” on the . It is possible to assigning more weight for the return element . In this case, we assume the for each element is 10, is 0.7 and then the calculations are shown in Figure 4.

Thus, the for the document is .

4. Experiment Setup

In this section, we present and discuss the results based on the INEX collection. This experiment was performed on Intel Pentium i5 4 * 2.79 GHz with 6 GB of memory, Microsoft Windows 7 Ultimate 64 bit Operating System and Microsoft Visual C♯.NET 2008.

4.1. INEX Collection

The INEX-IMDB collection used in INEX 2010 (https://inex.mmci.uni-saarland.de/) was generated from the plain text files published on the IMDB web site on April 10, 2010. There are two kinds of objects in the collection, movies and persons involved in movies. Each object is richly structured. For example, each movie has title, rating, directors, actors, and so forth; each person has name, birth date, and so forth. In total, the IMDB data collection contains 4,418,081 XML documents, including 1,594,513 movies, 1,872,471 actors, 129,137 directors who did not act in any movie, 178,117 producers who did not direct or act in any movie, and 643,843 other people involved in movies who did not produce or direct or act in any movie.

4.2. INEX Evaluations

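The effectiveness of the retrieval results is evaluated using the same metrics as in traditional IR, for example, precision, recall, MAP, P@10, P@20, and P@30 [31, 32]. Given a topic $q$ and a set of documents $D$, each tested IR system returns an ordered subset of $D$, ranked by the system's estimate of the likelihood that each document is relevant to $q$. Several effectiveness measures are computed, including precision at $k$ returned documents (P@k) and average precision (AP), defined as follows:

$$P@k = \frac{1}{k} \sum_{i=1}^{k} rel(i), \qquad AP(q) = \frac{1}{|R_q|} \sum_{k=1}^{|D_q|} P@k \cdot rel(k),$$

where $rel(i)$ is 1 if the document at rank $i$ is relevant to $q$ and 0 otherwise, $R_q$ is the set of documents relevant to $q$, and $D_q$ is the ranked list returned for $q$.

Performance across a set of topics is measured by calculating the mean of the values obtained by the measure for each individual topic, resulting in MAP. Assuming there are $n$ topics $q_1, \ldots, q_n$:

$$MAP = \frac{1}{n} \sum_{j=1}^{n} AP(q_j).$$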
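These measures can be computed with a few lines of Python. The sketch below follows the standard definitions of P@k, AP, and MAP with binary relevance; the function names and the toy rankings are illustrative and are not the INEX evaluation scripts.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Average AP over topics; runs is a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy data: two topics with hypothetical document identifiers.
topic_a = (["d1", "d2", "d3", "d4"], {"d1", "d3"})
topic_b = (["d2"], {"d2"})
print(precision_at_k(*topic_a, k=2))
print(average_precision(*topic_a))
print(mean_average_precision([topic_a, topic_b]))
```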

4.3. Results and Discussion

In this section, we tuned the parameter using INEX-2005 ad hoc track evaluation scripts distributed by the INEX organizers. Our tuning approach was such that the sums of all relevance scores are maximized and then the total number of leaf node is and the parameter is set to . Following that, we used the Sphinx parameters for the BM25 where and and the entire Sphinx match mode values in our experiment include MATCH ANY (TF), MATCH PHRASE (PHRASE), and MATCH EXTENDED (BM25) and are provided in Table 2. The main components of the MEXIR [33] retrieval system are as follows.

(1) When new documents are entered into the system, the Absolute Document XPath Indexing (ADXPI) [34] indexer parses and analyzes the name of each element and its position to build inverted lists for each index in this system.

(2) The SphinxDB search engine is used to build both indices in the system. The Selected Weight index is based on term frequency, and the Leaf Node index is based on the classic BM25 function.

(3) The Score Sharing function is used to assign parent scores by assigning a proportion of the scores of the leaf nodes to their parents using a top-down approach.

(4) The Exponentiation function is used to adjust the element scores based on linear combination.

The MEXIR search engine retrieves XML elements based on the leaf-node index with respect to the significant words, applies the Exponentiation and Score Sharing functions, and then combines the relevance scores of the elements into the document score. Thus, the documents with the higher relevance scores are chosen as the retrieval set. The details of the experiments are shown in Table 3.

The performance of different features and ranking methods can now be evaluated. To deepen the analysis of the Exponentiation scoring function, we also ran experiments to study the impact of the structure weight on performance with content-and-structure queries. Table 4 compares the best performing runs with and without the Exponentiation technique. The run p16-BM25-EXPO used the Exponentiation function to boost element scores, and p16-BM25 is the baseline BM25; the Exponentiation function improved the effectiveness of the search system in terms of MAP, P@10, P@20, and P@30 by 52.60%, 50.60%, 54.16%, and 58.79%, respectively. Table 5 compares the best performing runs with and without the Score Sharing technique. The run p16-BM25-EXPO used the Exponentiation function, and p16-SS-SW used the Score Sharing function; the Exponentiation weight improved effectiveness over the Score Sharing technique in terms of MAP, P@10, P@20, and P@30 by 81.58%, 82.92%, 75.09%, and 67.83%, respectively. It can be seen that p16-BM25-EXPO obtained the best performance, and the improvement over both the baseline BM25 and the Score Sharing run is significant for most of the considered metrics. The significance () was computed with a 2-tailed t-test as shown in Table 6. The p16-BM25-EXPO improved by 0.48% over the baseline BM25 at MAP, and 0.75% over the baseline BM25 with the Score Sharing at MAP on the INEX-IMDB collection.

In this analysis, we take the results that were obtained from BM25 over the Exponentiation and compare them with the results from the baseline BM25 and over the Score Sharing function. It is shown again that Exponentiation works well with the document-centric XML documents. We can conclude that significant improvement of results of the Exponentiation function can be obtained from the content-and-structure query and document structure. This finding suggests that it is possible to improve the TF, PHRASE, and the baseline BM25 approaches, which are the usual benchmarks in INEX. The main conclusion that can be drawn from the experiments is that the Exponentiation function is successful in structure weight and could be utilized to improve the effectiveness of search systems.

In addition, we analyzed the effectiveness of the runs for each of the three topic types defined in INEX [17]; the results are presented in Tables 7, 8, and 9. The overall results are satisfactory when compared with those obtained by participants in the INEX contests. For the informational topics, our run ranked first, scoring 0.3564 measured with MAP; for the known-item topics, it ranked fifth, scoring 0.6667 measured with 1/Rank; and for the list topics, our run ranked first, scoring 0.4251 measured with MAP.

In this analysis, we take the results that were obtained from the INEX report [17]. It is shown again that our system works well with the List and Informational topics of the document-centric XML documents measured with the MAP metric. Unfortunately, on the known-item topics, the relevant answer is a single document; in this area, the performance was not satisfactory and so further investigation is required.

5. Conclusions

With the increased availability of data-centric XML documents, the need to query both the structure and the content of XML documents has become explicit. As a result, a richer information source is available, which allows us to improve the performance of search systems. In this paper, we investigated retrieval techniques and related issues over a strongly structured collection, using the Exponentiation weight for the document's structure with content-and-structure queries in the data-centric track of INEX 2011. Our expectation was that structure weighting would improve the effectiveness of search systems. In terms of processing time, our system required an average of one second per topic. In addition, our run for the ad hoc task showed that structural information can be utilized to improve the effectiveness of the search system: over the baseline BM25, the improvements in terms of MAP, P@10, P@20, and P@30 are 52.60%, 50.60%, 54.16%, and 58.79%, respectively, and over the Score Sharing technique they are 81.58%, 82.92%, 75.09%, and 67.83%, respectively. The success of our ad hoc run indicates that indexing the complete XML structure of IMDB and using structure weights are necessary for effective document retrieval in the search system.

In future work, we will look closer at the relative value of various types of metadata, tags, and subject headings. We will also look at the different weighting methods underlying the relevance judgements and topic categories, such as blind feedback and recommendation search.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.