Abstract

This paper describes a keyword search method for probabilistic XML data based on ELM (extreme learning machine). Keyword search on a probabilistic XML document differs from keyword search on a traditional XML document because it must take possible world semantics into account. A probabilistic XML document can be seen as a set of nodes consisting of ordinary nodes and distributional nodes. ELM performs well in text classification applications. XML is typical semistructured data in which labels are self-describing, so the label and context of a node can be treated as the text data of that node. ELM offers significant advantages such as fast learning speed, ease of implementation, and effective node classification. Once the nodes have been classified by ELM, set intersection can compute SLCA nodes quickly on the resulting node sets. In this paper, we adopt ELM to classify nodes and compute probabilities. We propose two algorithms, one based on ELM and one that adds a probability threshold, to improve the overall performance. The experimental results verify the benefits of our methods according to various evaluation metrics.

1. Introduction

Traditional databases manage only deterministic information, but many applications, such as information extraction, information integration, and web data mining, involve uncertain data. Because the XML data model is flexible, it allows a natural representation of uncertain data. Many probabilistic XML models have been designed and analyzed [14]. This paper selects a popular probabilistic XML model [5], which is discussed in [6]. In this model, a probabilistic XML document (called a p-document) is a labeled tree with two types of nodes, ordinary nodes and distributional nodes. An ordinary node represents actual data, and a distributional node represents the probability distribution of its child nodes. There are two types of distributional nodes, IND and MUX. The children of an IND node are independent of each other, while the children of a MUX node are mutually exclusive; that is, at most one child can exist in a random instance document (a possible world). A real number from (0, 1] is attached to each edge in the XML tree, indicating the conditional probability that the child node appears given the existence of its parent node. It follows from the definition of a MUX node that the existence probabilities of its children sum to at most 1.

Keyword search has been widely applied to XML data and is considered an effective information discovery method for querying it. Users do not need to know the underlying data structures or a complex query language beforehand, so keyword search is an easy way for ordinary users to retrieve information. Keyword search on XML data differs from search on text data: the result is a subtree rooted at a common ancestor node rather than a whole text document. Several definitions of the common ancestor node have been proposed, such as LCA (lowest common ancestor), SLCA (smallest LCA), and ELCA (exclusive LCA). These definitions are used to capture the users’ query intentions. SLCA and ELCA are subsets of LCA obtained by adding restrictions. In many cases, the size of the result set determines the accuracy of the query. This paper selects SLCA nodes as the roots of the result subtrees because the SLCA node set is the smallest among the LCA-based definitions.

Neural networks and SVMs (support vector machines) have played dominant roles among computational intelligence techniques, but they face challenging issues such as slow learning speed, tedious human intervention, and poor computational scalability. ELM [7, 8] is an emerging technology that works for generalized single-hidden layer feedforward networks (SLFNs). ELM [9–12] performs well in classification applications and can be used to classify nodes before querying XML data. Classification is considered an important cognitive computation task [13–16]. An XML data tree can be seen as the set of all its nodes: a single root node, connected (internal) nodes, and leaf nodes. A connected node has exactly one parent node and one or more children nodes. A keyword usually appears in a leaf node or in the parent of a leaf node. The classification therefore needs to consider two kinds of information, keyword information and structural information. XML contains structural information such as element-subelement relationships and element-value relationships. Element-subelement relationships include the ancestor-descendant and parent-child relationships, and the sibling relationship is also important. If a query has more than one keyword, the relationships between nodes play a crucial role in keyword search on XML data. The classification method is presented in Section 3.

This paper is organized as follows. Section 2 introduces the probabilistic XML model and the formal semantics of keyword search results on probabilistic XML data. Section 3 shows how to classify nodes and how to calculate probabilities. In Section 4, we propose an algorithm for keyword queries on probabilistic XML data that uses ELM to classify nodes. Section 5 describes the impact of the probability threshold. The experiments and performance evaluation are presented in Section 6. Sections 7 and 8 give the related work and the conclusion.

2. Problem Definitions

2.1. Probabilistic XML Data

A probabilistic XML document (p-document) can be seen as a set of many deterministic XML documents, each of which is called a possible world. A p-document is represented as a labeled tree with ordinary nodes, which represent actual data, and distributional nodes, which represent the probability distribution of their child nodes. Ordinary nodes are regular XML nodes and appear in both deterministic and probabilistic XML data. Distributional nodes are used only to define the probabilistic process of generating deterministic documents and do not occur in deterministic XML data. This paper adopts the model with IND and MUX distributional nodes as the probabilistic XML model. For example, Figure 1 shows a p-document. Ordinary nodes are shown as black solid points, for example, , , and . IND nodes are depicted as rectangular boxes with rounded corners, for example, IND1, IND2, and IND3. MUX nodes are displayed as circles, for example, MUX1.

A p-document can generate all its possible worlds (deterministic documents). Given a p-document, we traverse it in a top-down fashion. When we visit a distributional node, there are two situations according to its type. If the node is an IND node with m children nodes, we generate 2^m copies, one for each subset of the children. For each copy, the probability is the product of the existence probabilities of the children in the subset and the absence probabilities of the children not in the subset. If the node is a MUX node with m children nodes, we generate m or m + 1 copies: if the existence probabilities of the children sum to 1, there are m copies; otherwise there are m + 1, the extra copy being the one in which no child is selected. For each copy, the probability is the existence probability of the selected child; if no child has been selected, the probability is the absence probability, that is, 1 minus the sum of the children's existence probabilities. For example, Figure 2 shows the copies of a p-document with their probabilities. Figure 2(a) selects node as the only child node of node , and the probability is . If no node is selected as a child of , Figure 2(d) shows that the probability of this copy is . As shown in Figure 2(e), node selects nodes and as its children nodes, and node selects node as its child node. The probability is . The probabilities of the other copies () are easy to calculate with the above procedure.
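The copy generation at a single distributional node can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ind_copies` and `mux_copies`, the child labels, and the probabilities used with them are hypothetical.

```python
from itertools import product


def ind_copies(child_probs):
    """Enumerate the copies generated at an IND node.

    child_probs maps a (hypothetical) child label to its existence
    probability. Every subset of the children is one copy; its probability
    is the product of the existence probabilities of the selected children
    and the absence probabilities of the others.
    """
    labels = list(child_probs)
    copies = []
    for mask in product([False, True], repeat=len(labels)):
        prob = 1.0
        for label, keep in zip(labels, mask):
            p = child_probs[label]
            prob *= p if keep else (1.0 - p)
        chosen = frozenset(l for l, keep in zip(labels, mask) if keep)
        copies.append((chosen, prob))
    return copies


def mux_copies(child_probs, eps=1e-12):
    """Enumerate the copies generated at a MUX node.

    At most one child is selected. If the existence probabilities sum to
    less than 1, an extra copy with no child carries the remaining mass.
    """
    copies = [(frozenset([label]), p) for label, p in child_probs.items()]
    rest = 1.0 - sum(child_probs.values())
    if rest > eps:  # eps is an assumed tolerance for floating-point error
        copies.append((frozenset(), rest))
    return copies
```

With two children, `ind_copies` returns four copies whose probabilities sum to 1, while `mux_copies` returns two or three copies depending on whether the children's probabilities sum to 1.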

2.2. Keyword Query

Usually, we model an XML tree as a labeled ordered tree, in which nodes represent elements and edges represent the direct nesting relationship between nodes. Keyword search on XML documents has been studied more and more in recent years. Given a set of keywords and an XML document, most work takes the LCA or SLCA of the matched nodes as the results. The function computes the Lowest Common Ancestor of nodes . Given keywords and the inverted lists of them. The LCA of these keywords on is defined as denote the child nodes of node on the path from to .

Definition 1 (SLCA on XML data). Given a keyword query on an XML tree, an SLCA query finds the LCA nodes that have no other LCA node as a descendant.

The SLCA is defined as follows:

For example, Figure 3(a) gives a deterministic XML tree generated from Figure 1, together with a query . The result of this query is shown in Figure 3(b). The nodes lab and person are LCA nodes. Because lab is an ancestor of person, the only SLCA node is person. From the definitions of LCA and SLCA, we can see that the SLCA set is a subset of the LCA set.
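The LCA and SLCA semantics can be illustrated with a short Python sketch. It assumes that nodes are identified by Dewey codes (the encoding mentioned in Section 4), represented here as tuples of integers, so that a node's ancestors are the proper prefixes of its code. The function names and the small tree used with them are hypothetical and are not taken from the figures.

```python
from itertools import product


def lca(*codes):
    """LCA of several nodes: the longest common prefix of their Dewey codes."""
    prefix = []
    for parts in zip(*codes):
        if all(p == parts[0] for p in parts):
            prefix.append(parts[0])
        else:
            break
    return tuple(prefix)


def lca_set(inverted_lists):
    """All LCA nodes: one lca() per combination of one match per keyword."""
    return {lca(*combo) for combo in product(*inverted_lists)}


def slca_set(inverted_lists):
    """SLCA nodes: LCA nodes that have no other LCA node as a descendant."""
    lcas = lca_set(inverted_lists)

    def is_proper_ancestor(a, d):
        return len(a) < len(d) and d[:len(a)] == a

    return {v for v in lcas
            if not any(is_proper_ancestor(v, u) for u in lcas)}


# Hypothetical tree: lab = (1,), person = (1, 1), and keyword matches below.
tom_matches = [(1, 1, 1)]
xml_matches = [(1, 1, 2), (1, 2, 1)]
lcas = lca_set([tom_matches, xml_matches])    # lab and person
slcas = slca_set([tom_matches, xml_matches])  # person only
```

This brute-force enumeration over all match combinations is only meant to make the definitions concrete; it is not an efficient SLCA algorithm.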

This paper selects SLCA as the result for keyword search on probabilistic XML data. Because the SLCA set is the smallest set, every SLCA node can be seen as a suitable result for users. An SLCA node is the smallest common ancestor of the nodes that contain the keywords, and we return the subtree rooted at the SLCA node as the result. So, when the classification algorithm classifies the nodes, it needs to put a keyword node and all its child nodes into one set.

A keyword search on a p-document consists of the p-document and a query. We define the answer as the ordinary nodes of the p-document that are SLCA nodes in the possible worlds it generates. The probability of a node being an SLCA in the possible worlds is denoted as . The formal definition is shown as follows: where is the existence probability of the possible world . denotes the possible worlds generated by . indicates that is an SLCA in the possible world .

can also be computed with (4). Here, indicates the existence probability of in the possible worlds. We can compute the existence probability by multiplying the conditional probabilities along the path from the root node to node . is the local probability for being an SLCA node in . denotes a subtree of rooted at . Consider

From (3), we can obtain the following equation to compute : where denotes the local possible worlds generated from . means that is an SLCA in the possible world rooted at ; namely, the root node is the only LCA node. The probability of this subtree rooted at node can be shown as

As an SLCA result of keyword search, the probability of node is

For example, we give an example to compute (we select the node person with the existence probability of ) in Figure 1. We can see that . . So, the result is .
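The relation described above, in which the probability of a node being an SLCA is its existence probability multiplied by its local SLCA probability, can be sketched as follows. The function names and the numbers in the usage lines are hypothetical; they are not the values of the example in Figure 1.

```python
def existence_probability(edge_probs):
    """Multiply the conditional probabilities along the root-to-node path."""
    prob = 1.0
    for p in edge_probs:
        prob *= p
    return prob


def result_probability(edge_probs, local_slca_prob):
    """Existence probability of the node times its local SLCA probability."""
    return existence_probability(edge_probs) * local_slca_prob


# Hypothetical path with conditional probabilities 1.0, 0.8, and 0.5, and a
# node that is an SLCA in half of the local possible worlds of its subtree.
path = [1.0, 0.8, 0.5]
prob = result_probability(path, 0.5)
```

Separating the two factors is what allows the probability to be computed without enumerating the global possible worlds: only the path above the node and the subtree below it are inspected.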

Definition 2 (SLCA on probabilistic XML data). Given a keyword query on a probabilistic XML tree, an SLCA query finds the nodes that are SLCA nodes in the possible worlds, each with a probability equal to the sum of the probabilities of the possible worlds in which the node is an SLCA node.

Definition 3 (Threshold SLCA on probabilistic XML data). Given a keyword query on a probabilistic XML tree, a threshold SLCA query finds the nodes that are SLCA nodes in the possible worlds, each with a probability equal to the sum of the probabilities of the possible worlds in which the node is an SLCA node, and returns only those nodes whose probability is more than the given threshold.

If we ignore the distributional nodes of the p-document, every SLCA node is an LCA node. For a given keyword query and an XML tree, a node is a common ancestor of the query keywords if the subtree rooted at that node contains each keyword at least once. For example, for query , the common ancestor nodes of are lab and person. If we build a tree of the common ancestor nodes according to their relationships in the XML tree, its leaf nodes are the SLCA nodes.

3. Classification of Nodes and Calculation of Probability

A p-document contains two types of nodes, ordinary nodes and distributional nodes. An ordinary node represents actual data in probabilistic XML data, whereas a distributional node only represents the probability distribution of its child nodes.

3.1. Classification of Ordinary Nodes

From Section 2, we can see that once we have the keyword nodes trees, a set intersection over them yields the SLCA nodes quickly. Figures 4(a) and 4(b) show the keyword nodes trees. Applying the set intersection operation to them gives the common ancestor nodes tree shown in Figure 5. So, the key question is how to obtain the keyword nodes trees.
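The set intersection idea can be sketched in Python for the deterministic case, ignoring probabilities and dummy nodes. Each keyword nodes tree is represented as the set of Dewey codes of the matched nodes and all their ancestors; intersecting these sets gives the common ancestor nodes, and the leaves of that intersection are the SLCA nodes. The function names and the example codes are assumptions made for this illustration.

```python
def ancestors_or_self(code):
    """All prefixes of a Dewey code (tuple of ints), including the code itself."""
    return {code[:i] for i in range(1, len(code) + 1)}


def keyword_node_set(matches):
    """Node set of one keyword nodes tree: the matches plus their ancestors."""
    nodes = set()
    for code in matches:
        nodes |= ancestors_or_self(code)
    return nodes


def slca_by_intersection(match_lists):
    """Intersect the keyword node sets and keep the leaves of the result."""
    common = set.intersection(*(keyword_node_set(m) for m in match_lists))
    return {v for v in common
            if not any(len(u) > len(v) and u[:len(v)] == v for u in common)}


# Hypothetical matches for two keywords in a small tree.
first_keyword = [(1, 1, 1)]
second_keyword = [(1, 1, 2), (1, 2, 1)]
result = slca_by_intersection([first_keyword, second_keyword])
```

For this input the common ancestor set is {(1,), (1, 1)} and its only leaf is (1, 1), which matches the observation that the leaves of the common ancestor tree are the SLCA nodes.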

To obtain the keyword nodes tree, we add dummy nodes for each actual node whose subtree contains more than one keyword. If the subtree rooted at a node contains two keywords, we add one dummy node as a sibling of that node. For example, nodes and in Figure 5 each have a dummy node. These dummy nodes do not exist in the actual tree; their purpose is to allow the nodes to be classified effectively.

3.2. Classification of Distributional Nodes

Distributional nodes can represent the probability distribution of the children nodes. A -document defines a probability distribution over a space of deterministic XML documents. According to the different types of the distributional node, the number of copies is different.

Case 1 (the node is an IND node). If it has m children nodes, the number of copies is 2^m. If there is only one child node that contains keyword and the probability of is , the probability that the subtree rooted at contains is . If the number of children nodes that contain keyword is , the probability needs to account for all the copies that contain keyword , and the sum of their probabilities is the probability that the subtree contains keyword . Considering all these copies is a very complicated calculation, so we first calculate the probability that the subtree does not contain the keyword and subtract it from 1. The probability of an IND node is shown as (8).

Case 2 (the node is a MUX node). The number of copies is m or m + 1, and each copy has its own probability. Some of the copies contain the keyword, and only those copies matter for probabilistic keyword search. So, the probability of a MUX node is shown as (9).

Figure 6(a) shows an example of an IND node. For the keyword , IND1 is the parent node of nodes and . The copies in which exists fall into two situations, and their probabilities are and . This means that the probability that the subtree rooted at node IND1 contains the keyword is . Figure 6(b) shows an example of a MUX node. For the keyword , MUX1 is the parent node of nodes and . The copies in which exists fall into two situations, and their probabilities are both . This means that the probability that the subtree rooted at node MUX1 contains the keyword is .
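The two cases can be sketched in Python as follows. This is one reading of the description above, offered as an illustration under assumptions rather than as a transcription of (8) and (9): each input value is taken to be the probability that one child branch both exists and contains the keyword, and the function names and numbers are hypothetical.

```python
def ind_contains_keyword(branch_probs):
    """IND node: children are independent, so the subtree misses the keyword
    only if every keyword-bearing branch is absent; take the complement."""
    absent = 1.0
    for p in branch_probs:
        absent *= 1.0 - p
    return 1.0 - absent


def mux_contains_keyword(branch_probs):
    """MUX node: children are mutually exclusive, so the probabilities of the
    copies that contain the keyword simply add up."""
    return sum(branch_probs)


# Hypothetical branches that carry the keyword.
p_ind = ind_contains_keyword([0.5, 0.5])
p_mux = mux_contains_keyword([0.3, 0.4])
```

Computing the complement for the IND case avoids enumerating every subset of children that contains the keyword, which is the shortcut motivated in Case 1.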

For each keyword, the nodes that contain it and all their ancestor nodes constitute a tree. This tree contains ordinary nodes and distributional nodes. To present the probability distribution of this tree, we delete each distributional node and connect its children nodes to its parent node, attaching the probability that the subtree rooted at the parent node contains the keyword, computed according to the type of the distributional node. For example, Figure 7(a) shows the tree for the keyword Tom. Node is the parent node of node . The probability that the subtree rooted at node contains the keyword Tom is . Figure 7(b) shows the probabilistic tree for the keyword XML.

We then merge all the keyword probabilistic trees into a keyword nodes probabilistic tree. We calculate the SLCA nodes on this tree together with their probabilities and delete the subtrees rooted at the SLCA nodes. Next, we calculate SLCA results on the remaining nodes tree. Repeating this operation generates all the SLCA results. For example, Figure 8 is a keyword nodes probabilistic tree. This tree retains the probabilities of all the subtrees that contain keywords.

3.3. Probability Calculation

If we calculate SLCA results on the tree in Figure 8, the node shown in Figure 9(a) is the only result. We then delete the subtree rooted at node , and the remaining nodes tree is shown in Figure 9(b). Repeating the calculation on the remaining nodes tree, we can see that the node is another result. In this section, we introduce how to calculate the probabilities of all the SLCA nodes.

A distributional node can appear at any position in a tree except the root node and the leaf nodes. If node is an SLCA node and there are two keywords , computing the probability of node requires 3 values from (7), one for each of the paths r1, r2, and r3. As shown in Figure 10, r1 represents the path from the root node to node , and r2 and r3 represent the paths from node to the two keyword nodes, respectively. So, if there are k keywords, we need to divide this tree into k + 1 parts.

Next, we discuss how to compute probability according to 4 situations.

Case 3 (r1, r2, and r3 do not contain distributional nodes). If node is an SLCA node and there are no distributional nodes in r1, r2, and r3, this situation is the same as for deterministic XML data, and the probability is 1.

Case 4 (r1 contains distributional nodes). If r1 contains distributional nodes, node will not appear in some possible worlds. All the distributional nodes in r1 influence the probabilities of other nodes in the keyword nodes probabilistic tree (as in Figure 8). Figure 11 shows the calculation in a keyword nodes probabilistic tree. The path from node to node contains a distributional node, and the probability of the path is 0.8. So, when we finish the calculation of node , a new probability must be added to node , as shown in Figure 11.

Case 5 (r2 or r3 contains distributional nodes). If the path from node to a keyword matched node contains distributional nodes, this keyword matched node will not appear in some possible worlds. In this situation, every probability of the keyword matched node needs to be recorded, as in Figures 7(a) and 7(b). Figure 7(a) shows the keyword probabilistic tree. The node has two branches: one is the path consisting of nodes and , and the other is the path consisting of nodes and . The probabilities of these two paths are 0.7 and 0.8, as shown in Figure 7(a). The same holds for the keyword in Figure 7(b). Figure 8 is a keyword nodes probabilistic tree. For the first keyword in the second path ( and ), the node contains another keyword XML, so the node only needs to record the probability 0.7. For the second keyword XML, because one of the situations in which the node contains it has probability 1, the other probabilities do not have to be recorded, as shown in Figure 7(b).

Case 6 (r1, r2, and r3 all contain distributional nodes). This situation combines the two preceding cases. The processing methods have been introduced above, so here we only need to decide which one to compute first. Because SLCA nodes are the lowest nodes among all the common ancestor nodes, SLCA results need to be computed bottom-up. In Figure 10, the nodes in r2 and r3 are descendants of the nodes in r1, so the situation of r2 and r3 has priority. The situation of r1 is then computed as the second step.

4. ELM Keyword Search on Probabilistic XML Data

Keyword search on probabilistic XML data based on classification consists of four steps: (1) add dummy nodes according to the number of keywords; (2) classify nodes with ELM based on keyword, according to the type of the nodes; (3) use the set merging operation to build the common ancestor nodes probabilistic tree; and (4) repeat the operation of calculating SLCA nodes and deleting their subtrees until all the SLCA results have been generated. The key to keyword search on a p-document is how to calculate the probabilities of the SLCA results. Step and Step both involve the computation of the probabilities.

Each node carries two kinds of information: its code and the keyword it contains. A distributional node carries a third kind of information, its probability. The code is used to determine the relationships between nodes, for example, to find common ancestor nodes. The keyword contained in a node is the key to keyword search. When we use ELM to classify nodes, the keyword serves as the label of the classification set, so every set represents one keyword. For a given query, we find the sets of all the keywords given by the user and merge these sets to obtain a keyword nodes tree.

Next, we introduce four steps of the keyword search algorithm using ELM to classify nodes on probabilistic XML data one by one.

First, add dummy nodes according to the number of keywords. The probabilistic XML tree in Figure 1 contains two keywords . The algorithm uses Dewey codes to encode the XML tree. The first step adds dummy nodes for every node whose subtree contains keywords. If the subtree rooted at a node has k keywords, we add k - 1 dummy nodes. For example, in Figure 1, the nodes , , and are added to the XML tree as part of the dummy nodes tree. Each node in the probabilistic XML tree has a table, as shown in Figure 9. Based on the number of nodes that contain keywords, the dummy nodes to be added are shown in Table 1.

Second, classify nodes with ELM based on keyword, according to the type of the nodes. The nodes to be classified consist of all the nodes of the dummy nodes tree together with the distributional nodes. ELM classifies the nodes into two sets, as shown in Figure 12(a). The first set represents the keyword Tom, and the second set represents the keyword XML. Each distributional node has a probability, which is the probability that the subtree rooted at the distributional node contains the keyword. For example, the node IND1 has the probability , which means that the probability that the subtree rooted at IND1 contains the keyword Tom is .

Then, we delete all the distributional nodes and connect their children nodes to their parent nodes. The probability of a distributional node is moved to its child node. For example, in Figure 12(b), the node receives the probability from its parent node .

Third, use the set merging operation to build the common ancestor nodes probabilistic tree. The intersection of the two sets is the set that includes nodes and . Figure 13 shows the union of the two keyword sets.

Finally, repeat the operation of calculating SLCA nodes and deleting their subtrees until all the SLCA results have been generated. Let us calculate the SLCA result on the tree shown in Figure 13. The node with the SLCA probability of is generated. The subtree rooted at is then deleted, and the probability is left to the remaining nodes tree. Next, the node is another result, and its probability is : because the existence probability of the node Tom is and the existence probability of node is , the SLCA probability of is .

From (4), the result probability is the product of the existence probability and the SLCA probability. The existence probability of the node is , so the result probability is . Moreover, the existence probability of the node is , and the result probability of the node is .

5. ELM-Threshold Keyword Search on Probabilistic XML Data

Pruning can speed up retrieval: we can delete any nodes that cannot be SLCA nodes while calculating the SLCA results. According to the definition of SLCA, if a node is an SLCA node, none of its ancestor nodes is an SLCA node. If there are no distributional nodes in and , node is an SLCA node and its parent node is not an SLCA node. For example, node is an SLCA node, so its parent node is not an SLCA result.

Some results have very small probabilities because their existence probabilities are very low. For most users’ query intentions, a node whose probability is higher than that of other nodes has higher query value. So, this section introduces the concept of a probability threshold.

Case 7 (r1 contains distributional nodes). Because node is an SLCA node, when we finish the calculation of node , the probability needs to be added to its ancestor nodes. As shown in Figure 9, the value has been retained in the remaining nodes tree in Figure 9(b). So, if this probability value is less than the probability threshold, the probability of node must be less than the probability threshold. All the ancestors can then be deleted, because they are not threshold SLCA nodes.

Case 8 (r2 or r3 contains distributional nodes). There are distributional nodes in r2 or r3. When we finish the calculation of node with 2 keywords, 3 situations remain, and we need to consider all of them. Each situation has the probability . If the probability value is less than the probability threshold, the probability of node must be less than the probability threshold, and all the ancestors can be deleted because they are not threshold SLCA nodes. If there are no and in the subtree rooted at , the probability is . If this value is less than the probability threshold, the same pruning applies.

For a probability threshold , when we calculate the probability on the keyword nodes probabilistic tree, a result whose probability is less than the threshold is not a threshold SLCA node. For example, if the probability threshold is , the SLCA probability of the second result is , which is less than the probability threshold, so the node is not a threshold SLCA node. When we compute the probability of the first result, the node , the probability is left to the remaining nodes tree. Because this value is less than the probability threshold, the probability of every node in the remaining nodes tree must be less than the threshold, and we need not calculate SLCA on the remaining nodes tree. The threshold SLCA computation therefore finishes as soon as the first result has been computed.
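The early termination described above can be sketched in Python. The sketch assumes that the SLCA candidates are already available in bottom-up order and that each candidate comes with its SLCA probability and with the probability left to the remaining nodes tree once its subtree is deleted. The function name, the tuple layout, and all numbers are hypothetical.

```python
def threshold_slca(candidates, threshold):
    """Return the candidates whose SLCA probability exceeds the threshold.

    candidates: list of (node, slca_prob, remaining_prob) in bottom-up order,
    where remaining_prob is the probability left to the remaining nodes tree
    after the subtree of this candidate has been deleted (an assumed input).
    """
    results = []
    for node, slca_prob, remaining_prob in candidates:
        if slca_prob > threshold:
            results.append((node, slca_prob))
        if remaining_prob < threshold:
            # No node in the remaining tree can reach the threshold,
            # so the rest of the computation is skipped.
            break
    return results


# Hypothetical candidates: the second one is never examined when the
# probability left after the first one is already below the threshold.
candidates = [("person", 0.6, 0.3), ("lab", 0.25, 0.0)]
kept = threshold_slca(candidates, 0.4)
```

With a lower threshold such as 0.2, the loop continues past the first candidate and both nodes are returned, which shows how the threshold controls the amount of work as well as the size of the result.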

6. Performance Verification

In this section, we evaluate the performance of keyword search on probabilistic XML data with ELM-based classification. All the experiments with ELM-based classification algorithms are carried out in MATLAB 2007 on a Pentium 4 CPU at 2.53 GHz.

The datasets we used are shown in Table 2; we select two datasets, XMARK and DBLP. For each XML dataset, we generate the corresponding probabilistic XML tree using the same method as in [5]. Table 3 shows the datasets and keywords. We visit the nodes of the original XML tree in preorder. For each node visited, we randomly generate some distributional nodes as its children. We then make the original children of the node children of the newly generated distributional nodes and assign them random probability distributions, with the restriction that the probabilities of the children of a MUX node sum to no more than 1. The keyword has 8 situations. The number of keywords is 2 to 3.

Compared with traditional computational intelligence techniques, ELM provides better generalization performance at a much faster learning speed and with the least human intervention. We compare the query times of two approaches to keyword search on probabilistic XML data. The first uses the method in [17] to retrieve the SLCA nodes, and the second uses SVM or ELM to classify nodes for keyword search on a p-document. The speed of classification is shown in Figure 14.

From Figure 14, we can see that ELM is faster than Prstack and SVM. Prstack computes the probabilities of all the ancestor nodes of each keyword node and records every situation of the nodes that contain keywords. ELM classifies nodes according to their codes and keywords in a single pass over the nodes. So, the algorithm that classifies with ELM is faster than the one that classifies with SVM.

Next, we compare the query times of three situations for keyword search on probabilistic XML data. The first uses ELM to classify nodes for keyword search on a p-document, the second adds the probability threshold to SVM-based classification, and the third adds the probability threshold to ELM-based classification. The speed of classification is shown in Figure 15.

From Figure 15, we can see that the probability threshold is beneficial. With the threshold set to 0.4, the SVM-threshold and ELM-threshold algorithms, which add the probability threshold, improve on the time efficiency of the plain ELM algorithm, and ELM-threshold is faster than SVM-threshold. The pruning algorithm can be more than 1.3 times faster than the first algorithm. With the probability threshold, nodes whose existence probability is less than the threshold are deleted, and probabilities that fall below the threshold during the computation are discarded as well.

7. Related Work

Recently, keyword search has been studied extensively on traditional XML data. For a keyword query and an XML tree, most related work takes SLCA or ELCA nodes as the results to be returned. XRANK [1] developed a stack-based algorithm to compute SLCA. Reference [2] introduced two algorithms: the Indexed Lookup Eager algorithm for keywords that appear with significantly different frequencies and the Scan Eager algorithm for keywords with similar frequencies. Reference [3] designed an MS approach to compute SLCA for keyword queries in multiple ways. Reference [4] used set intersection to compute SLCA. Reference [18] proposed the Indexed Stack algorithm to find ELCA. The probabilistic XML model has also been studied recently. Reference [6] first introduced a probabilistic XML model with the probabilistic types IND and MUX, where IND means independent and MUX means mutually exclusive. Reference [5] summarized and extended the probabilistic XML models previously proposed and discussed the expressiveness and tractability of queries on different models with respect to IND and MUX. Reference [17] addressed the problem of keyword search on probabilistic XML data and computed SLCA by scanning the keyword inverted lists once. Different from all the above work, we adopt set intersection to compute SLCA on probabilistic XML data.

8. Conclusions

In this paper, we have addressed the problem of keyword search on general probabilistic XML data, adopting the probabilistic XML model with IND and MUX distributional nodes. Given a probabilistic XML tree, a set of keywords, and a probability threshold, we have discussed the challenges of finding the SLCA results whose probability is more than the threshold. Our algorithm computes the SLCA probabilities without generating possible worlds. We use ELM to classify nodes and then adopt set intersection to compute SLCA on the classified node sets. Keyword search on probabilistic XML data has received much attention in the literature, and finding efficient query processing methods for it is an important topic in this area. Node classification is an important operation in our approach, and ELM speeds up retrieval by making this classification fast. The experiments have demonstrated the efficiency of our algorithms.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Yue Zhao, Ye Yuan, and Guoren Wang were supported by the NSFC (Grant nos. 61025007, 61328202, and 61100024), National Basic Research Program of China (973, Grant no. 2011CB302200-G), National High Technology Research and Development 863 Program of China (Grant no. 2012AA011004), and the Fundamental Research Funds for the Central Universities (Grant no. N130504006).