Abstract

For many applications, finding rare instances or outliers can be more interesting than finding common patterns. Existing work on outlier detection, however, has not considered the deep web setting. In this paper, we argue that, for many scenarios, it is meaningful to detect outliers over the deep web, where users must submit queries through a query interface to retrieve the underlying data. As a result, traditional data mining methods cannot be applied directly. The primary contribution of this paper is a new method for outlier detection over the deep web. In our approach, the query space of a deep web data source is stratified based on a pilot sample. On top of this stratification, we develop neighborhood sampling to improve recall and uncertainty sampling to improve precision. Finally, a careful performance evaluation confirms that our approach can effectively detect outliers in the deep web.

1. Introduction

As a result of the rapid development of e-commerce, the deep web has been increasingly valued by data mining researchers in recent years. The deep web, a term coined in contrast to the surface web, refers to data sources with back-end databases that are only accessible through a query interface [1]. Currently, the vast majority of research on the deep web considers how to build an interactive query system or a vertical search system using data integration technologies [2, 3]. Few studies have conducted data mining over the deep web. Mining deep web data sources has unique challenges: access to a deep web data source is limited, and data can only be obtained through a query interface. A query interface consists of a number of input attributes with which users set up their queries. An online database returns the data matching a query by dynamically generating web pages that include one or more output attributes.

This paper focuses on the problem of outlier detection over the deep web. To the best of our knowledge, this problem has not yet been addressed in existing work. An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Detecting outliers over the deep web has great practical significance. For example, outliers may be commodities with abnormal prices caused by mistakes during data entry. Third-party collaborators of a website want to detect such outliers in time and notify the person responsible for the website to correct the data and reduce losses, while website users have a strong interest in finding these commodities.

Outlier detection has always been a hot research topic in the field of data mining, and many results on this problem have been published in recent years. As the survey [4] describes, outlier detection techniques can be broadly divided into distance-based approaches [5, 6], density-based approaches [7, 8], clustering-based approaches [9], and information theoretic approaches [10]. However, these outlier detection methods require direct access to the distribution of the underlying data, which is impractical in the context of the deep web. To the best of our knowledge, this is the first work on detecting outliers in the deep web.

A naive solution for outlier detection in the deep web is to download all the records from the back-end database and then mine its outliers using a traditional outlier detection approach discussed in the survey [4]. As stated earlier, records from the back-end database can only be retrieved by submitting queries to its query interface. Since queries consume server resources and time, and since the interface imposes restrictions such as top-k limits on the returned records and limits on the number of queries per IP address, this naive method is prohibitively costly and impractical.

Thus, a practical solution is to randomly sample the back-end database of a deep web source to detect outliers. The back-end database is a kind of hidden database. Sampling hidden databases has been studied in [11], which proposed a random walk scheme over the query space provided by the interface to randomly sample such databases; the dominant cost considered there is the computation over the retrieved data records. Unlike that setting, in our scenario samples can only be obtained by querying the deep web, and the cost of submitting queries is far more expensive than computation or memory cost. Therefore, the sampling cost, which refers to the number of distinct queries that must be issued to obtain a sample from the deep web, is the dominant factor in detecting outliers over the deep web. Furthermore, random sampling is often used as a baseline method in statistics. Because outliers are rare, random sampling intrinsically achieves poor recall; it would require a very large sampling cost to detect a reasonable number of outliers, which is not acceptable in the context of the deep web.

Our approach to outlier detection over the deep web is primarily related to distance-based outlier detection. We formally consider an instance to be an outlier if the percentage of instances in the database that lie at more than a given distance from it exceeds a percentage threshold. However, this true percentage is unavailable and has to be estimated from the sample we have obtained. Instead of making a binary determination of whether an instance is an outlier or not, it is more suitable to assign each instance a probability of being an outlier. Estimating this probability is a difficult issue; in this paper, we estimate the probability of an instance being an outlier based on its estimated percentage.

In summary, this paper first presents a completely novel problem, outlier detection in the deep web, and then proposes and empirically evaluates a stratification-based outlier detection method for it. The contributions of our solution can be summarized as follows. First, we develop a stratification scheme for a deep web data source, in which the stratification is done through a hierarchical tree that models the relationship between the input and output attributes based on a pilot sample. Second, instead of random sampling across the strata, we develop a neighborhood sampling scheme for collecting more outliers, which explores the query subspaces with a high probability of containing outliers. Finally, we develop an uncertainty sampling algorithm to verify the uncertain instances in order to improve the outlier detection precision.

The rest of this paper is organized as follows. In Section 2, we introduce the preliminary concepts of the deep web and outliers. Section 3 elaborates the method we propose for detecting outliers in the deep web. The comparison methods and experimental results are presented in Section 4. Section 5 briefly reviews related work. In Section 6, we give our concluding remarks.

2. Preliminary Concepts

Before we discuss our solution for outlier detection over the deep web, this section introduces the basic process of sampling in the deep web environment and the notion of distance-based outliers.

2.1. Process of Sampling in the Deep Web

Let us consider an example where Table 1 shows a part of a real deep web back-end database which contains 10 instances. Each column of Table 1 represents an attribute. Attributes can be divided into three types according to their domains: categorical, continuous, and text. The website provides users with a query interface that contains a set of attributes, which we call the input attributes; “Brand,” “Type,” and “Screen” are the three input attributes in our example. A user query can be viewed as an assignment of values to a subset of the input attributes of the query interface, and the corresponding query results are returned in the form of HTML. For example, if a user queries the database with Screen = 13.3, then the 4th and 10th instances are returned as query results, including the two output attributes Price and StandBy. In most cases, we aim to mine the output attributes of interest; in this example, we are most interested in finding abnormal prices. The process of sampling the deep web is a repeated process of querying the back-end database. It should be noted that there are a number of constraints on the query interface, such as top-k constraints and IP restrictions. Therefore, a common objective in mining the deep web is to minimize the number of queries issued through the query interface.
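To make the querying process concrete, the following toy Python sketch (our own illustration, not part of any real system) mimics a back-end table that can only be reached through assignments of input attributes; the laptop records and values are invented.

records = [
    {"Brand": "Dell",   "Type": "XPS",     "Screen": 13.3, "Price": 999,  "StandBy": 8.0},
    {"Brand": "Lenovo", "Type": "Yoga",    "Screen": 14.0, "Price": 849,  "StandBy": 7.5},
    {"Brand": "Apple",  "Type": "MacBook", "Screen": 13.3, "Price": 1299, "StandBy": 10.0},
]

def query(db, **conditions):
    # Return the records matching the input-attribute assignment,
    # exposing only the output attributes (Price, StandBy).
    matches = [r for r in db if all(r.get(k) == v for k, v in conditions.items())]
    return [{"Price": r["Price"], "StandBy": r["StandBy"]} for r in matches]

print(query(records, Screen=13.3))  # the records with Screen = 13.3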

2.2. DB-Outlier

Among the existing methods of detecting outliers, distance-based outlier (DB-Outlier) detection is one of the most commonly used and simplest approaches. An object $O$ in a dataset $D$ is a DB$(pct, d)$-outlier if at least a percentage $pct$ of the objects in $D$ lie at a distance greater than $d$ from $O$; that is, the cardinality of the set $\{O' \in D \mid \mathrm{dist}(O, O') \le d\}$ is less than or equal to $(100 - pct)\%$ of the size of $D$. For example, with appropriate values of $pct$ and $d$, the 9th instance in Table 1 would be an outlier if we only consider the Price attribute as the output attribute of interest.
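As an illustration, the following Python sketch (ours, with invented price values) checks the DB(pct, d) condition by brute force on a single output attribute; the record itself contributes distance 0 and therefore never counts as "far".

def is_db_outlier(value, dataset, pct, d):
    # DB(pct, d)-outlier: at least pct percent of the dataset lies farther than d away.
    far = sum(1 for other in dataset if abs(other - value) > d)
    return far >= (pct / 100.0) * len(dataset)

prices = [999, 1049, 980, 1020, 995, 1010, 1005, 990, 4999, 1015]  # hypothetical Price column
print([p for p in prices if is_db_outlier(p, prices, pct=90, d=500)])  # -> [4999]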

To the best of our knowledge, outlier detection in the deep web is a completely novel problem that couples outlier detection with the deep web, and no existing solution is available. In the following section, we introduce our solution step by step.

3. Stratification-Based Outlier Detection

Having introduced the basic process of sampling and outlier detection in the deep web environment, we now elaborate the problem and our proposed method. This paper primarily considers the case of categorical input attributes; continuous input attributes can be discretized and treated as categorical ones.

Typically, given a query composed of values of one or more of the input attributes, a deep web data source will return the number of data records satisfying the query. Using this information, the distribution of the input attributes can be obtained. Since the distribution of the output attributes is unknown, discovering outliers on the output attributes is a great challenge. However, if the relationship between the input attributes and the output attributes is known, we can identify outliers on the output attributes by using the distribution of the input attributes. In our method, the relationship between the input attributes and the output attributes is built by stratification, which is the process of dividing an entire population into subpopulations based on a pilot sample.

There are two other important steps in our approach after stratification: neighborhood sampling and uncertainty sampling. The goal of these two sampling steps is to collect more outliers while keeping a suitable precision under a limited cost. For each record we obtain, we assign a probability of being an outlier and classify the record into one of three classes (outlier, normal, or uncertain) based on this probability.

In general, our approach proceeds as follows. We first obtain a pilot sample by randomly sampling the deep web. Stratification of the population is conducted based on the pilot sample. Then, neighborhood sampling across the subpopulations is performed to collect more outliers. Next, uncertainty sampling is performed to examine the uncertain records and avoid misjudgment. Each step of our approach is explained in detail below.

3.1. Stratification

When a population is known to consist of several significantly different parts, it is usually divided into subpopulations, called strata, so that samples adequately reflect the distribution of the population. In our algorithm, stratification is performed so that data records contained in the same stratum are as similar as possible; thus, outliers are isolated in a few strata. The whole data in a deep web data source can be considered as the entire query space, whereas the subpopulations correspond to query subspaces. After the stratification process, the distribution of the output attributes can be predicted effectively from the values of the input attributes in each stratum, and submitting queries from the corresponding subspace allows us to obtain the subpopulation's data records. The primary purpose of stratification is to identify and group similar data records into each stratum. Thus, how to perform stratification is an important issue.

We adopt the strategy of building a hierarchical tree to stratify the deep web data source. Formally, let IS denote the set of input attributes, $DM(A)$ the domain of an input attribute $A$, and OS the set of output attributes of interest. For a leaf node $N$, let $q_N$ represent its corresponding query, which is composed of a subset of the input attributes. Under the query space of node $N$, we define the radius of the corresponding subpopulation $P_N$ as

$r(q_N) = \frac{1}{n_N} \sum_{t \in P_N} \mathrm{dist}\bigl(t[OS], c_N\bigr), \quad (1)$

where $t$ denotes a data record of node $N$'s subpopulation, $t[OS]$ denotes the values of the output attributes in $t$, $c_N$ is the center of the subpopulation corresponding to node $N$, and $n_N$ is the size of that subpopulation. For a deep web data source, where the data is not directly accessible, the radius is estimated based on a sample. Here, we introduce another concept: the potential splitting input attributes PS are defined as the subset of input attributes that are not contained in the query $q_N$, that is, $PS = IS \setminus \mathrm{attr}(q_N)$. When stratifying a leaf node, we choose a splitting input attribute from PS.

For a potential splitting input attribute $A \in PS$ with domain $DM(A) = \{a_1, \dots, a_m\}$, the decrease of radius is computed as

$\Delta r(A) = r(q_N) - \sum_{j=1}^{m} P\bigl(A = a_j \mid q_N\bigr)\, r\bigl(q_N \wedge (A = a_j)\bigr), \quad (2)$

where $P(A = a_j \mid q_N)$ is the conditional probability that $A$ takes the $j$th value of its domain under the query space $q_N$. The conditional probability can be computed as

$P\bigl(A = a_j \mid q_N\bigr) = \frac{\mathrm{COUNT}\bigl(q_N \wedge (A = a_j)\bigr)}{\mathrm{COUNT}(q_N)}, \quad (3)$

where the function $\mathrm{COUNT}(q)$ returns the number of data records under the space of query $q$ in the back-end database, which is supported by most websites. We can even use the method proposed in the literature [12] to realize this function if it is not supported. The potential splitting input attribute with the largest decrease of radius is chosen to split the space of node $N$. Using $A$ to stratify the query space of $N$, the node splits into $m$ child nodes, where each child node denotes a subpopulation. The radius $r(q_N \wedge (A = a_j))$ of the $j$th child node generated by the splitting input attribute $A$ is computed in the same way as $r(q_N)$. We then iterate the stratification on each child node, whose query in the next iteration is $q_N \wedge (A = a_j)$. The whole process eventually forms a hierarchical tree in which each leaf node represents a stratum.
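The following Python sketch illustrates how (1)–(3) could be evaluated from a pilot sample. It assumes the radius in (1) is the average Euclidean distance to the stratum center, and it estimates the conditional probabilities in (3) from the pilot records instead of issuing COUNT queries, which is a simplification.

from collections import defaultdict

def radius(records, output_attrs):
    # Average Euclidean distance to the center over the output attributes, cf. (1).
    if not records:
        return 0.0
    center = [sum(r[a] for r in records) / len(records) for a in output_attrs]
    dist = lambda r: sum((r[a] - c) ** 2 for a, c in zip(output_attrs, center)) ** 0.5
    return sum(dist(r) for r in records) / len(records)

def radius_decrease(records, split_attr, output_attrs):
    # Decrease of radius when splitting on split_attr, cf. (2), with the
    # conditional probabilities of (3) estimated from the pilot records.
    groups = defaultdict(list)
    for r in records:
        groups[r[split_attr]].append(r)
    weighted = sum(len(g) / len(records) * radius(g, output_attrs) for g in groups.values())
    return radius(records, output_attrs) - weighted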

In most cases, without a stopping criterion the input query space would be overstratified so that each stratum contains only one sample. Thus, we utilize a statistical hypothesis test to check whether the decrease of radius is significant. The idea behind the hypothesis test is that, if there is no significant relationship between the splitting input attribute $A$ and the output attributes OS, the distribution of the output attributes in node $N$ would be similar to that in each of its child nodes; this means that there would be little reduction in the radius after splitting the leaf node $N$. Specifically, we use the following hypothesis test [13]:

H0: there is no significant relationship between $A$ and the output attributes OS.
H1: there is a significant relationship between $A$ and the output attributes OS.

The statistic for the hypothesis test is computed as

$Z = \frac{r(q_N) - \bar{r}_A}{\sqrt{s_A^2 / |DM(A)|}}, \quad (4)$

where $r(q_N)$ denotes the radius of the leaf node $N$, $\bar{r}_A$ denotes the average radius of the child nodes of $N$ obtained by splitting with the input attribute $A$, $|DM(A)|$ denotes the size of the domain of $A$, and $s_A^2$ denotes the sample variance of the radii of the child nodes. With a significance level $\alpha$, the decision rule is as follows: if $Z > z_\alpha$, we reject the hypothesis H0 and declare that there is a significant relationship between $A$ and the output attributes OS. Otherwise, we accept H0 and declare that there is no significant relationship between $A$ and the output attributes OS. The value of $z_\alpha$ can be obtained from standard statistical tables.
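A minimal sketch of the test, assuming the one-sided z-type statistic of (4) as reconstructed above and the critical value 1.96 used later in our experiments:

import statistics

def split_is_significant(parent_radius, child_radii, z_alpha=1.96):
    # Standardized decrease of radius, cf. (4); reject H0 when it exceeds z_alpha.
    m = len(child_radii)
    if m < 2:
        return False
    mean_child = statistics.mean(child_radii)
    s = statistics.stdev(child_radii)  # sample standard deviation of the child radii
    if s == 0.0:
        return parent_radius > mean_child  # degenerate case: any strict decrease counts
    z = (parent_radius - mean_child) / (s / m ** 0.5)
    return z > z_alpha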

In Algorithm 1, we summarize the overall process of splitting a node $N$, which is associated with the query $q_N$ and a list of potential splitting attributes PS. The input to the algorithm also includes a significance level $\alpha$ and a leaf node set LS, which is used to store the leaf nodes of the tree. At the beginning, the entire query space is represented by the root node; the corresponding query of the root node is null, the potential splitting attribute list is the complete set of input attributes of the data source, and the initial set of leaf nodes is empty.

(1) MaxDecrease ← 0
(2) SplitAttribute ← null
(3) for each A ∈ PS do
(4)  compute Δr(A) using (2)
(5)  if Δr(A) > MaxDecrease then
(6)   MaxDecrease ← Δr(A)
(7)   SplitAttribute ← A
(8)  end if
(9) end for
(10) m ← size of SplitAttribute's domain DM
(11) initialize the radii r_1, …, r_m of the child spaces
(12) for j ← 1 to m do
(13)  compute r_j using (1) based on the pilot sample
(14) end for
(15) r_bar ← (r_1 + … + r_m) / m
(16) s^2 ← sample variance of r_1, …, r_m
(17) compute the statistic Z using (4)
(18) if Z > z_alpha then
(19)  for each a_j ∈ DM do
(20)   q_j ← q_N ∧ (SplitAttribute = a_j)
(21)   apply Algorithm 1 to child node N_j with (q_j, PS \ {SplitAttribute}, alpha, LS)
(22)  end for
(23) else
(24)  LS ← LS ∪ {N}
(25) end if

For each potential splitting attribute, the radius decrement is computed according to (2), and the attribute SplitAttribute yielding the maximum decrement is selected, with its domain size $m$ recorded, in Lines (1)–(10). Following this, the radius of each of its child nodes is estimated according to (1) based on the pilot sample in Lines (11)–(14). Then, the statistical hypothesis test is conducted according to (4). If the null hypothesis is accepted, which means the radius decrement is not significant, node $N$ is set to a leaf node and is included in the leaf node set LS. Otherwise, the space of node $N$ is split by SplitAttribute, and $m$ children are generated for node $N$. The associated query of each child node is updated (Line (20)), and the set of potential splitting attributes becomes PS \ {SplitAttribute}. The splitting process is then applied to the child nodes of node $N$ in Line (21). The algorithm stops when no node can be further stratified.

3.2. Neighborhood Sampling

After stratification, similar data objects tend to come from the same query subspace or neighboring query subspaces. This indicates that we can obtain more outliers from a query subspace, or its neighboring subspaces, in which we have already identified outliers. Thus, the key consideration is how to identify a data object's abnormality in the deep web. For a data object $t$, let $F(t)$ denote the fraction of data objects at distance greater than a given distance $d$ from $t$. Combined with the definition of DB-Outlier described above, we can identify the data object $t$ as an outlier if its fraction $F(t)$ is greater than a given threshold $pct$. As it is impractical to download all the data in the deep web, we have to estimate this fraction based on the stratification.

Suppose that we have obtained $L$ strata after stratification and that a sample $S_i$ has been obtained for each stratum from the pilot sample; let $n_i$ denote the number of samples drawn from the $i$th stratum, namely, $n_i = |S_i|$. To facilitate our description, for a data object $t$ we define an indicator variable as

$X_{ij}(t) = 1$ if $\mathrm{dist}(t, t_{ij}) > d$, and $X_{ij}(t) = 0$ otherwise, $\quad (5)$

where $t_{ij}$ denotes the $j$th data record sampled from the $i$th stratum and $\mathrm{dist}(\cdot,\cdot)$ denotes the Euclidean distance between a pair of data objects. Now, we introduce our estimator of the fraction as

$\hat{F}(t) = \sum_{i=1}^{L} W_i \cdot \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}(t), \quad (6)$

where $W_i = N_i/N$ denotes the proportion of the data in the $i$th stratum, with $N_i$ the size of the $i$th stratum and $N$ the total population size. It turns out that $\hat{F}(t)$ is an unbiased estimator; that is, $E[\hat{F}(t)] = F(t)$.
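A sketch of the estimator in (5)-(6); the argument names (per-stratum sample lists, the stratum weights $W_i = N_i/N$, and a distance function on the output attributes) are illustrative.

def estimated_fraction(t, strata_samples, weights, d, dist):
    # Stratified estimate of F(t), the fraction of records farther than d from t, cf. (5)-(6).
    f_hat = 0.0
    for samples, w in zip(strata_samples, weights):
        if samples:
            x = [1.0 if dist(t, s) > d else 0.0 for s in samples]  # indicator of (5)
            f_hat += w * sum(x) / len(x)
    return f_hat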

The variance of $\hat{F}(t)$ is

$\mathrm{Var}\bigl(\hat{F}(t)\bigr) = \sum_{i=1}^{L} \frac{W_i^2}{n_i}\,\sigma_i^2(t), \quad (7)$

where $\sigma_i^2(t)$ denotes the variance of the indicator variable in the $i$th stratum.

According to (6), $W_i$ and $n_i$ are fixed for the $i$th stratum. Therefore, the variance of $\hat{F}(t)$ can be written as

$\mathrm{Var}\bigl(\hat{F}(t)\bigr) = \sum_{i=1}^{L} \frac{W_i^2}{n_i}\,\pi_i(t)\bigl(1 - \pi_i(t)\bigr), \quad (8)$

where $\pi_i(t)$ denotes the probability that a record drawn from the $i$th stratum lies at distance greater than $d$ from $t$.

Using the definition in (5), $X_{ij}(t)$ is a typical 0-1 distributed random variable with

$E\bigl[X_{ij}(t)\bigr] = \pi_i(t), \quad (9)$

where $\pi_i(t)$ is the probability that $X_{ij}(t) = 1$. Thus,

$\sigma_i^2(t) = \mathrm{Var}\bigl(X_{ij}(t)\bigr) = \pi_i(t)\bigl(1 - \pi_i(t)\bigr), \quad (10)$

which is nonzero only when $\pi_i(t) \neq 0$ and $\pi_i(t) \neq 1$. Combining the above, we immediately obtain (8).
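Correspondingly, a plug-in estimate of the variance in (8) can be computed as follows (same illustrative arguments as in the previous sketch).

def estimated_variance(t, strata_samples, weights, d, dist):
    # Plug-in estimate of Var(F_hat(t)) from (8), using per-stratum Bernoulli variances.
    var = 0.0
    for samples, w in zip(strata_samples, weights):
        n_i = len(samples)
        if n_i == 0:
            continue
        pi_i = sum(1.0 for s in samples if dist(t, s) > d) / n_i  # estimate of P(X = 1)
        var += (w ** 2) * pi_i * (1.0 - pi_i) / n_i
    return var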

Since query submission is expensive, we need to collect more outliers at a low query cost; in other words, we want to retrieve as many outliers as possible within a given query budget. If the probability of each stratum containing outliers were known, we could achieve this goal easily by assigning more of the query budget to the strata with high probabilities.

For each data record $t$ in the pilot sample, we can identify the abnormality of $t$ by computing its estimated fraction $\hat{F}(t)$. Thus, the probability of the $i$th stratum containing outliers can be estimated as $\rho_i = o_i / n_i$, where $o_i$ is the number of outliers identified in the $i$th stratum. For the purpose of finding more outliers while preserving the underlying distribution, our method tends to allocate more samples to a stratum whose probability is high or whose population is large. With a given sample budget $k$ across the strata, the sample allocation for the $i$th stratum is computed as

$n_i' = k \cdot \frac{\rho_i N_i}{\sum_{j=1}^{L} \rho_j N_j}. \quad (11)$

This shows that the sample size of the $i$th stratum is proportional to its outlier probability and its population. It should be noted that there is a greedy alternative that aims to maximize the number of retrieved outliers: sort the strata in descending order of $\rho_i$ and then assign $N_i$ samples to each stratum one by one until $k$ samples have been collected, where $N_i$ is the size of the population of the $i$th stratum. However, this greedy scheme breaks the consistency between the sample distribution and the underlying distribution, while our sample allocation does not.
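A sketch of the allocation rule in (11); the fallback to proportional allocation when no outliers are seen in the pilot sample is our own simplification.

def neighborhood_allocation(outlier_probs, populations, budget):
    # Allocate the neighborhood-sampling budget proportionally to rho_i * N_i, cf. (11).
    scores = [p * n for p, n in zip(outlier_probs, populations)]
    total = sum(scores)
    if total == 0:
        scores, total = populations, sum(populations)  # no outliers seen in the pilot sample
    return [round(budget * s / total) for s in scores]

print(neighborhood_allocation([0.10, 0.02, 0.0], [1000, 5000, 4000], budget=200))  # -> [100, 100, 0]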

3.3. Uncertainty Sampling

In this subsection, we introduce our uncertainty sampling method. The uncertainty here refers to the possibility of a data record being an outlier. According to the degree of uncertainty, data records can be divided into three classes: (a) the outlier class; (b) the normal class; and (c) the uncertain class. For data records in class (a) or (b), we can identify their abnormality with confidence. For data records in class (c), however, there is a high possibility of misjudgment. A misjudgment, labeling a true outlier as normal or a true normal record as an outlier, leads to low precision. The fundamental reason for misjudgment is the discrepancy between the distribution of the samples and that of the underlying population, which leads to an incorrect estimate of the fraction. To solve this problem, we develop uncertainty sampling with the goal of reducing the variance of the estimated fraction, thereby minimizing the distance between $\hat{F}(t)$ and $F(t)$.

Now, we formalize the description above. Let $\Pr(t)$ denote the probability of a data record $t$ being an outlier. For a data record, we identify it as an outlier if $\Pr(t) > \theta_h$, as a normal one if $\Pr(t) < \theta_l$, and as an uncertain one otherwise, where $\theta_l$ and $\theta_h$ are predefined parameters with $\theta_l < \theta_h$. Computing $\Pr(t)$ amounts to calculating the probability that the fraction $F(t)$ exceeds the threshold $pct$. According to the Lindeberg-Levy theorem [14], the estimated fraction obeys a Gaussian distribution when the size of the underlying population is large enough. Since each data record $t$ has its estimated fraction $\hat{F}(t)$ and variance $\mathrm{Var}(\hat{F}(t))$, the fraction is modeled as $F(t) \sim N\bigl(\hat{F}(t), \mathrm{Var}(\hat{F}(t))\bigr)$. As a result, the probability can be computed as

$\Pr(t) = 1 - \Phi\!\left(\frac{pct - \hat{F}(t)}{\sqrt{\mathrm{Var}(\hat{F}(t))}}\right), \quad (12)$

where $\Phi$ is the standard Gaussian distribution function.
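The probability in (12) and the three-way classification can be computed with the standard Gaussian CDF; theta_low and theta_high below stand for the predefined thresholds $\theta_l$ and $\theta_h$ (the names are ours).

import math

def normal_cdf(x):
    # Standard Gaussian CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def outlier_probability(f_hat, var_hat, pct):
    # Pr(t) = P(F(t) > pct) under F(t) ~ N(f_hat, var_hat), cf. (12).
    sigma = math.sqrt(var_hat)
    if sigma == 0.0:
        return 1.0 if f_hat > pct else 0.0
    return 1.0 - normal_cdf((pct - f_hat) / sigma)

def classify(f_hat, var_hat, pct, theta_low, theta_high):
    # Label a record as outlier, normal, or uncertain from its probability.
    p = outlier_probability(f_hat, var_hat, pct)
    if p > theta_high:
        return "outlier"
    if p < theta_low:
        return "normal"
    return "uncertain"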

Thus, a set US of uncertain data objects is picked out. To improve the precision, our task is to obtain an additional sample for identifying these uncertain data records. For the uncertain data records with estimated fractions, the summation variance is $\sum_{t \in US} \mathrm{Var}(\hat{F}(t))$. In this sampling phase, we need to decide the size $m_i$ of the sample drawn from the $i$th stratum so that the summation variance is minimized; as a result, the distance between the estimated $\hat{F}(t)$ and the true $F(t)$ is minimized. The total number of data records drawn from all the strata is fixed and denoted by $k_u$. We formulate our goal as

$\min_{m_1, \dots, m_L} \sum_{t \in US} \mathrm{Var}\bigl(\hat{F}(t)\bigr) \quad \text{subject to} \quad \sum_{i=1}^{L} m_i = k_u. \quad (13)$

The solution of the minimization problem above is

$m_i = k_u \cdot \frac{W_i \sqrt{\sum_{t \in US} \sigma_i^2(t)}}{\sum_{l=1}^{L} W_l \sqrt{\sum_{t \in US} \sigma_l^2(t)}}, \quad (14)$

where $\sigma_i^2(t)$ denotes the variance of the indicator variable of the $i$th stratum with respect to the uncertain data record $t$.

Using the definition of the summation variance, we have

$\sum_{t \in US} \mathrm{Var}\bigl(\hat{F}(t)\bigr) = \sum_{t \in US} \sum_{i=1}^{L} \frac{W_i^2}{m_i}\,\sigma_i^2(t) = \sum_{i=1}^{L} \frac{W_i^2}{m_i} \sum_{t \in US} \sigma_i^2(t). \quad (15)$

Under the constraint $\sum_{i=1}^{L} m_i = k_u$, the problem of minimizing (15) is a typical convex optimization problem. Using the well-known method of Lagrange multipliers, we can directly obtain the solution described in (14).
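A sketch of the allocation in (14); weights holds the $W_i$ and summation_vars the per-stratum summation variances $V_i = \sum_{t \in US} \sigma_i^2(t)$ over the uncertain records (illustrative names, rounded allocation).

import math

def uncertainty_allocation(weights, summation_vars, budget):
    # m_i proportional to W_i * sqrt(V_i), cf. (14).
    scores = [w * math.sqrt(v) for w, v in zip(weights, summation_vars)]
    total = sum(scores)
    if total == 0.0:
        return [0] * len(weights)
    return [round(budget * s / total) for s in scores]

print(uncertainty_allocation([0.5, 0.3, 0.2], [0.8, 0.2, 0.05], budget=100))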

Our uncertainty sampling method can be viewed as a generalization of the Neyman sample allocation method [15] for stratified sampling: the size of the sample drawn from the $i$th stratum is proportional to the size of the subpopulation and to the square root of the summation variance of the indicator variable. The distance between the estimated $\hat{F}(t)$ and the true $F(t)$ is thus minimized after uncertainty sampling. We now introduce our method of mining outliers from the uncertain data records.

The distance between each pair of sampled data records is computed, and then the probability $\Pr(t)$ is computed for each sampled data record. A data record is determined to be an outlier if $\Pr(t) > \theta$; otherwise it is a normal data record, where $\theta$ is a predefined parameter.

A sufficient condition for identifying a data record $t$ is as follows.

For the outliers,

$\hat{F}(t) > pct + u_{\theta}\,\sqrt{\sum_{i=1}^{L} \frac{W_i^2}{n_i}\,\hat{\pi}_i(t)\bigl(1-\hat{\pi}_i(t)\bigr)}. \quad (16)$

For the normal data records,

$\hat{F}(t) \le pct + u_{\theta}\,\sqrt{\sum_{i=1}^{L} \frac{W_i^2}{n_i}\,\hat{\pi}_i(t)\bigl(1-\hat{\pi}_i(t)\bigr)}, \quad (17)$

where $\hat{F}(t)$ denotes the estimated fraction of data records lying outside the distance-$d$ neighborhood of $t$ and $u_{\theta}$ denotes the $\theta$-quantile of the standard Gaussian distribution.

For a sampled data record $t$, $t$ is identified as an outlier if $\Pr(t) > \theta$; otherwise $t$ is identified as a normal data record. To facilitate our presentation, we use $G(t)$ to denote the fraction of neighbors within the distance-$d$ neighborhood and $\hat{G}(t)$ its estimate, where $F(t) + G(t) = 1$ holds with certainty. Using (12), we have the following.

If $t$ is identified as an outlier, that is, $\Pr(t) > \theta$,

$\hat{F}(t) > pct + u_{\theta}\,\sqrt{\mathrm{Var}\bigl(\hat{F}(t)\bigr)}. \quad (18)$

Otherwise,

$\hat{F}(t) \le pct + u_{\theta}\,\sqrt{\mathrm{Var}\bigl(\hat{F}(t)\bigr)}, \quad (19)$

where $u_{\theta}$ denotes the $\theta$-quantile of the standard Gaussian distribution; $u_{\theta} > 0$ when $\theta > 0.5$, and $u_{\theta} < 0$ when $\theta < 0.5$. Substituting the variance estimate of (8) into (18) and (19), we conclude the sufficient conditions described in formulas (16) and (17).

3.4. Summary: Overall Algorithm

Now, we summarize our overall method for detecting outliers from a deep web data source; the overall process is shown in Algorithm 2. The inputs to the algorithm are the set of input attributes IS, the set of output attributes of interest OS, the sample budget $k$ for neighborhood sampling, and the sample budget $k_u$ for uncertainty sampling, together with the parameters $pct$, $d$, $\theta_l$, $\theta_h$, $\theta$, and the significance level $\alpha$ defined in the previous subsections.

(1) SP ← initial random pilot sample
(2) S ← SP
(3) LS ← ∅
(4) stratify the entire query space by applying Algorithm 1 to the root node (query null, attribute list IS), obtaining the leaf node set LS
(5) compute rho_i, the probability of containing outliers, for each N_i ∈ LS
(6) for all N_i ∈ LS do
(7)  compute n_i' using (11)
(8)  S_i ← n_i' data records sampled from the query space of N_i
(9)  S ← S ∪ S_i
(10) end for
(11) OutlierSet ← ∅; US ← ∅
(12) compute the estimated fraction and the probability Pr(DR) for each DR ∈ S
(13) for all DR ∈ S do
(14)  if Pr(DR) > theta_h then
(15)   OutlierSet ← OutlierSet + DR
(16)  else if Pr(DR) ≥ theta_l then
(17)   US ← US + DR
(18)  end if
(19) end for
(20) for all N_i ∈ LS do
(21)  compute m_i using (14)
(22)  S_i' ← m_i data records sampled from the query space of N_i
(23)  S ← S ∪ S_i'
(24) end for
(25) for all DR ∈ S \ OutlierSet do
(26)  if Pr(DR) > theta then
(27)   OutlierSet ← OutlierSet + DR
(28)  end if
(29) end for

At the beginning, a pilot sample is drawn from the entire population of the deep web data source by random sampling [11]. The entire query space of the deep web is then stratified by applying Algorithm 1 starting from the root node (Lines (1)–(4)). A stratum is represented by a leaf node, as described before. After stratification, the records in the pilot sample are divided among the $L$ strata in LS. We then estimate the probability $\rho_i$ of containing outliers for each stratum based on its samples (Line (5)). Next, neighborhood sampling is performed as follows: for each leaf node $N_i$, the number of data records $n_i'$ to be drawn from the query space of $N_i$ is computed using (11), and the records are retrieved (Lines (6)–(10)). After obtaining the neighborhood samples, we identify the outliers and the uncertain data records by computing their estimated fractions and probabilities (Lines (11)–(19)). Next, we conduct uncertainty sampling for the uncertain set US: for each leaf node $N_i$, the number of data records $m_i$ to be drawn from its query space is computed by (14) (Lines (20)–(24)). After obtaining these data records, we recompute the estimates on the enlarged sample, mine the outliers from the data records that were not identified as outliers before, and combine them with the previously identified outliers as our final result (Lines (25)–(29)).

4. Experimental Evaluations

In this section, we evaluate the benefits of the stratification strategy over the query space, neighborhood sampling, and uncertainty sampling, respectively, and compare our proposed method with the baseline on the deep web using the following datasets: (1) a synthetic dataset generated with MATLAB; (2) HTTP and SMTP, two subsets of KDD CUP 1999 that can be downloaded from the UCI repository and are benchmark datasets for outlier detection; and (3) a live experiment conducted on https://autos.yahoo.com/. In order to evaluate the benefit of each component of our solution, we create several variants of our solution in the following subsections.

4.1. Setup

Our evaluation has been performed over a combination of real and synthetic datasets.

Our synthetic dataset is generated by MATLAB. It contains 4100 data records, including 4000 normal records and 100 outlier records. There are seven attributes (i.e., 5 categorical input attributes and 2 continuous output attributes). Four clusters exist on the two output attributes, which are generated by a Gaussian distribution. The output attributes are created to be dependent on the input attributes.

Two real datasets, referred to as HTTP and SMTP, are subsets of KDD CUP 1999 that can be downloaded from the UCI repository. The HTTP dataset contains 623091 data records, while the SMTP dataset contains 96554 data records. Preprocessing has been performed on these two datasets: we randomly sample 8000 data records from each of them as our final experimental datasets. The original dataset contains 41 attributes, of which we retain only two basic attributes (i.e., “src_bytes” and “dst_bytes”) as our output attributes and five attributes (i.e., “duration”, “flag”, “land”, “wrong_fragment”, and “urgent”) as our input attributes. Duplicate records in each dataset are kept only once. Instead of using the existing abnormality labels of the data records, we label the data records using the DB-Outlier definition.

We conduct live experiments over a subset of a real-world hidden database (https://autos.yahoo.com/) containing data on new cars located within New York, NY, and Washington, DC. This deep web database has 3 input attributes (Make, Model, and Zip) and 7 output attributes (brand, distance, mileage, year, price, etc.).

In our experiments, we set the significance level of the hypothesis test that guards against overstratification so that the corresponding critical value is 1.96. We adopt three common criteria to evaluate all methods: precision, recall, and F-measure. Here, we briefly introduce the three criteria. Precision is the percentage of true outliers among all outliers identified by a method. Recall is the percentage of true outliers identified among all underlying outliers. F-measure is the harmonic mean of precision and recall, which is commonly used in imbalanced classification; given the rarity of outliers, F-measure is particularly suitable in our scenario.
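For completeness, the three criteria can be computed as follows; the detected and true outlier sets in the example are invented.

def precision_recall_f(detected, true_outliers):
    # Precision, recall, and F-measure (harmonic mean) for a detected outlier set.
    detected, true_outliers = set(detected), set(true_outliers)
    tp = len(detected & true_outliers)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_outliers) if true_outliers else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(precision_recall_f(detected=[3, 9, 42], true_outliers=[9, 42, 57, 88]))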

The outlier records of the synthetic, HTTP, and SMTP datasets are known, so precision, recall, and F-measure can be computed directly. To evaluate our method on the live deep web database https://autos.yahoo.com/, we download the data on new cars located within New York, NY, and Washington, DC, which yields 163,000 data records with 7 output attributes (brand, distance, mileage, year, price, etc.). We then run an offline outlier detection tool on this downloaded dataset to label its outlier records, which serve as the benchmark for the experimental evaluation.

For each dataset, we repeat each experiment independently 50 times and report the average results as our final results. The x-axis of our experimental figures represents the total sample size. The size of the pilot sample is held fixed, and the ratio between the neighborhood sampling budget and the uncertainty sampling budget is always 1.

Our proposed method, referred to as SNU, will be compared with the following methods:
(i) SRS: the simple random sampling method, which uses random sampling in each sampling step; this is the baseline method.
(ii) SN: the same as our proposed algorithm except that uncertainty sampling is replaced by random sampling.
(iii) SU: the same as our proposed algorithm except that neighborhood sampling is replaced by random sampling.

4.2. Evaluation of Stratification and Neighborhood Sampling

In this subsection, we focus on evaluating the benefits from stratification and neighborhood sampling. For this purpose, we compare our method SNU with two other methods, SRS and SU.

Figures 1(a), 1(b), 1(c), and 1(d) show the comparisons among the three methods in terms of recall on the four datasets (one synthetic dataset, the two real datasets HTTP and SMTP, and the live Yahoo dataset). From Figure 1(a), we can see that the recalls of both SU and SRS are significantly smaller than that of SNU. This demonstrates the effectiveness of neighborhood sampling, a key component of our proposed method SNU. It also shows that the distribution of the input attributes is helpful for collecting more outliers, and that stratification of the population is necessary for neighborhood sampling. Furthermore, compared with SRS, SNU improves recall by up to 163.1%. This gives us confidence in applying our method to real-world deep web applications to detect outliers.

Figures 1(b) and 1(c) show the results on the two real datasets, where a very similar trend can be observed. Compared with the synthetic dataset, all methods have lower recall. A reasonable explanation is that the relationship between the input attributes and the output attributes is not as strong as in the synthetic dataset; as a result, the similarity of data records contained in the same stratum is not as high. On closer observation, the recall on the SMTP dataset is higher than that on the HTTP dataset under the same sample size, likely because the outlier rate in SMTP is correspondingly higher than in HTTP. Moreover, compared with SRS, SNU improves recall by 109.7% on average over these two real datasets.

Figure 1(d) shows the results on the live deep web database https://autos.yahoo.com/, where again a very similar trend appears. Compared with the other three datasets, all methods have lower recall under the same sample size. Moreover, compared with SRS and SU, SNU improves recall on the live dataset by 111.6% and 119.7% on average, respectively. Thus, our approach is clearly more effective than SU and SRS on the live deep web.

4.3. Evaluation of Uncertainty Sampling

We now evaluate the benefits of uncertainty sampling in further identifying the uncertain data records. For this purpose, we compare our method SNU with two other methods, SRS and SN.

Figure 2 shows the comparison of precision for these three methods on the four datasets. Figure 2(a) shows the results for the synthetic dataset, while Figures 2(b), 2(c), and 2(d) show the results for the HTTP, SMTP, and live Yahoo datasets, respectively. Figures 2(a)–2(d) show that the precision of SN is smaller than that of SNU. This demonstrates the effectiveness of uncertainty sampling, because SN uses random sampling instead of uncertainty sampling. However, the precision of our method SNU is slightly lower than that of SRS. This is mainly because the distribution of the output attributes in our sample is biased after neighborhood sampling; this effect is alleviated when the sample size is large enough. Furthermore, we can also see that both SNU and SRS achieve convincing precision on all four datasets. This indicates that we do not need to spend much effort on improving precision; instead, more attention should be paid to recall.

4.4. Evaluation of Detection Performance

After evaluating the benefits of each component of our method SNU, we now evaluate its overall performance in detecting outliers over the deep web. For this purpose, we compare our method with the baseline method SRS on the four datasets. Figure 3 shows that our method outperforms the SRS method on all datasets. The average improvement in F-measure on the synthetic, HTTP, SMTP, and live Yahoo datasets is 184.5%, 78%, 68.1%, and 75.4%, respectively. Overall, we observe that our method needs fewer sampled data records than SRS to achieve the same performance. Since the cost of obtaining each sample is the dominant factor in mining the deep web, this reflects that our method can achieve significant reductions in query cost. Furthermore, combined with the corresponding recalls and precisions in Figures 1 and 2, we conclude that recall is the dominant factor in improving the performance of outlier detection methods over the deep web, whereas precision has little room for improvement.

5. Related Work

There is a large body of research on the deep web. However, most of it focuses on how to build an interactive query system or a vertical search system using data integration technologies [16–18]. Recently, with the development of sampling and crawling over the deep web [19–21], mining the deep web has attracted more attention than before [22–24]. Moreover, outlier detection has always been a hot topic in machine learning and data mining. However, existing work has not considered the combination of these two aspects. To the best of our knowledge, this paper is the first to conduct outlier detection over the deep web. In this section, we present the related work and discuss how our work differs from it.

5.1. Sampling for Outlier Detection

Outlier detection using a sampling strategy has been described by various researchers [25–27]. Wu et al. [25] utilized a sampling method to detect DB-Outliers. They assumed that calculating the distance between each pair of samples is expensive; instead of calculating the distance between every pair, their method randomly draws samples from the neighborhood points. Kollios et al. [26] attempted to build a density estimator based on random sampling and then used the density estimator to detect DB-Outliers with biased sampling. Abe et al. [27] presented a classification-based outlier detection method, transforming the outlier detection problem into a simplified classification problem based on a sample selection mechanism from active learning. Unlike these settings, the problem we consider in this paper lies in the context of the deep web: even when a particular sample is needed, we cannot obtain it directly because of the characteristics of the deep web.

5.2. Deep Web Sampling

Dasgupta et al. [11] proposed a random sampler called HDSampler, which obtains a random sample set from a deep web source by performing a random walk over the input-attribute space. The research of Liu and Agrawal [28] is the most closely related to our work. They addressed the problem of clustering over a deep web data source by stratified sampling [29–31]: their method first performs stratification over the deep web, then selects representative samples from each stratum, and finally conducts hierarchical clustering. Inspired by their work, we propose a method that addresses the problem of outlier detection over a deep web data source in this paper.

6. Conclusions

This paper presents a novel problem: outlier detection over the deep web. We propose a solution that divides the outlier detection procedure into three components: stratification over the deep web, neighborhood sampling, and uncertainty sampling. We develop the stratification scheme through a hierarchical tree that models the relationship between the input attributes and the output attributes. Instead of random sampling across the strata, we develop a neighborhood sampling scheme for collecting more outliers. Furthermore, we develop an uncertainty sampling algorithm to verify the uncertain instances in order to improve detection precision. We evaluate the performance of our solution empirically on synthetic and real datasets. Our experimental results show that our approach significantly enhances recall and F-measure compared with the simple random sampling method. In the future, we wish to address this problem by mapping a deep web data source into a graph and conducting data mining over the graph.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research was partially supported by the Natural Science Foundation of China (nos. 61440053, 61472268, and 41201338).