Abstract
Machine-constructed knowledge bases often contain noisy and inaccurate facts. There exists significant work in developing automated algorithms for knowledge base refinement. Automated approaches improve the quality of knowledge bases but are far from perfect. In this paper, we leverage crowdsourcing to improve the quality of automatically extracted knowledge bases. As human labelling is costly, an important research challenge is how we can use limited human resources to maximize the quality improvement for a knowledge base. To address this problem, we first introduce a concept of semantic constraints that can be used to detect potential errors and do inference among candidate facts. Then, based on semantic constraints, we propose rank-based and graph-based algorithms for crowdsourced knowledge refining, which judiciously select the most beneficial candidate facts to conduct crowdsourcing and prune unnecessary questions. Our experiments show that our method improves the quality of knowledge bases significantly and outperforms state-of-the-art automatic methods under a reasonable crowdsourcing cost.
1. Introduction
There are numerous information extraction projects that use a variety of techniques to extract knowledge from large text corpora and the World Wide Web [1]. Example projects include YAGO [2], DBPedia [3], NELL [4], open information extraction [5], and knowledge vault [6]. These projects provide automatically constructed knowledge bases (KBs) with massive collections of entities and facts, where each entity or fact has a confidence score. However, machine-constructed knowledge bases contain noisy and unreliable facts due to the variable quality of information and the limited accuracy of extractors. Transforming these candidate facts into useful knowledge is a formidable challenge [7].
To reduce the amount of noise in automatically extracted facts, these projects often employ ad hoc heuristics to reason about uncertainty and contradictoriness due to the large scale of the facts. There exists significant work in developing effective algorithms to perform joint probabilistic inference over candidate facts [7, 8]. Automated approaches have improved in quality but remain far from perfect. Therefore, effective methods to obtain high-quality knowledge are desired. It is easy for human experts to determine whether a fact is correct or not. However, it is impossible to hire experts to correct all of them. Recently, due to the availability of Internet platforms like Amazon Mechanical Turk (MTurk), which enable the participation of human workers on a large scale, crowdsourcing has proven to be a viable and cost-effective alternative. Crowdsourcing is normally used to create labelled datasets for machine learning algorithms and has become an effective way to handle computer-hard tasks [9–11], such as sentiment analysis [12], image classification [13], and entity resolution [14]. The limitations of machine-based approaches and the availability of easily accessible crowdsourcing platforms inspire us to exploit crowdsourcing to improve the quality of automatically extracted knowledge bases.
In this paper, we study the problem of refining knowledge bases using crowdsourcing. Specifically, given a collection of noisy extractions (entities and their relationships) and a budget, we aim to obtain a set of high-quality facts from these extractions via crowdsourcing. In particular, there are two subproblems to address in this study:
Error Detection: how can we effectively detect potentially erroneous candidate facts that need to be verified by the crowd? Information extraction systems are able to extract massive collections of interrelated facts. Some facts are correct, while others are clearly incorrect and contradictory. Asking humans to verify all candidate facts is generally not feasible due to the large size of extractions. Hence, one of the key challenges is to determine which subset of knowledge should be presented to the crowd for verification.
Knowledge Inference: how can we accurately infer consistent knowledge based on crowd feedback? Errors introduced by the extraction process cause inconsistencies in the knowledge base, which may contain duplicate entities and violate key ontological constraints such as subsumption, mutual exclusion, inverse, and domain and range constraints.
To address these problems, we first introduce a concept of semantic constraints, which is similar to integrity constraints in data cleaning. Then we propose rank-based and graph-based algorithms to judiciously select candidate facts to conduct crowdsourcing based on semantic constraints. Our method automatically assigns the most “beneficial” task to the crowd and infers the answers of some candidate facts based on crowd feedback. Experiments on NELL’s knowledge base show that our method can significantly improve the quality of knowledge and outperform state-of-the-art automatic methods under a reasonable crowdsourcing cost.
To summarize, we make the following contributions:
(1) We propose a rank-based crowdsourced knowledge refining framework. We introduce a concept of semantic constraints and utilize it to detect potential contradictive facts. We present a score function taking both uncertainty and contradictoriness into consideration to select the most beneficial candidate facts for crowdsourcing.
(2) We construct a graph based on the semantic constraints and utilize the graph to ask questions and infer answers. We judiciously select candidate facts to ask in order to minimize the number of candidate facts to conduct crowdsourcing. We propose path-based and topological-sorting-based algorithms that ask multiple questions in parallel in each iteration.
(3) We develop a probability-based method to tolerate the errors introduced by the crowd and propagated through inference rules.
(4) We conduct experiments using real-world datasets on a real crowdsourcing platform. Experimental results show the effectiveness of the proposed approaches.
The rest of this paper is structured as follows. We first review related work in Section 2 and introduce basic concepts related to our work in Section 3. Then we describe our proposed approaches in Section 4. We report experimental results in Section 5 and conclude in Section 6.
2. Related Work
Information extraction techniques are widely applied in the construction of web-scale knowledge bases. In this paper, we use the Never-Ending Language Learner (NELL) [4] as a case study. NELL starts from a few “seed instances” of each category and relation and generates a knowledge base iteratively. It uses natural language processing and information extraction techniques to extract candidate facts from a large web corpus, using facts learned from the previous iteration as training examples. NELL has four subcomponents that extract candidate facts, namely, Pattern Learner, SEAL, Morphological Classifier, and Rule Learner. NELL uses heuristics and ontological constraints to promote candidate facts into a knowledge base, assigning each promotion a confidence value.
Early work on cleaning a noisy knowledge base was done by Cohen et al. [15], who considered only a small subset of KB errors. Jiang et al. [8] proposed a method for cleaning knowledge bases at a broader scope using Markov Logic Networks (MLNs). This method performs joint probabilistic inference over candidate facts. To make inference and learning tractable, Jiang et al. introduced a number of approximations and demonstrated the utility of joint reasoning in comparison to a baseline that considers each fact independently. More recently, Pujara et al. [7] improved the model of Jiang et al. by including multiple extractors and reasoning about coreferent entities. Furthermore, Pujara et al. used probabilistic soft logic (PSL) to avoid the scalability limitations of MLNs. Dong et al. [6] employed supervised machine learning methods for fusing distinct information sources by combining noisy extractions from the web with prior knowledge derived from existing knowledge repositories. However, all of the above methods are automated algorithms and do not leverage the power of crowdsourcing.
There also exist many research works that incorporate crowdsourcing into data and knowledge management, such as data cleaning [14, 16–19], record linkage [20, 21], schema matching [22–25], and knowledge acquisition [26, 27]. For example, Wang et al. [14] proposed CrowdER to solve the problem of entity resolution via crowdsourcing. Zhang et al. [16] used crowdsourcing to clean uncertain data. Chu et al. [18] proposed KATARA, a data cleaning system that utilizes the power of knowledge bases and crowdsourcing to clean tables. Demartini et al. [20] proposed ZenCrowd, which uses a mixed human-machine workflow to solve the entity linking problem. Gokhale et al. [21] studied hands-off crowdsourced record linkage, which requires no involvement of developers. Sarasua et al. [22] studied the problem of ontology matching using crowdsourcing. Fan et al. [23] proposed a hybrid machine-crowdsourcing system for matching web tables. Kondreddi et al. [26, 27] developed HIGGINS, a framework for human intelligence games for knowledge acquisition, to expand and complement the output of automated information extraction methods. However, so far, there has been little discussion about how to use crowdsourcing to clean a noisy knowledge base with semantic constraints.
3. Preliminaries
3.1. Knowledge Bases
We consider an automatically extracted knowledge base as a probabilistic knowledge base, which stores facts in the form of triples (subject, predicate, object), for example, (Brussel, citycapitalofcountry, Belgium). Each fact has a confidence score, representing the probability that the corresponding information extraction system “believes” the fact is correct. We formally define an extracted knowledge base as follows.
Definition 1. An extracted knowledge base (KB) is a 5-tuple consisting of the following components:
(1) A set of entities. Each entity refers to a real-world object.
(2) A set of categories (or types). Each category is a subset of the entity set, and each entity belongs to one or more categories.
(3) A set of predicates. Each predicate defines a binary relation between one or more pairs of types. We call the types of the subject and the object the domain and range of the predicate, respectively.
(4) A set of weighted facts. Each weighted fact pairs a triple, representing that a relation holds between a subject entity and an object entity, with a weight indicating the probability that an information extraction system believes the fact is correct, that is, a confidence score.
(5) A set of ontological relations, which defines the concept hierarchy and the semantic relationships between categories and relations.
The definition of categories implies a concept hierarchy: for any two categories, the first is a subclass of the second if and only if it is a subset of the second. Typing provides semantic context for extracted entities and is commonly adopted by information extraction systems, so we make it an integral part of the definition. In what follows, a category fact states that an entity belongs to a category, and a relation fact states that a relation holds between two entities.
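As a reading aid, Definition 1 can be sketched as a small Python data structure. This is only an illustration: the class names, field names, the `location` category, the `("SUB", ...)` tuple encoding, and the confidence value 0.93 are all assumptions made for the example and are not taken from NELL or from the definition itself.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    """A weighted triple: (subject, predicate, object) plus a confidence score."""
    subject: str
    predicate: str
    obj: str
    weight: float  # extractor's confidence that the fact is correct

@dataclass
class KnowledgeBase:
    """Illustrative container for the five components of Definition 1."""
    entities: set = field(default_factory=set)
    categories: dict = field(default_factory=dict)  # category name -> set of entities
    predicates: dict = field(default_factory=dict)  # predicate -> (domain, range)
    facts: list = field(default_factory=list)       # list of Fact
    ontology: set = field(default_factory=set)      # hypothetical encoding, e.g. ("SUB", "city", "location")

    def is_subclass(self, c1: str, c2: str) -> bool:
        # Concept hierarchy implied by the definition: subclass iff subset.
        return self.categories[c1] <= self.categories[c2]

# Toy instance built around the example triple used in the text.
kb = KnowledgeBase(
    entities={"Brussel", "Belgium"},
    categories={"city": {"Brussel"}, "country": {"Belgium"},
                "location": {"Brussel", "Belgium"}},
    predicates={"citycapitalofcountry": ("city", "country")},
    facts=[Fact("Brussel", "citycapitalofcountry", "Belgium", 0.93)],
    ontology={("SUB", "city", "location")},
)
```

Representing each category as a plain set makes the subclass test a direct subset check, which mirrors the wording of the definition.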
An automatically extracted knowledge base can be very large and noisy. For example, the knowledge vault [6] has 1.6B triples, of which 324M have a confidence of 0.7 or higher, and 271M have a confidence of 0.9 or higher. NELL so far has acquired a knowledge base with over 80M confidence-weighted facts, 2M of which have a confidence of 0.9 or higher. The overall estimated precision of NELL’s promoted facts across the first 66 iterations is 74%.
3.2. Crowdsourcing
There exist a number of crowdsourcing platforms, such as MTurk and CrowdFlower. On such platforms, we can ask human “workers” to complete microtasks. For example, we may ask them to answer questions like “Is Italy a country?” Each microtask is referred to as a human intelligence task (HIT). After having completed a HIT, a worker is rewarded with a certain amount of money based on the difficulty of the HIT. That is, invoking the crowd for knowledge cleaning comes with a monetary cost. In addition, a human worker may not always produce a correct answer for a HIT. To mitigate such human errors, we assign each HIT to multiple workers and then take a majority vote. However, even when majority votes are used, we may still get incorrect answers from the crowd. As a consequence, it is crucial to take human errors into account when designing a crowd-based algorithm.
Given a set of candidate facts to be sent to the crowd, we need to combine them into HITs. For each fact, the crowd needs to verify whether the fact is correct or not. We group four questions into one HIT, where each question contains a candidate fact whose correctness workers are asked to verify. Box 1 shows an example of the questions we generate as a HIT for MTurk. A brief description of the HIT is shown at the top. To help workers understand the fact, we provide a description of the related category or relation and present each fact in a human-readable format.
4. Methodology
Our method takes an automatically extracted knowledge base as input and identifies a set of true facts from noisy extractions through crowdsourcing. We first introduce the concept of semantic constraints, which can be used to detect potentially erroneous facts and do inference among candidate facts. We then propose a score function to measure the usefulness of candidate facts, in order to conduct crowdsourcing. In Section 4.3, we explain how to leverage semantic constraints as inference rules to prune unnecessary questions. Finally, we discuss our error-tolerant techniques.
4.1. Semantic Constraints
Integrity constraints are effective tools used in data cleaning. This section introduces a similar concept called semantic constraints that can be used to clean noisy knowledge bases. These constraints can be learned from training data or derived from ontological constraints. The ontological constraints can be seen as axioms or rules in first-order logic. For example, an ontological constraint stating that every entity of one category is also an entity of another category can be represented by a rule in which membership in the first category implies membership in the second. Similarly, when entities of one category are never entities of another, we can write a rule in which membership in the first category implies non-membership in the second.
We derive semantic constraints according to ten types of ontological relations used in NELL: subsumption among categories and relations (e.g., every bird is an animal); mutually exclusive categories and relations (e.g., no person is a location); inversion (for mirrored relations like TeamHasPlayer and PlaysForTeam); the type of the domain and range of each predicate (e.g., the mayor of a city must be a person); the functionality of relations (e.g., a person has only one birth date); antisymmetric (e.g., if a person writes a book, then the book cannot write the person); and antireflexive (e.g., a company cannot produce itself).
We use the following notations: and for subclass relationships; and for mutual exclusion relationships; and for domain and range relationships; for inversion; for functionality, for antisymmetric, and for antireflexive.
There are two types of semantic constraints according to the label transitive relation between candidate facts: contradictive relation and positive relation. The derived semantic constraints are shown as follows.
Contradictive Relation. In semantic constraints of this type, if a candidate fact is correct, we can infer that another candidate fact must be incorrect. Violations of contradictive relations indicate potential errors.
Positive Relation. In semantic constraints of this type, if a candidate fact is correct, we can infer another candidate fact is also correct.
Given semantic constraints and a set of candidate facts, we generate a set of ground rules. A ground rule is a rule containing only candidate facts and no variables. We first instantiate a formula of a semantic constraint using ontological relations and candidate facts in the knowledge base. Then, we omit the instantiated ontological relations, since they are deemed to be true, and obtain ground rules containing only candidate facts. For example, (a) in Box 2 shows sample ontological relations defined in NELL. Considering the candidate facts in (b) in Box 2, the corresponding ground rules generated according to the semantic constraints are shown in (c) in Box 2.
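The grounding step can be sketched in Python as follows. This is a minimal sketch covering only three of the constraint types (subsumption, mutual exclusion, inversion). The tuple encodings `("CAT", entity, category)` and `("REL", subject, predicate, object)`, the relation tags `SUB`, `MUT`, `INV`, and the entity names are assumptions made for illustration; the category and relation names reuse the examples given in the text.

```python
def ground_rules(ontology, facts):
    """Return (positive, contradictive) sets of ground rules as (premise, other_fact) pairs.

    Only rules whose two sides are both candidate facts are kept; the
    ontological part is dropped because it is assumed to be true.
    """
    facts = set(facts)
    positive, contradictive = set(), set()
    for kind, a, b in ontology:
        for f in facts:
            if kind in ("SUB", "MUT") and f[0] == "CAT" and f[2] == a:
                g = ("CAT", f[1], b)            # same entity, other category
                if g in facts:
                    if kind == "SUB":           # every a is a b: f correct => g correct
                        positive.add((f, g))
                    else:                       # no a is a b: either one refutes the other
                        contradictive.update({(f, g), (g, f)})
            elif kind == "INV" and f[0] == "REL" and f[2] == a:
                g = ("REL", f[3], b, f[1])      # swap subject and object
                if g in facts:
                    positive.add((f, g))
    return positive, contradictive

# Toy ontology and candidate facts (entity names are hypothetical).
ontology = [("SUB", "bird", "animal"),
            ("MUT", "person", "location"),
            ("INV", "teamhasplayer", "playsforteam")]
facts = [("CAT", "tweety", "bird"), ("CAT", "tweety", "animal"),
         ("CAT", "paris", "person"), ("CAT", "paris", "location"),
         ("REL", "team_a", "teamhasplayer", "player_x"),
         ("REL", "player_x", "playsforteam", "team_a")]
pos, con = ground_rules(ontology, facts)
```

Each returned pair contains only candidate facts, matching the description of a ground rule as a rule with no variables and no ontological atoms.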
While contradictive semantic constraints can be used to detect potential erroneous facts, both positive constraints and contradictive constraints can be used to do inference among candidate facts.
4.2. Ranking Facts Based on Benefit
In this section, we propose a rank-based method for knowledge refining. We would like to select the most beneficial candidate facts to conduct crowdsourcing under a given budget for a knowledge base. Naturally, we prefer to choose the facts that the corresponding information extraction system is most uncertain about. In addition, the facts that violate semantic constraints the most are high-risk and important to the knowledge base, so it is beneficial to verify them via crowdsourcing. In this paper, we use contradictoriness to estimate the risk and importance of candidate facts. In summary, we first evaluate the benefit of candidate facts in terms of improving the quality of the knowledge base by taking both uncertainty and contradictoriness into consideration. Then we rank them according to their evaluation scores and choose the top facts to conduct crowdsourcing.
Uncertainty Score. A threshold is often applied to probabilistic extractions. Facts with a confidence above the threshold can be assumed to be correct, while facts with a confidence below the threshold are deemed most likely incorrect. The most uncertain facts are those whose probability is closest to the threshold. We therefore model the uncertainty of a fact in terms of the machine-based estimate of the probability that the fact is correct, as follows.
Information extraction systems commonly provide a confidence score for each candidate fact, that is, the weight in the knowledge base definition. We adopt the weight as the machine-based probability estimation of a fact being correct.
Information extraction systems usually use many different extraction techniques to generate candidates. For example, NELL produces separate extractions from lexical, structural, and morphological patterns. If the patterns used to extract each candidate fact are provided, this extra information can help us better estimate the probability. We can use a simple logistic regression model learned from training data to predict the probability of each candidate fact being correct [28, 29]. The features are whether each pattern co-occurs with the candidate fact, and the coefficients reflect the reliability of patterns.
Contradictoriness Score. Based on the contradictive semantic constraints introduced in Section 4.1, we can detect inconsistencies, errors, and conflicts among candidate facts as violations of these constraints. The more facts a fact contradicts, the more likely it is a potential error. We define the contradictoriness score of a fact as follows. The score is based on the number of ground rules that are violated when the fact appears in the knowledge base; the addition of 1 ensures that contradictoriness scores are greater than zero.
For example, considering a semantic constraint (rule) (defined in Section 4.1), supposing there are two ground atoms and , and three facts , , and , then the fact violates two ground rules of . Hence, .
Combining the above two factors, we use the following function to rank candidate facts.
Based on ranking scores, we select a batch of candidate facts to conduct crowdsourcing at a time. Algorithm 1 describes the overall procedure for refining a knowledge base. Given an extracted knowledge base and a set of semantic constraints (SCs), it first initializes the confidences of candidate facts being correct with machine-based estimations and calculates scores using (15)–(18). Then, it selects the top-ranked candidate facts from the knowledge base to conduct crowdsourcing, up to the budget allowed for improving the knowledge base.
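The rank-based selection can be illustrated with a short Python sketch. The functional forms below are assumptions chosen only to make the idea concrete, not the score functions referenced as (15)–(18): uncertainty is taken as one minus the distance to the threshold, contradictoriness as one plus the number of contradictive ground rules that mention the fact, and the final score as their product. The fact names and confidence values are toy data.

```python
def uncertainty(p, theta=0.5):
    # Assumed form: largest when p equals the threshold, decreasing with distance.
    return 1.0 - abs(p - theta)

def contradictoriness(fact, contradictive_rules):
    # 1 + number of contradictive ground rules involving the fact,
    # so the score is always greater than zero.
    return 1 + sum(fact in rule for rule in contradictive_rules)

def select_for_crowd(confidence, contradictive_rules, budget, theta=0.5):
    """Rank facts by (assumed) uncertainty * contradictoriness and keep the top `budget`."""
    score = {f: uncertainty(p, theta) * contradictoriness(f, contradictive_rules)
             for f, p in confidence.items()}
    return sorted(score, key=score.get, reverse=True)[:budget]

# Toy data: machine confidences and contradictive ground rules (each listed once).
confidence = {"f1": 0.55, "f2": 0.95, "f3": 0.60, "f4": 0.50}
rules = [("f1", "f3"), ("f1", "f2")]
chosen = select_for_crowd(confidence, rules, budget=2)
```

In this toy run, `f1` ranks first because it is both close to the threshold and involved in two contradictions, while `f4` ranks low despite maximal uncertainty because it contradicts nothing.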

4.3. Leveraging Semantic Constraints to Prune Unnecessary Questions
In this section, we discuss how to utilize semantic constraints as inference rules to reduce the crowdsourcing cost. The rank-based method discussed above simply selects the top candidate facts to conduct crowdsourcing at a time. However, by leveraging semantic constraints, we can infer the correctness of a candidate fact from other facts without acquiring the intelligence from the crowd. Thus, we can use the crowdsourcing budget more effectively. For example, if a candidate fact is correct, we do not need to crowdsource its inverse fact, since it can be inferred to be correct based on the inversion constraint (12). Likewise, if a fact of a functional relation is correct, we can infer that any other fact of that relation with the same subject and a different object is incorrect based on the functionality constraint (7).
4.3.1. Graph-Based Algorithm
To leverage semantic constraints, we model the selected candidate facts (under a given budget) for crowdsourcing as a graph based on ground inference rules and try to infer the correctness of some candidate facts using the graph model.
Definition 2 (graph model). Given a set of candidate facts, we build a directed graph in which each vertex is a candidate fact and the edge set is the union of two sets: edges representing positive relations and edges representing contradictive relations. Given two candidate facts, if a ground rule states that the first being correct implies that the second is correct, there is a directed edge from the first to the second to represent this positive relation; if a ground rule states that the first being correct implies that the second is incorrect, there is a directed edge from the first to the second to represent this contradictive relation.
Figure 1 shows the graph for the candidate facts in Box 2. We consider two subgraphs separately: the positive-relation subgraph, which contains only positive-relation edges, and the contradictive-relation subgraph, which contains only contradictive-relation edges.
Graph Coloring. Each vertex in the graph has two possibilities: the candidate fact is correct and we color it Green; the candidate fact is incorrect and we color it Red. Initially, each vertex is uncolored. Our goal is to utilize the crowd to color all vertices.
A straightforward method is to take the candidate fact on each vertex as a question and ask workers to answer the question, that is, whether the candidate fact is correct. If a worker thinks that the candidate fact is correct, the worker returns Yes, and No otherwise. Based on the workers’ results, we get a voted answer on each vertex. If a majority of workers vote Yes, we color it Green; otherwise we color it Red. Next, we interchangeably use vertex, fact, and question if the context is clear.
This method is rather expensive as there are many vertices in the graph. To address this issue, we propose an effective coloring framework to reduce the number of questions. Algorithm 2 shows the pseudocode. It first constructs a graph based on ground inference rules (lines 1–2). Then it selects an uncolored vertex and asks workers to answer Yes or No for the vertex (line 4). If a majority of workers vote Yes, we not only color the vertex Green, but also color all of its descendants in the positive-relation subgraph Green, and we color Red all of their children in the contradictive-relation subgraph (that is, the children of the vertex and of its descendants), together with the ancestors of those children in the positive-relation subgraph (lines 6–8). In other words, for a positive relation whose premise is correct, we can infer that its conclusion is also correct; for a contradictive relation whose premise is correct, we can infer that the contradicted fact is incorrect. If a majority of workers vote No, we not only color the vertex Red, but also color all of its ancestors in the positive-relation subgraph Red (line 10). In other words, for a positive relation whose conclusion is incorrect, we infer that its premise is also incorrect. If all the vertices have been colored, the algorithm terminates. Otherwise, it selects an uncolored vertex and repeats the above steps (lines 3–12).

This method can reduce the crowdsourcing cost, as we can avoid asking questions for many unnecessary vertices. For example, considering the constructed graph in Figure 1, a naive method is to conduct crowdsourcing for all eight facts. However, if we first crowdsource a vertex for which a majority of workers vote Yes, we can color it and its descendants Green and color the facts they contradict Red without crowdsourcing them. If we then crowdsource a vertex for which a majority of workers vote No, we can color it and its ancestors Red.
An important problem in the algorithm is to select the minimum number of vertices to conduct crowdsourcing, so that all vertices in the graph are colored. We will first formulate the question selection problem and then propose a path-based algorithm and a topological-sorting-based algorithm that select multiple vertices in each iteration to solve the problem.
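The coloring framework can be sketched in Python as below. This is a simplified illustration under stated assumptions: vertices are asked in the order given rather than by any selection strategy, the `ask` callback stands in for a majority vote of workers, no conflict handling is included, and the graph, vertex names, and ground truth are toy data.

```python
from collections import defaultdict

def reachable(adj, start):
    """All vertices reachable from start (excluding start itself) via adj."""
    seen, stack = set(), [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def color_graph(vertices, pos_edges, con_edges, ask):
    """Color every vertex Green/Red, asking the crowd only for uncolored vertices."""
    children_p, parents_p, children_c = defaultdict(set), defaultdict(set), defaultdict(set)
    for u, v in pos_edges:
        children_p[u].add(v)
        parents_p[v].add(u)
    for u, v in con_edges:
        children_c[u].add(v)
    color, asked = {}, []
    for v in vertices:
        if v in color:
            continue                      # already inferred, no question needed
        asked.append(v)
        if ask(v):                        # majority voted Yes
            green = {v} | reachable(children_p, v)
            for g in green:
                color[g] = "Green"
            for g in green:               # facts contradicted by a Green fact are Red,
                for r in children_c[g]:   # and so are their positive-relation ancestors
                    for x in {r} | reachable(parents_p, r):
                        color[x] = "Red"
        else:                             # majority voted No
            for x in {v} | reachable(parents_p, v):
                color[x] = "Red"
    return color, asked

# Toy graph: a -> b (positive), d -> c (positive), b contradicts c.
truth = {"a": True, "b": True, "c": False, "d": False}
colors, asked = color_graph(["a", "b", "c", "d"],
                            pos_edges=[("a", "b"), ("d", "c")],
                            con_edges=[("b", "c")],
                            ask=lambda v: truth[v])
```

In this toy run a single question on `a` is enough to color all four vertices, which is the saving the framework aims for.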
4.3.2. Optimal Vertex Selection
Recall the basic coloring strategy: if a vertex is Green, then all of its descendants in the positive-relation subgraph are Green, but all of its children in the contradictive-relation subgraph and their ancestors in the positive-relation subgraph are Red; if a vertex is Red, then all of its ancestors in the positive-relation subgraph are Red. We will discuss how to support the case that these two conditions do not hold in Section 4.4.
Definition 3 (optimal graph coloring). Given a graph, the optimal graph coloring problem aims to select the minimum number of vertices as questions to color all the vertices using the coloring strategy.
For example, in Figure 2, if we select vertices sequentially, we need to ask seven questions. The optimal crowdsourced vertices (highlighted by bold circles) are those whose colors cannot be inferred based on the colors of other vertices. Next we study how to identify the optimal vertices. Before that, we introduce a concept to simplify our explanation.
Definition 4 (boundary vertex). A vertex is a boundary vertex if its color cannot be inferred based on other vertices’ colors. There are four cases in the positive-relation subgraph: all of its parents have colors different from that of the vertex; all of its children have colors different from that of the vertex; it has no parent and its color is Green; or it has no children and its color is Red. In addition, there are two cases in the contradictive-relation subgraph: all of its parents are Red; or it has no parents (no in-edge).
For example, a vertex whose parent in the positive-relation subgraph has a different color is a boundary vertex, whereas a vertex whose parent has the same color is not, because its color can be inferred from its parent’s color. A vertex in the contradictive-relation subgraph whose parent is Red is also a boundary vertex.
Consider the set of all boundary vertices in the graph, together with the sets of boundary vertices in the positive-relation subgraph and in the contradictive-relation subgraph; the latter two sets may overlap. Ideally, all boundary vertices in the graph should be checked, because their colors cannot be inferred. Thus, the number of vertices checked by any algorithm cannot be smaller than the number of boundary vertices in the graph. However, since we do not know the ground truths of all vertices in the graph, we cannot identify the boundary vertices in advance. To address this problem, we propose effective algorithms to identify the boundary vertices in the positive-relation subgraph with a theoretical guarantee and use a greedy algorithm to identify the boundary vertices in the contradictive-relation subgraph. Meanwhile, we note that there are more boundary vertices in the contradictive-relation subgraph than in the positive-relation subgraph, because a vertex in the contradictive-relation subgraph has limited influence. Hence, we first consider the boundary vertices in the positive-relation subgraph.
Optimal Vertex Selection in the Positive-Relation Subgraph. Given a path in the positive-relation subgraph, we can use a binary search method to select the boundary vertices. We initially crowdsource the mid-vertex on the path. Based on the result of the mid-vertex, we determine the next step. There are two situations. If the mid-vertex is colored Green, its descendants’ colors can be inferred but its ancestors’ colors cannot. Thus, we crowdsource the next mid-vertex between the current vertex and the source vertex of the path. If the mid-vertex is colored Red, its ancestors’ colors can be inferred but its descendants’ colors cannot. Therefore, we crowdsource the next mid-vertex between the current vertex and the destination vertex of the path. Iteratively, we can find all the boundary vertices. For a path with $n$ vertices, the number of crowdsourced vertices is $O(\log n)$.
Optimal Vertex Selection in the Contradictive-Relation Subgraph. We greedily select the vertices with no in-edge in this subgraph and the vertices with the largest confidence value in each contradictive group (i.e., connected subgraph), since only Green vertices can be used to infer the colors of their children.
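The binary search on a single path can be sketched in Python as follows. The sketch assumes the answers are consistent with the coloring strategy, so that the path consists of a Red prefix followed by a Green suffix; the function name, the `ask` callback, and the toy path are illustrative.

```python
def color_path(path, ask):
    """Color a positive-relation path (source first) using binary search.

    Assumes consistent answers: Green propagates toward the destination and
    Red toward the source, so the colors form a Red prefix and a Green suffix.
    """
    lo, hi = 0, len(path)      # index of the first Green vertex lies in [lo, hi]
    asked = 0
    while lo < hi:
        mid = (lo + hi) // 2
        asked += 1
        if ask(path[mid]):
            hi = mid           # path[mid:] is Green; continue toward the source
        else:
            lo = mid + 1       # path[:mid + 1] is Red; continue toward the destination
    colors = {v: ("Green" if i >= lo else "Red") for i, v in enumerate(path)}
    return colors, asked

# Toy path of eight vertices; vertices v5..v7 are correct.
path = [f"v{i}" for i in range(8)]
truth = {v: i >= 5 for i, v in enumerate(path)}
colors, asked = color_path(path, lambda v: truth[v])
```

Here three questions color all eight vertices, compared with eight questions for the naive approach.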
4.3.3. Path-Based Algorithm
We can divide the graph into a set of disjoint paths (i.e., any two paths have no common vertices). Then we can use the binary search method described above to determine the vertices for crowdsourcing. As the maximum length of a path is $|V|$, the number of crowdsourced vertices is $O(k \log |V|)$, where $k$ is the number of disjoint paths. If $k = |V|$, we need to crowdsource all $|V|$ vertices.
Finding Disjoint Paths. In order to find the disjoint paths, we transform the graph into a bipartite graph whose two sides are both copies of the vertex set, with an edge between the copy of $u$ on the first side and the copy of $v$ on the second side whenever the graph has an edge $(u, v)$. We find a maximal matching in the bipartite graph, that is, a maximal set of edges in which no two edges share a vertex on either side. Any two matched edges that involve the two copies of the same original vertex must be on the same path. Based on this idea, we utilize the maximal matching to find the disjoint paths as follows.
Consider the maximal matching, the set of first vertices of its edges, and the set of second vertices of its edges. The vertices that never appear as a second vertex have no in-edges in the matching, and we can take them as the first vertices of paths. For each such vertex, if the matching contains an edge leaving it, we take the other endpoint of that edge as the second vertex of the path. Then we check whether the matching contains an edge leaving this second vertex. Iteratively, we can find the path starting at each first vertex. The paths computed using maximal matching are disjoint, complete, and minimal [30].
Then we propose a serial path-based vertex-selection algorithm. The pseudocode is shown in Algorithm 3. It computes disjoint paths in the positive-relation subgraph using maximal matching and selects the optimal vertex of the longest path to conduct crowdsourcing. When there is no path with length greater than 1, it selects an optimal vertex according to the contradictive-relation subgraph.
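The path construction can be illustrated with the following Python sketch. It is a sketch under assumptions rather than the algorithm of [30]: it assumes the graph is acyclic, and it computes a maximum matching with the standard augmenting-path method, which is one way to obtain a matching that cannot be extended. Function and variable names and the toy graph are illustrative.

```python
def disjoint_paths(vertices, edges):
    """Split an acyclic directed graph into vertex-disjoint paths via bipartite matching."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    match_in = {}   # v -> u means edge (u, v) is matched: u precedes v on a path

    def augment(u, visited):
        # Try to match the left copy of u to the right copy of some successor.
        for v in succ[u]:
            if v not in visited:
                visited.add(v)
                if v not in match_in or augment(match_in[v], visited):
                    match_in[v] = u
                    return True
        return False

    for u in vertices:
        augment(u, set())

    match_out = {u: v for v, u in match_in.items()}
    paths = []
    for v in vertices:
        if v not in match_in:            # no matched in-edge: v starts a path
            path = [v]
            while path[-1] in match_out: # follow matched edges to the end of the path
                path.append(match_out[path[-1]])
            paths.append(path)
    return paths

# Toy graph: a -> b -> c and a -> d -> e.
paths = disjoint_paths(["a", "b", "c", "d", "e"],
                       [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e")])
```

Every vertex appears on exactly one path, so binary search can then be run on each path independently.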

The serial path-based vertex-selection algorithm can only publish a single fact to a crowdsourcing platform at a time; it cannot crowdsource candidate facts simultaneously, which results in long latency. To overcome this drawback, we extend the path-based algorithm to support parallel settings, selecting multiple vertices and publishing the corresponding candidate facts simultaneously to the crowdsourcing platform in each iteration. The pseudocode is shown in Algorithm 4. We first identify the disjoint paths and use the optimal vertex selection strategy discussed in Section 4.3.2 to select one vertex from each path to conduct crowdsourcing in parallel. When there is no path with length greater than 1, we select the optimal vertices according to the contradictive-relation subgraph. Based on the answers of these vertices, we color the graph. Next we remove the colored vertices and repeat the above step until all the vertices are colored.

However, the parallel algorithm may generate conflicts. For example, if one crowdsourced vertex is colored Green and another is colored Red, then there is a conflict on any vertex that is a descendant of the Green vertex and an ancestor of the Red vertex in the positive-relation subgraph, because it is inferred as Green based on the former but as Red based on the latter. To resolve this conflict, we can use majority voting to decide the color of the conflicting vertex and randomly choose one color if a tie occurs.
4.3.4. Topological-Sorting-Based Algorithm
Note that computing the maximal matching [30] is too slow in practice when used for a large knowledge base. To address this issue, we perform a topological sorting on the vertices. We first identify the set of vertices with zero in-degree. Then we delete them from the graph and find another set of vertices whose in-degrees are now zero. We repeat this step until all vertices are deleted, which yields a sequence of sets. Vertices in the same set have no edges between them (as their in-degrees are 0 when the set is extracted). Therefore, each set can be considered as an independent set.
We design a topological-sorting-based algorithm that avoids the cost of the maximal matching. It first computes the topologically sorted sets in the positive-relation subgraph. Then it crowdsources the vertices in the middle set in parallel. When only a single set remains, it selects optimal vertices according to the contradictive-relation subgraph to conduct crowdsourcing. Based on the results of these vertices, it colors the graph and removes the colored vertices from the sets. It repeats the above step and iteratively colors all vertices. The pseudocode of the topological-sorting-based algorithm is shown in Algorithm 5.
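The layering step can be sketched in Python as follows. This is a minimal sketch assuming an acyclic graph; the function name and the toy graph are illustrative, and picking `layers[len(layers) // 2]` as the middle set is an assumption about how the middle is chosen.

```python
def topological_layers(vertices, edges):
    """Peel off zero-in-degree vertices repeatedly; each peeled set is one layer."""
    indeg = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    layers = []
    current = [v for v in vertices if indeg[v] == 0]
    while current:
        layers.append(current)
        nxt = []
        for u in current:            # deleting u lowers the in-degree of its children
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        current = nxt
    return layers

# Toy graph: a -> b -> c and a -> d -> e.
layers = topological_layers(["a", "b", "c", "d", "e"],
                            [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e")])
middle = layers[len(layers) // 2]    # vertices to crowdsource in parallel
```

Each edge is visited once, so the layering takes time linear in the size of the graph, which is the motivation for preferring it over matching on large knowledge bases.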

4.4. Tolerating Errors
There are two types of possible errors in our graph-based framework. The first type is caused by workers’ errors and the second type is propagated through inference rules. For example, suppose a candidate fact is actually incorrect, but the workers wrongly label it as correct. This error is caused by the workers. Now consider a contradictory fact of , whose true label is correct. Our graph-based algorithms could wrongly label it as incorrect using inference rules. This error is propagated through inference rules. We discuss how to address these errors in our framework below.
Confidence of Workers’ Answers. To tolerate workers’ errors, we assign each candidate fact to multiple workers and aggregate their answers. There are many methods to compute the confidence of workers’ answers. We use majority voting as an example; any other technique can be integrated into our framework. Suppose each candidate fact is assigned to workers and workers vote a consensus answer (e.g., Yes) and workers vote the other answer (e.g., No). The confidence of the voted answer is .
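As a rough illustration of aggregating workers’ answers by majority voting, the sketch below returns the consensus answer together with a confidence value. The confidence is taken here as the fraction of workers who gave the consensus answer; this definition and the function name `aggregate_answers` are assumptions for illustration and may differ from the formula used in the paper.

```python
from collections import Counter

def aggregate_answers(answers):
    """Return (consensus_answer, confidence) for one candidate fact.

    answers: non-empty list of worker answers such as "Yes" / "No".
    Confidence is ASSUMED to be the share of workers who voted for the
    consensus answer; other aggregation techniques could be plugged in.
    """
    counts = Counter(answers)
    consensus, votes = counts.most_common(1)[0]
    return consensus, votes / len(answers)

answer, confidence = aggregate_answers(["Yes", "Yes", "Yes", "Yes", "No"])
```

With five workers, four "Yes" votes and one "No" vote give the consensus answer "Yes" with confidence 0.8 under this assumed definition.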
Error-Tolerant Coloring. For each crowdsourced fact, if the confidence of workers on this fact is high, for example, greater than 0.8, we use inference rules to label related candidate facts; otherwise, we label it uncertain (color the vertex in the graph as Blue) and do not use it to infer the labels of other candidate facts. We take the Green and the Red vertices as ground truths, as their answers have high confidence. Then we utilize them to color the Blue (uncertain) vertices. Specifically, we use the facts in Green and Red vertices to learn a logistic regression model on the machine confidence scores (provided by the information extraction system) and predict the labels of the facts in Blue vertices.
The pseudo code of our error-tolerant coloring algorithm is shown in Algorithm 6. It uses the coloring strategy only for the vertices with high-confidence answers (line 5) and utilizes the logistic regression model to color the vertices with low-confidence answers (lines 11-12).
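The confidence gate and the regression fallback can be sketched as below. This is not Algorithm 6: it omits the propagation of colors through inference rules, the function names and the input layout are hypothetical, and the one-feature logistic regression is fitted with plain gradient descent (the learning rate and epoch count are arbitrary choices). It also assumes at least one high-confidence fact is available for training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(correct | score) = sigmoid(w * score + b) by gradient descent."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

def error_tolerant_colors(facts, min_confidence=0.8):
    """facts: dict fact_id -> (machine_score, crowd_answer, crowd_confidence).

    High-confidence answers are colored Green/Red directly; the rest are
    first marked Blue and then labeled by a model trained on the others.
    """
    colors = {}
    for fid, (_, answer, conf) in facts.items():
        if conf > min_confidence:
            colors[fid] = "Green" if answer == "Yes" else "Red"
        else:
            colors[fid] = "Blue"  # uncertain: not used for inference

    train = [(facts[f][0], 1 if c == "Green" else 0)
             for f, c in colors.items() if c != "Blue"]
    w, b = fit_logistic([x for x, _ in train], [y for _, y in train])

    for fid, color in list(colors.items()):
        if color == "Blue":
            p = sigmoid(w * facts[fid][0] + b)
            colors[fid] = "Green" if p >= 0.5 else "Red"
    return colors

example = {
    "f1": (0.9, "Yes", 1.0), "f2": (0.8, "Yes", 1.0),
    "f3": (0.2, "No", 1.0), "f4": (0.1, "No", 1.0),
    "f5": (0.95, "Yes", 0.6), "f6": (0.05, "Yes", 0.6),
}
colors = error_tolerant_colors(example)
```

In this toy input, `f5` and `f6` both received a low-confidence "Yes", so their final colors are decided by the machine confidence score rather than by the crowd answer.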

5. Experiments
In this section, we evaluate our methods and report experimental results.
5.1. Experimental Setup
Datasets. NELL [4] generates a knowledge base iteratively. In each iteration, NELL uses facts learned from the previous iteration and a corpus of web pages to generate a new set of candidate facts. To build a high-precision knowledge base, NELL selectively promotes those candidates that have a high confidence from the extractors and obey ontological constraints with the existing knowledge base. We evaluate our method on the extractions of the 165th iteration of NELL released by [7], which contain over 1 M extractions, with a manually labelled test set consisting of 4546 instances and a training set consisting of 9866 instances. There are 70 K ontological relations in total. Table 1 shows the statistics of the data set. The training set can be used to calibrate the confidence scores from the original system. When training data is not available, we can adopt the confidence provided by the information extraction system as the probability.
We calculate contradictoriness scores among all candidate facts and select candidate facts from the test set for crowdsourcing. We use a threshold of 0.5 for the confidence score. For crowdsourced data, a fact is treated as correct only when more than half of the crowd answers are “Yes.” We compare our methods with other popular methods in terms of the quality, the number of questions, and the number of iterations. To evaluate the quality, we use three metrics, that is, precision, recall, and F1. Suppose the set of correct facts is $C$ and the set of facts that an algorithm reports as correct is $A$. Then the precision is $p = |C \cap A| / |A|$, the recall is $r = |C \cap A| / |C|$, and the F-measure is $2pr / (p + r)$.
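The three quality metrics follow their standard set-based definitions and can be computed as in the sketch below. The function name and the convention of returning 0.0 for empty sets are assumptions made for illustration.

```python
def precision_recall_f1(correct, reported):
    """Compute precision, recall, and F1 from two collections of fact ids.

    correct: facts that are truly correct.
    reported: facts that an algorithm reports as correct.
    Empty denominators yield 0.0 (a convention chosen for this sketch).
    """
    correct, reported = set(correct), set(reported)
    hits = len(correct & reported)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here two of the three reported facts are correct (precision 2/3), two of the four correct facts are found (recall 1/2), and the resulting F1 is 4/7.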
Crowdsourcing on MTurk. We use MTurk for crowdsourcing. We post all candidate facts in the test set to MTurk and record the crowd’s answers in a local file. During our experiments, when a method requests to crowdsource candidate facts, we retrieve the answers from this file instead of posting facts to MTurk. This ensures that all methods use the same set of crowdsourced results, which makes the comparison fair. We take four microtasks as one HIT, where each microtask contains a candidate fact. To help workers understand the fact, we provide a description of each category or relation and use a human-readable format for each entity (see Box 1 as an example). We pay $0.02 each time a worker completes an HIT and $0.01 to MTurk for publishing each HIT. We assign each HIT to five workers. We require that each worker has an approval rate greater than 95%. This setting is intended to ensure that all workers provide reasonably accurate answers to the HITs.
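The payment scheme can be turned into a small cost calculation. The sketch below works in cents to avoid floating-point rounding and assumes that the publishing fee is charged once per HIT rather than once per worker assignment; that reading, like the function name, is an assumption, and the total changes if the fee applies per assignment.

```python
import math

def crowdsourcing_cost_cents(num_facts, facts_per_hit=4, workers_per_hit=5,
                             reward_cents=2, fee_cents=1):
    """Estimate the cost in cents of crowdsourcing num_facts candidate facts.

    Facts are grouped into HITs of facts_per_hit microtasks; every HIT is
    answered by workers_per_hit workers who each earn reward_cents.
    ASSUMPTION: fee_cents is paid to the platform once per published HIT.
    """
    hits = math.ceil(num_facts / facts_per_hit)
    return hits * (workers_per_hit * reward_cents + fee_cents)

cost = crowdsourcing_cost_cents(8)
```

Under these assumptions, eight facts form two HITs, and each HIT costs 5 × 2 + 1 = 11 cents, for a total of 22 cents.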
5.2. Experimental Results
We first compare our method with state-of-the-art methods for knowledge refining. Then we evaluate our rank function, question selection strategies, and error-tolerant techniques, respectively.
5.2.1. Evaluation of Our Methods
To evaluate the effectiveness of our proposed techniques, we compare our methods Rank, Graph, and Graph+ (the graph-based method with error-tolerant techniques) with two recent methods for cleaning automatically extracted knowledge bases, that is, MLN [8] and PSL [7], using previously reported results on the same evaluation set. We also compare with the default strategy used by the NELL [4] project to choose candidate facts to include in the knowledge base.
MLN [8]. This method defines a Markov logic network (MLN) to perform joint probabilistic inference over candidate facts. We compare our method against the best-performing MLN model from [8], which expresses ontological constraints and candidate and promoted facts through logical rules. The MLN method reports an output with a 0.5 marginal probability cutoff, which maximizes the F1 score.
PSL [7]. This method uses probabilistic soft logic (PSL) to reason jointly over candidate facts and identify coreferent entities, which makes inference more efficient. The PSL method reports results using a soft-truth threshold of 0.55 to maximize F1.
NELL [4]. We also compare with the default strategy used by the NELL project to choose candidate facts to include in the knowledge base. We take the promoted facts as its result.
Given a budget (e.g., 40% of candidate facts), our Rank method selects the top candidate facts to crowdsource at one time. For the Graph and Graph+ methods, we construct a graph with the top candidate facts and use the topological-sorting algorithm to select questions for crowdsourcing. Figure 3(a) shows a comparison of the overall performance of our Rank, Graph, and Graph+ methods. We report the results of our methods under different budgets. From Figure 3(a), we can see that, given a larger crowdsourcing budget, our method obtains a higher performance. Graph+ and Rank achieve a similar quality. However, Graph+ asks fewer questions than Rank, as shown in Section 5.2.3. Graph+ outperforms Graph, because Graph+ tolerates workers’ errors by not coloring unconfident vertices and thus avoids amplifying the error of a wrongly colored vertex. Figure 3(b) shows a comparison of the overall performance with the state-of-the-art methods. From Figure 3(b), we can see that MLN and PSL perform well in precision or recall, respectively. Our method improves both precision and recall. Overall, our method improves significantly on F1. With a reasonable budget (above 20% of the test instances), our method outperforms both the MLN and PSL methods in terms of F1.
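Figure 3: (a) Comparison of our Rank, Graph, and Graph+ methods. (b) Comparison with state-of-the-art methods.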
5.2.2. Evaluation on Rank Function
In this experiment, we evaluate our ranking function, which is key to selecting candidate facts for crowdsourcing in the rank-based method. This function, denoted as UC, quantifies the usefulness of a candidate fact by considering both its uncertainty and its contradictoriness with other facts. We compare UC against the following baselines. Method Uncertainty considers only uncertainty scores. Method considers only contradictoriness scores. Method Random selects candidate facts for crowdsourcing randomly.
Figure 4 shows the results of F1 using different ranking functions. The UC method achieves the highest F1. The Uncertainty method achieves the highest recall at the beginning, but its improvement slows as the budget increases. This is because there are many false positives and negatives among candidate facts with confidences around the threshold, which have higher uncertainty scores. The error rate drops quickly as the difference between confidence and threshold increases, while considering contradictoriness can still help detect potentially erroneous facts effectively. The method achieves better precision than . consistently performs the worst.
Figure 4: (a) F1. (b) Precision. (c) Recall.
5.2.3. Evaluation on Question Selection
From Section 5.2.1, we can see that the Graph+ method achieves quality similar to that of the Rank method. In this section, we focus on the efficiency of question selection algorithms in terms of the number of questions and the number of iterations. We evaluate the path-based and topological-sorting-based question selection algorithms proposed in Sections 4.3.3 and 4.3.4. We compare four algorithms: Random, which randomly selects a vertex in each iteration; SinglePath, which selects a vertex from the longest path in each iteration; Multipath, which selects multiple vertices from multiple disjoint paths in each iteration; and TopologicalSorting, which selects multiple independent vertices based on topological sorting in each iteration. We compare them in terms of the quality, the number of questions, and the number of iterations, shown in Figure 5.
Figure 5: (a) Quality. (b) Questions. (c) Iterations.
From Figure 5(a), we can see that the four methods achieve similar quality, because different question orders do not affect the quality based on inference rules. From Figure 5(b), we can see that the two parallel algorithms Multipath and TopologicalSorting crowdsource a few more questions than SinglePath. This is because Multipath may crowdsource vertices with ancestor-descendant relationships and TopologicalSorting may crowdsource vertices with the same descendants, both of which our serial algorithm SinglePath avoids based on the inference rules. TopologicalSorting outperforms Multipath because TopologicalSorting crowdsources independent questions in each iteration while Multipath may crowdsource dependent questions. SinglePath outperforms Random and reduces the number of questions. This is because SinglePath can effectively identify the boundary vertex using the optimal vertex search strategy. From Figure 5(c), the two parallel algorithms Multipath and TopologicalSorting significantly outperform SinglePath and Random as they crowdsource questions in parallel.
To evaluate our graph-based method (Graph) on reducing the number of questions, we conduct additional simulation experiments on the complete dataset, using NELL beliefs as ground truths and simulating workers with an accuracy of 90%. Our experimental results are shown in Figure 6. Figure 6 shows that our graph-based method crowdsources fewer questions than our rank-based method (Rank). It saves more than 30% of the questions compared with the rank-based method. This is because we can utilize the inference rules to prune many candidate facts that do not need to be crowdsourced. The rank-based method achieves a higher quality at the expense of crowdsourcing many more questions. Besides, the graph-based method involves only a few iterations, because it can crowdsource many questions in parallel.
Figure 6: (a) Quality. (b) Questions. (c) Iterations.
5.2.4. Evaluation on the Error-Tolerant Solution
In this section, we evaluate the effectiveness of our error-tolerant solution (proposed in Section 4.4) by comparing two algorithms: Graph, which does not consider errors, and Graph+, which extends Graph to tolerate errors. We use simulated workers and conduct the evaluation under different accuracy levels of crowdsourcing workers (i.e., 70%, 80%, and 90%) on the test dataset. We compare Graph+ with Graph in terms of quality, the number of questions, and the number of iterations. Our experimental results are shown in Figure 7.
Figure 7: (a) Quality. (b) Questions. (c) Iterations.
From Figure 7, we can see that Graph+ achieves a higher quality than Graph, because it can tolerate the errors introduced by crowdsourcing workers and avoid error propagation along the inference rules. Graph+ significantly outperforms Graph for low-quality workers. As the accuracy level of the workers increases, the improvement decreases. On the other hand, Graph+ crowdsources slightly more questions than Graph. This is because Graph+ does not apply the inference rules to some facts, which reduces the number of inferred facts. From Figure 7(c), we can see that the two methods have the same number of iterations. This is expected, since the only difference between Graph+ and Graph is that Graph+ does not infer the answers for some unconfident facts. The accuracy level of crowdsourcing workers has little impact on the number of questions and the number of iterations for both methods, because these numbers are determined by the graph structure. Therefore, we can use the error-tolerant technique to improve the quality of the knowledge base.
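Simulated workers with a fixed accuracy level, as used in this section, could be modeled along the following lines. The sketch assumes that workers answer independently, that each fact is given to five workers, and that answers are aggregated by majority vote; the function name and these modeling choices are assumptions, since the simulation details are not spelled out above.

```python
import random

def simulate_majority_answer(truth, accuracy, num_workers=5, rng=random):
    """Simulate the majority answer of num_workers independent workers.

    truth: True if the candidate fact is actually correct.
    accuracy: probability that a single worker answers correctly.
    Returns True if more than half of the simulated workers say "correct".
    """
    votes = [truth if rng.random() < accuracy else (not truth)
             for _ in range(num_workers)]
    return sum(votes) > num_workers / 2

rng = random.Random(0)
trials = 1000
agreed = sum(simulate_majority_answer(True, 0.9, rng=rng) for _ in range(trials))
```

With independent workers, the majority answer is right more often than any single worker, so `agreed / trials` is typically well above the individual accuracy of 0.9 in this setting.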
6. Conclusions
We proposed a cost-effective method for cleaning automatically extracted knowledge bases using crowdsourcing. Our method uses a ranking score to select the most beneficial candidate facts for crowdsourcing in terms of improving the quality of knowledge bases. We constructed a graph based on the semantic constraints and utilized the graph to crowdsource questions and infer answers. We evaluated the effectiveness of our methods on real-world web extractions from NELL. Our experimental results showed that our method outperforms both MLN-based and PSL-based methods in terms of F1 under a reasonable crowdsourcing cost.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by Chinese NSFC (61170020, 61402311, and 61440053), Jiangsu Province Colleges and Universities Natural Science Research project (13KJB520021), Jiangsu Province Postgraduate Cultivation and Innovation project (CXZZ13_0813), and the US National Science Foundation (IIS-1115417).