Abstract
Distant supervision (DS) automatically annotates free text with relation mentions from existing knowledge bases (KBs), providing a way to alleviate the problem of insufficient training data for relation extraction in natural language processing (NLP). However, the heuristic annotation process does not guarantee the correctness of the generated labels, raising a hot research issue on how to efficiently make use of the noisy training data. In this paper, we model two types of biases to reduce noise: (1) bias-dist to model the relative distance between points (instances) and classes (relation centers); (2) bias-reward to model the possibility of each heuristically generated label being incorrect. Based on the biases, we propose three noise-tolerant models: MIML-dist, MIML-dist-classify, and MIML-reward, built on top of a state-of-the-art distantly supervised learning algorithm. Experimental evaluations against three landmark methods on the KBP dataset validate the effectiveness of the proposed methods.
1. Introduction
With the explosion of web resources, traditional supervised machine learning, which relies on a small set of manually annotated training samples, may not be able to catch up with up-to-date information needs. The same holds for relation extraction, a hot research issue in NLP that predicts semantic relations for a pair of named entities.
DS (distant supervision/weak supervision) annotates large-scale free text with relation mentions from existing KBs, providing a way to alleviate the problem of insufficient training data for relation extraction. DS assigns relation labels to sentences according to relation mentions whenever a sentence contains a certain pair of entities, without checking whether the sentence actually conveys the corresponding semantic relation. Therefore, the heuristic annotation process does not guarantee the correctness of the generated labels, raising a hot research issue on how to efficiently make use of the noisy training data.
For example, suppose a named entity pair 〈Obama, Hawaii〉 has three valid relation labels, travel_to, born_in, and study_in, according to the KB (i.e., Freebase); multiple sentences containing this entity pair from large-scale free text will be marked as conveying one of the three relations (see S1–S3 in Table 1). However, we are not able to decide the specific relation label for each sentence in advance from these annotations. Therefore, it is difficult for traditional supervised learning algorithms to learn directly from these heuristically annotated sentences. In addition, due to the incompleteness of either the KB or the free text, false negatives (FNs) and false positives (FPs) are inevitable, giving rise to noisy training labels. For example, at least one sentence in S1–S3 (Table 1) should be labeled with study_in according to the DS heuristics, but actually none of them conveys this relation due to the incompleteness of the free text. Similarly, because of the incompleteness of the KB, we can hardly find any relation in the KB that S3 expresses.
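The DS annotation heuristic described above can be sketched as follows. The toy KB and sentences are hypothetical and only illustrate how a bag and its label set are constructed; this is not the paper's actual pipeline:

```python
def distant_label(sentences, kb):
    """Heuristic DS annotation: every sentence containing an entity pair is
    grouped into that pair's bag and the bag receives ALL relations the KB
    lists for the pair, without checking which sentence expresses which."""
    bags = {}
    for sent, pair in sentences:
        if pair in kb:
            bags.setdefault(pair, {"mentions": [], "labels": set(kb[pair])})
            bags[pair]["mentions"].append(sent)
    return bags

# Toy KB and corpus echoing the Obama/Hawaii example above.
kb = {("Obama", "Hawaii"): {"travel_to", "born_in", "study_in"}}
sentences = [
    ("Obama was born in Hawaii.", ("Obama", "Hawaii")),
    ("Obama returned to Hawaii for vacation.", ("Obama", "Hawaii")),
]
bags = distant_label(sentences, kb)
print(bags[("Obama", "Hawaii")]["labels"])
```

Note that both sentences end up sharing all three labels, which is exactly the ambiguity the bag-level formulation has to cope with.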
Previous research has presented several methods to utilize the heuristically generated labels and train weak classifiers to predict unseen relations: single-instance learning (SIL) [1], multi-instance learning (MIL) [2], multi-instance multi-label learning (MIML) [3, 4] and some related extensions [5, 6], the embedding model [7], the matrix factorization method [8, 9], and so on. Among them, multi-instance multi-label learning for relation extraction (MIML-RE) proposed by [4] is one of the state-of-the-art learning paradigms. MIML-RE treats the multiple sentences (i.e., S1–S3) that contain a certain pair of named entities as a bag (multi-instance) and marks all possible labels (i.e., travel_to, born_in, and study_in) on these sentences (multi-label). Through modeling the distantly labeled data at the bag level as well as the instance level, MIML-RE transforms the data into a form that traditional supervised learning algorithms can easily deal with (see more details in Section 4).
Nevertheless, as mentioned above, the data generated by DS heuristics contain noise such as wrong labels, so it is necessary to equip the learning algorithms with noise reduction methods. In addition, we argue that the instance diversity problem is prevalent when training weak classifiers. That is, we cannot guarantee that all the instances collected and labeled through the DS heuristics are of high quality, unlike those labeled by human annotators.
The noise problem is also evident in previous research such as [10]. In [10], the authors manually annotated over 1800 sentences from free text and compared them to the KB. They found 5.5% FNs and 2.7% FPs, which is good evidence for using noise reduction strategies. Besides, from our observations of the training data, we found that the instance diversity problem is remarkable in the dataset, reflected by the distributions of expectations (see Figure 2).
In this paper, we model two types of biases for noise reduction: (1) bias-dist to model the relative distance between points (instances) and classes (relation centers); (2) bias-reward to model the possibility of each heuristically generated label being incorrect. Bias-dist is modeled to weaken the maximum probability assumption (the class with the maximum probability should be assigned) during the EM process in MIML-RE, so that the class with the maximum probability is not always accepted. This bias is proposed in view of the diverse qualities of training instances. Bias-reward is modeled to weaken the impact of wrong labels, so that wrong labels receive low predicting confidence. This bias aims at efficiently modeling the noisy group-level labels. Based on the biases, we propose three methods, MIML-dist (multi-instance multi-label learning with distance), MIML-dist-classify (multi-instance multi-label learning with distance for classification), and MIML-reward (multi-instance multi-label learning with reward), building on top of the MIML-RE framework. Therefore, this work can be seen as an extension of MIML-RE.
We set up experiments on one of the most popular benchmark datasets, the KBP dataset built by Surdeanu et al. [4]. Evaluation results compared with three landmark algorithms validate the effectiveness of the proposed methods. In particular, MIML-dist-classify operates in the predicting phase, which is simple and fast, and boosts the F1 from the baseline's 27.3% to 29.03%. MIML-reward converges much faster than the original algorithm and reaches 29.01% on F1.
The contributions of this paper can be summarized as follows: (1) we are the first to explicitly model the bias related to the instance diversity problem and gain considerably better results; (2) the modeling methods for the two types of biases are both validated to be effective through the experiments.
The rest of the paper is organized as follows: Section 2 briefly reviews the literature; Section 3 describes the two types of biases; the models are described in detail in Section 4; Sections 5 and 6 present the implementation details and experiments; the discussion and conclusion come last.
2. Related Work
In this section, we briefly review the literature on distantly supervised relation extraction and its noise reduction methods.
2.1. Relation Extraction
Relation extraction (RE) is a hot research issue in NLP. In early research, various approaches based on rich syntactic and semantic features were proposed. For example, Zelenko et al. introduced various subtree kernels with Support Vector Machine and Voted Perceptron learning algorithms [11]. In [12], the authors proposed three types of subsequence kernels for RE on protein-protein interactions and top-level relations from newspapers. Zhou et al. used a tree kernel-based method with rich syntactic and semantic information and a context-sensitive convolution tree kernel [13]. Recent work has focused mostly on deep neural network based structures, for example, single convolutional neural network based models [14, 15] and the combination of recursive neural network and convolutional neural network based models [16].
2.2. Distant Supervision for Relation Extraction
Distant supervision was first introduced in the biomedical domain by mapping databases to medical texts [17]. Since then, DS has gained much attention in both information extraction (IE) and, further, RE. Most of the earlier research, including [18, 19], used single-instance learning under the assumption that one entity pair corresponds to only a single relation. In recent years, distant supervision has been widely used in open IE to map Wikipedia infoboxes to wiki contents or web-scale texts [20, 21]. For RE, distant supervision is also employed to map Freebase relations to large-scale free text (i.e., the New York Times) and predict relations for unseen entity pairs [1–4]. Most of the aforementioned work used SIL, MIL, or MIML to train classifiers, which set strong baselines in this field. In addition, recent research also includes embedding based models that transfer the relation extraction problem into a translation model [22–24], non-negative matrix factorization (NMF) models [8, 9] with the characteristic of training and testing jointly, integrating active learning and weakly supervised learning [25], integer linear programming (ILP) [26], and so on.
2.3. Noise Reduction Methods for DS
One type of noise is that we cannot decide the actual label for each instance and can only estimate it according to some constraints. At-least-one is a representative constraint which considers a relation label positive when at least one of the mentions in the bag gets the label, discarding the others. Related work directly modeled the noisy training data with multi-instance frameworks and learned model parameters through several EM iterations [2–4]. Intxaurrondo et al. [27] employed several heuristic strategies to remove useless mentions. Xu et al. [10] employed a passage retrieval model to expand the training data from instances with high confidence. Takamatsu et al. [28] directly modeled the patterns that express the same relation. Another type of noise is wrong bag-level labels due to the incompleteness of either the KB or the textual corpus. Min et al. [5] added another layer to the MIML-RE architecture that models the true labels of a bag, addressing the incompleteness of the KB. Ritter et al. [6] added two parameters to directly model the missing texts and missing KB entries and set them to fixed values; they also considered side information such as the popularity of entities. Fan et al. [9] added a bias factor to their model to represent the noise. The idea of considering the instance diversity problem, which relates to data quality, is somewhat similar to Xu's passage retrieval model [10], but we approach it from a distinct perspective. The bias modeling idea resembles [6], but we model the missing data in an indirect way that employs ranking-based measures.
3. Biases
In this section, we generally describe the two types of biases we propose.
3.1. Bias-Dist
Bias-dist (bias related to distance) aims at tackling the instance diversity problem arising from the DS annotation process. In traditional supervised machine learning, when human annotators label training instances, they tend to label those instances that they are confident of and discard the others, so that a pure training set can be created. A typical example is the annotation agreement standard for evaluating a corpus: the instances whose labels show hardly any disagreement are usually considered to be of high quality. By contrast, there is no human intervention in the DS annotation, so the quality of training samples cannot be guaranteed. As a result, when we use the classifier to assign relation labels to instances, it is likely that the predicting score for the true relation label is lower than that for a false label.
More concretely, we assume that the predicting scores for the instances from two relation classes are drawn from two individual Gaussian distributions and show the case described above in Figure 1. Suppose N_A and N_B are two Gaussian distributions with expectations μ_A and μ_B, respectively. We further assume that the predicting scores for class A and class B follow these two distributions. The x axis of Figure 1 denotes the predicting probabilities (scores), which range from 0 to 1. Suppose that a point p on the axis indicates a high probability (score) when predicting a certain instance using a multi-class classifier. For both of the two classes, p may be an acceptable predicting score based on which we can classify the corresponding instance into the positive area for either of them. However, the distances from p to μ_A and μ_B reflect that distribution N_A has a much stronger ability to generate p than N_B. Thus in this case, according to the predicting score and the means for the two classes, the instance should be classified to A rather than B.
To conclude, if the predicting scores for instances of different relation classes are distributed diversely, the maximum probability assumption may not work well. Bias-dist is proposed to weaken this assumption by replacing the absolute predicting score with a relative form.
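This intuition can be sketched as follows. The exact relative form used by our models is given in Section 4; the sketch below (with hypothetical scores and class means) only illustrates how a distance-to-mean score can flip the decision made under the maximum probability assumption:

```python
def predict_with_bias_dist(scores, class_means):
    """Pick the class whose predicting score best matches that class's
    typical score, rather than the class with the maximum absolute score.
    Here the (negative) distance to the class mean serves as the relative
    score; this is one plausible instantiation, not the paper's exact form."""
    best_class, best_rel = None, float("-inf")
    for label, score in scores.items():
        rel = -abs(score - class_means[label])
        if rel > best_rel:
            best_rel, best_class = rel, label
    return best_class

# Hypothetical case: class B gets the higher absolute score, but that score
# sits far below B's class mean, while the same score is typical for class A.
scores = {"A": 0.62, "B": 0.64}
class_means = {"A": 0.60, "B": 0.90}
print(predict_with_bias_dist(scores, class_means))  # picks "A", not "B"
```

A plain argmax over `scores` would choose B; the relative form recognizes that 0.64 is a weak score for B but a normal one for A.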
3.2. Bias-Reward
Bias-reward (bias according to label reward) is proposed to model the incompleteness of the KB and the textual corpus. The most typical setting for multi-instance learning in the literature is training bags; that is, multiple instances containing a certain pair of named entities fall into the same bag and share the same sets of labels.
As illustrated in Table 1, the bag-level labels come from the KB and thus suffer from the incompleteness problem of the KB or the free text. Further, many works [4, 29, 30] define bag-level positive labels as those from the KB and negative labels as the relations that the key entity (the first named entity in the pair) does not have according to the KB. As crucial constraints for distant supervision, noisy bag-level labels have a bad effect on the models. More formally, the constraints require that every positive label l ∈ P_i is expressed by at least one instance in the bag (∃j: z_ij = l), while no instance expresses a negative label l ∈ N_i (∀j: z_ij ≠ l), where P_i and N_i denote the positive and negative label sets for the i-th bag and z_ij is the relation label for the j-th instance in the bag. If the KB is incomplete, the bag would carry wrong negative labels. And if the textual corpus is not complete, the bag may be associated with wrong positive labels.
The problem of incompleteness is inevitable and has become a popular issue for distant supervision. Take the entity Barack Obama as an example: if the KB we refer to is a 2005 version but the free texts are recently collected, the relation president_of 〈Obama, U.S.〉 would wrongly be treated as a negative label, and the sentences containing Obama and U.S. would be assigned as negative instances for the relation president_of.
To reduce the bad effects of incorrect negative labels, we add a reward to each bag-level negative label and multiply it by a weighting factor that reflects the likelihood of being non-negative. Meanwhile, we add a penalty to each positive label together with a weighting factor that reflects the likelihood of being non-positive. We use a ranking-based method to determine the likelihood by computing rankings among all possible labels. More details are described in Section 4.4.
4. Bias Modeling for Multi-Instance Multi-Label Learning
In this section, we introduce the details of our methods for bias modeling.
4.1. Notations and Concepts
MIML takes a number of bags as the training data, learns a two-layer (the instance level and the bag level) weak classifier, and predicts relations for unseen sentences. For an easier description of MIML-RE and our methods, we define the following notations and concepts:
(i) D, the whole textual corpus;
(ii) R, the set of all known relation labels;
(iii) R(e), the set of known relations for a certain entity e (the first/key entity in a pair) from the KB;
(iv) instance and sample, a sentence that contains the target entity pair and its quantized version for classification, respectively;
(v) bag/group, a set of instances that contain the same entity pair;
(vi) z, the instance-level classifier (z classifier), a multi-class classifier;
(vii) y, the bag-level classifier (y classifier), a set of binary classifiers;
(viii) μ_r, the expectation/mean of the probability distribution on predicting scores for relation r;
(ix) σ_r², the variance of the probability distribution on predicting scores for relation r;
(x) x, an instance in the dataset;
(xi) B_i, the i-th bag;
(xii) P_i, the positive label set of the i-th bag;
(xiii) N_i, the negative label set of the i-th bag.
In addition, we use further symbols to denote class labels and predefined constants, which are introduced where they are first used. Thus, the training data for MIML-RE are constructed as follows: multiple instances containing the same pair of entities constitute a bag B_i, with all relations for the pair from the KB as its positive label set P_i and the relations the key entity does not have as the negative label set N_i.
4.2. MIML-Dist
We construct two individual models based on bias-dist: (1) MIML-dist adds bias-dist to the training steps of MIML-RE and updates the label assignment process (E-step); (2) MIML-dist-classify simply adds bias-dist in the testing step for predicting new sentences. Following MIML-RE, MIML-dist uses maximum likelihood estimation (MLE) to model the whole training data (2) and the hard expectation maximization (EM) algorithm to learn the model parameters iteratively ((3)–(6)). Consider
E-Step. For each instance in a bag, its label is decided by both the instance-level classifier and the bag-level classifier, as shown in (3), where the bag labels are those in which the label for the current instance has been updated by the z classifier and x_ij stands for the j-th instance in the i-th bag. The bias term is given by (4) and (5), where μ_r is the mean value of the predicting scores for class r.
We assume that the predicting scores for each relation r follow a Gaussian distribution and μ_r is its expectation. The variance σ_r² had little effect on the results according to our early experiments, so we discard it.
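The expectation μ_r can be estimated by averaging the predicting scores of the instances currently assigned to each relation, as sketched below (the assignments and scores are hypothetical, and the paper may compute this incrementally during EM):

```python
from collections import defaultdict

def class_expectations(assignments, scores):
    """Estimate μ_r for each relation r as the mean predicting score of the
    instances currently assigned to r during EM label assignment."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, score in zip(assignments, scores):
        sums[label] += score
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

# Hypothetical predicting scores for instances assigned to two relations.
assignments = ["born_in", "born_in", "travel_to"]
scores = [0.9, 0.8, 0.4]
means = class_expectations(assignments, scores)
print(means)  # born_in averages about 0.85, travel_to about 0.4
```

After every label update the stored means can simply be recomputed this way, which keeps the bias term consistent with the current assignments.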
Then the model parameters are updated through the M-step by (6).
M-Step. Consider the following:
The following two equations are used to infer the instance-level and bag-level labels through the corresponding classifiers.
Inference. Consider the following:
In the testing phase, similar to MIML-RE, MIML-dist also employs a Noisy-or model instead of at-least-one to avoid data sparsity.
Noisy-or Model. Consider the following:
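The Noisy-or aggregation can be sketched as follows; it assumes the standard form P(y | bag) = 1 − ∏_i (1 − p_i) over the instance-level probabilities, which is the usual Noisy-or used in this line of work:

```python
def noisy_or(instance_probs):
    """Bag-level probability that relation y holds, given instance-level
    probabilities p_i that each mention expresses y:
        P(y | bag) = 1 - prod_i (1 - p_i)
    The bag is positive if at least one mention is likely to express y,
    but unlike at-least-one, every mention contributes some evidence."""
    prob_none = 1.0
    for p in instance_probs:
        prob_none *= (1.0 - p)
    return 1.0 - prob_none

# Three mentions with modest individual confidence still yield a fairly
# confident bag-level prediction (roughly 0.79 here).
print(noisy_or([0.5, 0.4, 0.3]))
```

This soft aggregation is less brittle than at-least-one when individual mention scores are sparse or noisy, which is why it is preferred in testing.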
We show MIML-dist in Algorithm 1 (the procedures of MIML-dist-classify and MIML-reward are similar except for several minor steps). The expectation for each class is computed and stored after each label update in training (lines 1.6–1.7 in Algorithm 1).

4.3. MIML-Dist-Classify
Analogous to MIML-dist, MIML-dist-classify also assumes that the predicting score for each relation follows a Gaussian-like distribution. The difference between them is that MIML-dist-classify computes the expectations after all the training iterations and is thus much easier and simpler than MIML-dist. It normalizes the predicting probability in the testing phase by
When comparing MIML-dist with MIML-dist-classify from the perspective of time complexity, MIML-dist computes bias-dist of each relation label for each instance, which is costly when the training data are large or labels are updated frequently. Moreover, this cost makes the parameter tuning process more difficult. Comparatively, MIML-dist-classify changes the original training process very little, and once the model (i.e., MIML-RE) is trained, it need not be changed again. Parameter tuning occurs only in the testing phase, which is simple and fast.
To conclude, MIML-dist-classify is a kind of parameter tuning strategy on the classification hyperplanes. It is efficient for distant supervision because the training data in this task suffer from the instance diversity problem much more heavily than in most other supervised learning tasks, whose training data are carefully polished by annotators. It is very likely that the probability distributions for the relation classes differ from each other considerably. MIML-dist is a more direct way to model bias-dist, in that it changes the label assignment strategy by considering the impact of data diversity. Compared with MIML-RE, MIML-dist lowers the chance of getting trapped in local minima and is thus expected to perform better than MIML-RE.
4.4. MIML-Reward
Different from the previous two methods, which modify the probabilities produced by the instance-level (z) classifier, MIML-reward updates the probabilities generated by the bag-level (y) classifier. Concretely, the multiplication item in (3) is updated by bias-reward. As mentioned above, we introduce a ranking-based method to determine the likelihood of each bag-level label being incorrect and add a reward or penalty to the original probability. Formally, we define the following notions:
(i) For a positive label l, l is potentially wrong if some irrelevant (negative or unlabeled) labels have higher ranks than l.
(ii) For a negative label l, l is potentially wrong if it has a higher rank than some positive labels.
Moreover, we define x*_l as the instance that has the maximum predicting confidence for label l (the key instance for l) in the bag and c(l, S) as the number of labels in S that have a higher rank than l, in which I(·) is the indicator function (I(true) = 1 and I(false) = 0) and S can be any label set. Intuitively, for a positive label l in a bag, the bigger c(l, S) is, the more likely this label is wrong when setting S to be the non-positive label set, while, for a negative label l, the smaller c(l, S) is, the more likely this label is wrong, when setting S to be the positive label set. We employ two constants to denote the intensity of the above tendencies and take them as a penalty or reward for a single label. The posterior probabilities at the bag level are then computed with the bias applied (one form for a positive label and one for a negative label), where the normalization factor is set to be the number of irrelevant labels (for each positive label) or the number of positive labels (for each negative label), and two smoothing factors are applied.
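The ranking-based count can be sketched as follows. The confidences are hypothetical, labels are ranked by their key instance's confidence, and the actual reward/penalty weighting with its constants and smoothing factors is described above; this sketch only shows how suspicious labels are detected:

```python
def count_higher_ranked(label, label_set, confidence):
    """c(l, S): how many labels in S outrank label l, where a label's rank
    is its key instance's predicting confidence within the bag."""
    return sum(1 for other in label_set if confidence[other] > confidence[label])

# Hypothetical key-instance confidences for one bag.
confidence = {"born_in": 0.9, "travel_to": 0.6, "study_in": 0.2}
positives = {"study_in"}                # positive labels from the KB
negatives = {"born_in", "travel_to"}    # relations the KB says do not hold

# A positive label outranked by many irrelevant labels is suspicious ...
print(count_higher_ranked("study_in", negatives, confidence))  # large count
# ... and a negative label that few positives outrank is suspicious too.
print(count_higher_ranked("born_in", positives, confidence))   # small count
```

Here study_in (positive, count 2) would receive a penalty, while born_in (negative, count 0) would receive a reward, matching the two intuitions above.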
To conclude, MIML-reward is proposed to alleviate the problem of noisy labels. As we can read from (3), the label assignment is partly driven by the bag-level labels (the second multiplication item), which is built on the assumption that all the bag-level labels are correctly annotated. However, noisy labels are inevitable according to our previous analysis. The penalty and reward mechanism of bias-reward weakens this assumption, allowing that some labels could be wrong and can be discovered and accounted for during training. Similar ideas can be seen in [6, 8], which also took the incorrectness of bag-level labels into account.
5. Implementation Details
For a fair comparison, most of the implementation settings follow MIML-RE, including the number of training iterations for EM (up to 8) and the number of folds for cross-validation to avoid overfitting. The constants for MIML-dist and MIML-dist-classify were optimized on the development set and finally set to 0.7 and 0.5, and a further constant was set to 2 for both methods. The penalty and reward parameters were both set to 0.2. The smoothing parameters in MIML-reward were simply set to 0.01. We use the same features as MIML-RE, which include multiple syntactic and semantic features as well as dependency-based features. In addition, we added bias-dist only on the positive labels and discarded the negative label NIL. We also sampled 5% of the negative examples for training.
6. Experiments
6.1. Dataset Description
We test on the KBP dataset, one of the benchmark datasets in this area, constructed by Surdeanu et al. [4]. The resources mainly come from the TAC KBP 2010 and 2011 slot filling shared tasks [25, 26], which contain 183,062 and 3,334 entity pairs for training and testing, respectively. The free texts come from the collection provided by the shared task, which contains approximately 1.5 million documents from a variety of sources, including newswire, blogs, and telephone conversation transcripts. The KB is a snapshot of the English version of Wikipedia. After the DS annotation, we finally obtained 524,777 bags including 950,102 instances for training. For testing, 200 queries (a query means a key entity) from the TAC KBP 2010 and 2011 shared tasks, containing 23 thousand instances, are adopted, of which 40 queries constitute the development set. The relation labels include slots of person (per) and organization (org), and the total number of labels is 41.
6.2. Experiments
We will show the evaluation metrics, experiment results, and some observations from the data in this section.
6.2.1. Evaluation Metrics
P/R Curve. Following previous work, we report the stability of the algorithms by plotting precision/recall (P/R) curves. A P/R curve is generated by computing precision and recall over different proportions of the testing data. Generally, the higher a curve sits in the figure, the more stable the corresponding algorithm is.
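A curve of this kind can be generated by sweeping the fraction of top-ranked predictions kept; a minimal sketch with hypothetical predictions and gold answers:

```python
def pr_curve_points(predictions, gold, proportions):
    """Compute (precision, recall) at several cut-offs of the ranked
    prediction list, as used to draw a P/R curve."""
    ranked = sorted(predictions, key=lambda x: x[1], reverse=True)  # by confidence
    points = []
    for frac in proportions:
        top = ranked[: max(1, int(len(ranked) * frac))]
        tp = sum(1 for item, _ in top if item in gold)
        points.append((tp / len(top), tp / len(gold)))  # (precision, recall)
    return points

# Hypothetical ranked predictions (item, confidence) against a gold set.
preds = [("a", 0.9), ("b", 0.8), ("c", 0.6), ("d", 0.4)]
gold = {"a", "c"}
print(pr_curve_points(preds, gold, [0.5, 1.0]))
```

Sweeping many proportions traces the full curve: small cut-offs probe the high-confidence (low-recall) region, the full list gives the endpoint precision and recall.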
Final Precision, Recall, and F1. The metrics precision, recall, and F1 are used to evaluate the performance of the models on the whole testing dataset. We denote them by Final P, Final R, and Final F1 to distinguish them from the P/R/F1 values computed on parts of the testing data.
To be specific, the testing set has the same data format as the training set, constituted by bags. The above metrics are computed according to the KBP slot filling tasks [31, 32] (on the entity level) rather than by sentential classification.
6.2.2. Expectations for Each Relation
To show the inspiration for proposing bias-dist, we computed the expectations (means) for each relation after initialization (before the training epochs, denoted by mean_b) as well as at the end of training (denoted by mean_e). The values were computed by averaging all the predicting scores of the instances classified to that relation. This process was carried out on MIML-RE to show the instance diversity problem that the algorithm may suffer from. We report the distributions of expectations with an error bar (Figure 2). In the figure, each circle denotes the average predicting expectation over all the training epochs for the relation at that position on the x axis, and the upper and lower errors stand for the maximum and minimum expectations during training. Thus, the uneven curve shows the diversity between relations. We see that the maximum average expectation is about 0.94 (index = 2, per:date_of_birth) but the minimum one is only 0.3 (index = 25, org:members). Since the classifier considers only the absolute predicting confidence (both in training and testing), it is likely that the actual relation label for an instance gets only a small predicting score. Hence, a relative predicting score is necessary to account for this diversity.
Another interesting observation concerns the upper and lower errors. The distance between the upper and lower error for one relation indicates how much the class center and its members change during training. We see that several relations have their predicting expectations almost unchanged during the whole training process. One reasonable explanation is that the instances of these relations are pure enough for classification, so that the labels for these instances hardly change during EM.
6.2.3. Baselines
We compare our models with three baselines: Hoffmann, Mintz++, and MIML-RE. Hoffmann is one of the representative MIML-based algorithms, which uses deterministic-or decisions instead of relation classifiers and also allows relation overlaps [3]. Mintz++ [4] is a modified version of the original Mintz model [1] in which each mention is treated independently and multiple predictions are enabled by applying Noisy-or. Mintz++ significantly outperforms the original Mintz model. As for MIML-RE, we choose the better model (also named MIML-RE in [4]), which extends at-least-one with a modified set of bag-level features.
6.2.4. Results
We first report the P/R curves of our proposed models compared with the three baseline methods mentioned above (Figures 3–5). The curves of the proposed methods are generated after tuning parameters on the development set, aiming at maximizing Final F1.
For comparison, the best curve of MIML-dist-classify is tuned based only on the model generated by the last training epoch. From Figure 3 we read that MIML-dist-classify has higher precision than the baselines in both low and high recall proportions, except that it is worse than Mintz++ in the 0.05–0.1 recall region. However, the precision of Mintz++ drops quickly when recall goes beyond 0.1 and is not as stable as the other methods. Thus we conclude that although MIML-dist-classify is simple, it shows that bias-dist is beneficial to the final results.
MIML-dist performs comparably well to MIML-dist-classify (Figure 4), especially in the low recall region (<0.15). We notice that at recall 0.05–0.1, the precision of MIML-dist can be 5–10 percentage points higher than MIML-RE's. Unlike MIML-dist-classify, the curve of MIML-dist shows better overall performance than Mintz++.
Figure 5 shows the results generated by MIML-reward. We see that MIML-reward gains considerable improvements over MIML-RE in the low-recall region (<0.1) but falls below it as the recall increases. Similar to MIML-dist, MIML-reward performs better than Hoffmann and Mintz++ over almost all recall proportions.
Final P, Final R, and Final F1 are metrics that evaluate the methods on the whole testing set and are also important performance measures in this area. We can read from Table 2 that MIML-dist-classify improves the baselines by nearly 4% on recall while still keeping a relatively high precision. MIML-dist improves both precision and recall and achieves the maximum Final F1 among all the models. MIML-reward has the maximum Final R, but its performance comes at the cost of some precision points. We notice that all three proposed methods enhance the baseline MIML-RE on F1 by over 1.5%. Compared with the other baselines, Hoffmann and Mintz++, Final F1 is significantly improved by the proposed methods.
6.2.5. Case Study
We analyzed the results predicted by the proposed methods and compared them with those predicted by MIML-RE, which is the direct baseline of our work. To make it clear, we show in what kinds of cases our methods make up for the deficiencies of the baseline.
Take one of the testing samples as an example: a sentence is predicted as org:city_of_headquarters with a probability of 0.56 and as the negative class label NIL with a probability of 0.43. Since the center of the positive class (0.82) is far from 0.56, the sentence is no longer predicted as that class after normalization by bias-dist. We also found that several positive predictions were directly replaced by the negative class after adding bias-dist, which contributes to the overall precision.
The effect of bias-reward can be indirectly read from the training bags to some extent, since it depends on the bag-level labels, which cannot be extracted from the testing set. According to the EM algorithm in MIML, the only supervision (weak supervision) is the bag-level labels, and the algorithm enforces that if a label is positive in a bag, it ranks higher than any other label. Hence, if a bag-level label is potentially wrong, the algorithm is likely to fall into local minima. We counted the number of different label assignments in each training epoch for MIML-reward and MIML-RE and found that it is indeed large (e.g., 352,192 different assignments out of 950,102 instances in one epoch). We believe that this large number of differences can easily lead the training algorithm to converge in different directions.
Another finding is that the improvements are distributed rather evenly instead of focusing only on several specific relation labels. This indicates that the biases we propose are reasonable and effective across all relations.
7. Discussion
We see that the proposed models work well on the whole testing dataset (Table 2), but from the P/R curves (Figures 3–5) we realize that the improvements on different proportions of the testing data are not so consistent, especially for MIML-dist and MIML-reward, which perform much better at low recall proportions but become somewhat disappointing as recall increases. We argue that there are several possible reasons: (1) the parameters (i.e., constants or biases) are tuned on the development set to maximize Final F1 but not the P/R curve, so it is possible that other parameter settings that do not perform well on Final F1 generate a better curve (we indeed validated this by changing the parameters); (2) each relation behaves somewhat differently, so a fixed parameter for all relation classes is not quite appropriate (e.g., the penalty and reward parameters in MIML-reward); it is likely that such parameters work well only over the whole testing set rather than on every proportion. We need to further improve the learning algorithm so that more noise can be reduced or discarded. In addition, the hard EM training process suffers from the local minimum problem, and how to tackle it should be further investigated.
Another phenomenon we notice is that MIML-dist and MIML-reward have a lower time cost than MIML-RE, achieving their best results after fewer training epochs. We believe that the biases, especially bias-reward, change the label assignments heavily, so that the algorithm converges much faster. As a result, we improve the time efficiency of the MIML algorithm. MIML-reward needs only 4–5 hours of running time, compared with MIML-RE, whose training may last about 20 hours according to the authors.
Sometimes a simple method can achieve a good result, as with MIML-dist-classify, which only modifies the label assignment process in testing but boosts MIML-RE by 1.7% on F1. Besides, bias-dist can be applied in any probabilistic classification model, and bias-reward can be integrated into any MIL framework that takes a bag as the basic training unit. However, we realize that there is still a long way for weak (distant) supervision to go, since the results are still far behind what supervised learning methods can achieve. Perhaps more work can be done on either feature engineering or parameter selection.
8. Conclusion
In this paper, we propose three methods for distantly supervised relation extraction based on two types of biases. Among them, MIML-dist-classify and MIML-dist aim at tackling the instance diversity problem across different relations by adding bias items either in the testing step or in the training step. MIML-reward is introduced to model the bag-level label noise by adding rewards for wrong negative labels and penalties for wrong positive labels. Experimental results on a landmark dataset validate the effectiveness of the proposed methods, boosting Final F1 by 1.5%–1.7%. In the future, more flexible approaches will be explored to model the noise caused by DS.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported in part by National 863 Program of China (2015AA015405) and NSFCs (National Natural Science Foundation of China) (61402128, 61473101, 61173075, and 61272383). The authors thank the anonymous reviewers for their insightful comments.