Abstract

Relation extraction is the task of extracting semantic relations between entities in a sentence. It is an essential part of many natural language processing tasks such as information extraction, knowledge extraction, question answering, and knowledge base population. The main motivations of this research stem from the lack of a relation extraction dataset for the Persian language and the need to extract knowledge from the growing volume of Persian text for various applications. In this paper, we present “PERLEX,” the first Persian dataset for relation extraction, which is an expert-translated version of the “SemEval-2010-Task-8” dataset. Moreover, this paper addresses Persian relation extraction using state-of-the-art language-agnostic algorithms. We employ six different models for relation extraction on the proposed bilingual dataset: a non-neural model (as the baseline), three neural models, and two deep learning models fed by multilingual BERT contextual word representations. The experiments yield a maximum F1-score of 77.66% (achieved by the BERTEM-MTB method), establishing the state of the art for relation extraction in the Persian language.

1. Introduction

Relation extraction (RE) is the task of identifying semantic relations between text entities and is one of the most crucial tasks in natural language processing (NLP). In RE, entities are string literals that are marked in the sentence, and the goal is to detect a limited number of predefined relationships in the text. For example, suppose that the goal of an information extraction system is to extract from a text the companies located in Tehran. To obtain this information, the RE task must be performed with the “Located In” predicate and “Tehran” as the object of the relationship. Another application of RE is question answering. For instance, answering a question about the cause of an event can be cast as an RE task in which the relationship is “Cause-Effect” and the object is “event.”

Knowledge base population is one of the applications of RE. A knowledge base contains a set of entities and relationships between them. There are many knowledge bases available in English at the moment, such as Yago [1], Freebase [2], DBpedia [3], and Wikidata [4]. However, the first knowledge base in the Persian language was developed recently [5], which was one of the motivations for this research.

At the outset, the first dataset for Persian RE is introduced. Then, five language-agnostic methods are employed for the RE task, and their results are compared with a baseline.

Although there are already standard RE datasets in English, such as SemEval-2010-Task-8 (Multi-Way Classification of Semantic Relations between Pairs of Nominals) [6], TACRED [7], and ACE 2005 [8], there is no dataset available for the Persian language. Thus, in this paper, we present “PERLEX,” which is an expert-translated version of the SemEval-2010-Task-8 dataset.

The evaluation and comparison of the selected RE methods are carried out on the PERLEX dataset. It is reasonable to adapt existing state-of-the-art language-independent RE methods to our target language, i.e., Persian, by reimplementing them. We use five neural RE models, including a model based on convolutional neural networks, two models based on recurrent neural networks, and two BERT-based models.

Before the introduction of BERT [9], the BLSTM-LET model [10] was one of the best models for the RE task. Bidirectional LSTM Networks with Entity-Aware Attention using Latent Entity Typing (BLSTM-LET) was regarded as one of the state-of-the-art language-agnostic approaches to RE. BLSTM-LET outperformed previous state-of-the-art RE methods without using customized linguistic features, relying solely on word embedding features [11].

With the advent of BERT, many NLP tasks have evolved. BERT [9] is a contextual text representation model that was shown to achieve state-of-the-art results on 11 different NLP tasks. Unlike previous word representations, in which each word has a fixed embedding, BERT assigns a word different embeddings in different contexts. At present, the BERTEM-MTB model [12] holds the state of the art for the RE task on both the SemEval-2010-Task-8 and TACRED datasets.

The remainder of this paper is organized as follows. Section 2 provides a summary of the related literature in the RE task. In Section 3, we elaborate on the design of our proposed dataset, PERLEX. Section 4 presents the experimental results along with further analysis of the obtained results. Finally, in Section 5, we conclude this paper and propose possible future lines of extension of this study.

2. Background

In this section, we first provide a brief review of well-known RE datasets. Then, we divide the state-of-the-art RE algorithms into two categories: deep-learning-based and non-deep-learning-based methods. The performance of these models on the PERLEX dataset is reported in Section 4.

2.1. Datasets

RE datasets can be classified into two general groups: distantly supervised datasets and hand-labeled datasets.

In the hand-labeled datasets, the label of each relation mention is determined by human experts. Thus, the creation of such datasets is time-consuming and expensive. Datasets like ACE [8], SemEval-2010-Task-8 [6], TACRED [7], and FewRel [13] belong to the hand-labeled category.

When a knowledge base is available, distant supervision is a reasonable approach to generating labeled datasets. If the knowledge base in question is domain-specific, domain-specific datasets can be produced; similarly, if a general-domain knowledge base is available, general-domain datasets can be generated [14]. In distantly supervised datasets, the label of each relation mention is determined by the relation holding between the mentioned entity pair in a knowledge base. Distant supervision for RE was popularized by the approach of Mintz et al. [15]. Additionally, NYT-10 [16], which aligns entities in the New York Times corpus with entities in Freebase, is the most widely used distantly supervised dataset.

Distantly supervised datasets have some advantages over hand-labeled ones. For example, human experts do not need to carry out the time-consuming annotation process. Furthermore, distantly supervised datasets can reuse the labels already present in knowledge bases, which makes them ideal for knowledge-base-related tasks such as knowledge base population. Their main disadvantage is noisy labels. Many approaches have been proposed to deal with the problem of noisy labels, such as multiple-instance learning [16–18], reinforcement learning [19–21], the use of knowledge base side information [22, 23], and attention mechanisms [24, 25]. In the Persian language, FarsBase [5] notably uses a distantly supervised method to extract triples for the knowledge base.

2.2. Non-Deep-Learning-Based Methods

Before the advent of deep learning models, NLP tasks relied on dedicated NLP tools such as dependency parsers and POS taggers for feature extraction. Such models cannot compete with deep learning models because of the costly nature of their handcrafted features and resources. Moreover, since these features are extracted by NLP tools, errors made by those tools propagate into the features. These methods employ classifiers such as SVM and Maximum Entropy (MaxEnt). In 2010, Rink and Harabagiu [26] achieved the state of the art for RE on the SemEval-2010-Task-8 dataset using an SVM with several handcrafted features and resources, including lexical resources, dependency parses, PropBank, FrameNet, hypernyms, NomLex-Plus, n-grams, and TextRunner [27]. Their model was the best non-deep-learning-based model but was later outperformed by deep-learning-based models such as CNN and RNN methods.

LightRel [28] is another non-deep-learning-based method: a fast and lightweight logistic regression classifier. In this method, a relation mention is represented as a sequence of tokens. The main idea is to transform such sequences into fixed-length vectors in which each token (or word) is represented by only four features: the word itself, its shape (a small, fixed number of character-based features), the word’s cluster id obtained from an external resource, and a fixed-size word embedding. A logistic regression classifier is then trained on these feature vectors to predict the relation classes.
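To make this representation concrete, the following Python sketch assembles the four per-token features; the function and field names are illustrative and are not taken from the LightRel implementation.

```python
# A minimal sketch of a LightRel-style token representation (illustrative names,
# not the authors' code): each token contributes the word itself, a character-shape
# string, a cluster id from an external resource, and a fixed-size word embedding.
import numpy as np

def word_shape(token):
    # Map characters to a coarse shape and collapse repeats,
    # e.g. "Tehran" -> "Xx", "2010" -> "d", "co-founder" -> "x-x"
    shape = "".join("X" if c.isupper() else "x" if c.islower() else
                    "d" if c.isdigit() else c for c in token)
    return "".join(c for i, c in enumerate(shape) if i == 0 or shape[i - 1] != c)

def token_features(token, clusters, embeddings, emb_dim=300):
    # clusters: dict word -> cluster id; embeddings: dict word -> np.ndarray
    return (token,
            word_shape(token),
            clusters.get(token, "<unk>"),
            embeddings.get(token, np.zeros(emb_dim)))

# A relation mention is then encoded as a fixed-length vector by concatenating the
# (one-hot or hashed) sparse features and the embeddings of the tokens in a fixed
# window, and a logistic regression classifier is trained on these vectors.
```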

2.3. Deep-Learning-Based Methods

This section presents and describes some of the essential features of deep-learning-based models used for RE. Each of these models was state of the art at its time, but shortly afterward, it was outperformed by the next model.

Convolutional Neural Networks (CNNs) were initially used in computer vision to extract features from images, but they have recently been applied to various NLP tasks. Zeng et al. [29] used CNNs to extract features from sentences and classified their relations.

Attention-Based Bidirectional Long Short-Term Memory Network (Att-BLSTM) is an RE model proposed by Zhou et al. [30], which surpasses many state-of-the-art models without relying on NLP tools or lexical resources for feature extraction. As its name suggests, the attention layer assigns higher weights to more important words, thereby distinguishing them from less informative ones. For instance, the word “the” is less useful than a word like “caused” for determining the Cause-Effect relation in a sentence.

Bidirectional LSTM with Entity-Aware Attention using Latent Entity Typing (BLSTM-LET), proposed by Lee et al. [10], utilizes the self-attention mechanism introduced by Vaswani et al. [31] and latent entity typing to produce better word representations. A bidirectional LSTM is also used for classification.

BERT-based models have recently been applied in the field of RE and have been able to obtain the best results up to now and outperform previous methods.

In contrast to context-free models such as word2vec, Bidirectional Encoder Representations from Transformers (BERT) [9] is an unsupervised context-dependent language representation model.

Enriching Pretrained Language Model with Entity Information (R-BERT), recently proposed by Wu and He [32], applies BERT to the RE task and was shown to be the best method on the SemEval-2010-Task-8 dataset. To encode a relation between two entities of a sentence using BERT, R-BERT adds the special token “$” before and after the first entity and the special token “#” before and after the second entity in a given sentence. R-BERT also adds the special token “[CLS]” at the beginning of each sentence. The final representation of each relation is obtained by concatenating three hidden state vectors: the hidden state vector corresponding to the [CLS] token and the averages of the hidden state vectors corresponding to the first and second entity tokens.
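The following Python sketch illustrates this input construction and relation representation, assuming a Hugging Face multilingual BERT model; the marker-locating logic and variable names are our own simplifications, not the original R-BERT code.

```python
# A minimal sketch of R-BERT-style entity marking and relation representation.
# Assumes "$" and "#" are single tokens in the multilingual BERT vocabulary
# (otherwise they would need to be added via tokenizer.add_tokens).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

def mark_entities(tokens, e1, e2):
    # e1, e2: (start, end) token index ranges, end exclusive; e1 assumed to precede e2
    (s1, t1), (s2, t2) = e1, e2
    return (tokens[:s1] + ["$"] + tokens[s1:t1] + ["$"] +
            tokens[t1:s2] + ["#"] + tokens[s2:t2] + ["#"] + tokens[t2:])

def relation_representation(tokens, e1, e2):
    enc = tokenizer(" ".join(mark_entities(tokens, e1, e2)), return_tensors="pt")
    hidden = model(**enc).last_hidden_state[0]              # (seq_len, 768)
    ids = enc["input_ids"][0].tolist()
    dollars = [i for i, t in enumerate(ids) if t == tokenizer.convert_tokens_to_ids("$")]
    hashes = [i for i, t in enumerate(ids) if t == tokenizer.convert_tokens_to_ids("#")]
    h_cls = hidden[0]                                        # hidden state of [CLS]
    h_e1 = hidden[dollars[0] + 1:dollars[1]].mean(dim=0)     # average over entity-1 pieces
    h_e2 = hidden[hashes[0] + 1:hashes[1]].mean(dim=0)       # average over entity-2 pieces
    return torch.cat([h_cls, h_e1, h_e2])                    # 3 * 768 -> softmax classifier
```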

BERTEM with Matching the Blanks (BERTEM-MTB), proposed by Soares et al. [12], is the most recent approach and the current state of the art; it is a BERT-based model very similar to R-BERT. Like R-BERT, BERTEM-MTB adds special tokens before and after the entities; unlike R-BERT, the tokens before and after each entity are distinct. The relation representation in this model is the concatenation of the hidden states of the special tokens preceding each entity, and these representations are then used to classify the relation of each sentence. The method also adds a further training step for the relation representations in BERT, in which entities are replaced with the “[BLANK]” special token in sentences that share the same entity pair.
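A minimal sketch of the corresponding input construction is given below; the marker names ([E1], [/E1], [E2], [/E2]) follow the convention commonly associated with this method, and the blanking probability is an assumption rather than a value taken from the original work.

```python
# A sketch of BERTEM-MTB-style entity marking and blanking (treat the details as
# assumptions, not the exact implementation). The marker tokens would be added to
# the BERT tokenizer vocabulary before fine-tuning.
import random

E1, E1_END, E2, E2_END, BLANK = "[E1]", "[/E1]", "[E2]", "[/E2]", "[BLANK]"

def mark(tokens, e1, e2, blank_prob=0.0):
    # e1, e2: (start, end) token ranges, end exclusive; e1 assumed to precede e2
    (s1, t1), (s2, t2) = e1, e2
    ent1 = [BLANK] if random.random() < blank_prob else tokens[s1:t1]
    ent2 = [BLANK] if random.random() < blank_prob else tokens[s2:t2]
    return (tokens[:s1] + [E1] + ent1 + [E1_END] +
            tokens[t1:s2] + [E2] + ent2 + [E2_END] + tokens[t2:])

# During "matching the blanks" training, blank_prob > 0 and sentence pairs that share
# the same entity pair are pushed toward similar relation representations; during
# fine-tuning on PERLEX, blank_prob = 0 and the hidden states at [E1] and [E2] are
# concatenated and fed to a softmax classifier.
```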

We summarize all the models described in this section in Table 1.

3. Construction of the PERLEX Bilingual Dataset

Unlike English and other resource-rich languages, the Persian language has no proper assets available for RE. In English, SemEval-2010-Task-8 is one of the most well-known datasets for RE and has been utilized in many studies. This dataset contains 10717 example sentences and their corresponding relation types, of which 8000 are for training and 2717 are for testing. In this challenge, each relation extraction algorithm is asked to identify one of the nine predefined relationships for the pair of entities specified in each sentence. The nine predefined relations are “Cause-Effect,” “Component-Whole,” “Content-Container,” “Entity-Destination,” “Entity-Origin,” “Instrument-Agency,” “Member-Collection,” “Message-Topic,” and “Product-Producer,” plus the relation “Other” for cases in which there is no confirmed relationship between the two entities. The frequency of each relationship in the corpus is shown in Table 2. Note that the proportion of each relationship is exactly the same as in the SemEval-2010-Task-8 dataset. For each sentence in the original (English) corpus, e1 always appears before e2, while their relationship has a direction and can be from e1 to e2 or vice versa; for example, both Cause-Effect(e1, e2) and Cause-Effect(e2, e1) may appear in the corpus.

Each sentence in PERLEX is a Persian translation of the original sentence. Because Persian grammar differs from English grammar, the order of e1 and e2 is not preserved in some cases in PERLEX. In other words, unlike the original corpus, in which e1 always precedes e2, PERLEX contains some sentences in which e2 appears before e1. The last column of Table 2 shows the number of such sentences. Alternatively, we could rename e2 to e1 and e1 to e2 in these sentences and reverse the direction of the relationship; however, this would cause mismatches between the relationships in PERLEX and those in the corresponding sentences of the original corpus.

PERLEX is a parallel translation of all examples in the SemEval-2010-Task-8 dataset. With this approach, the cost of sentence selection is eliminated. Moreover, since the dataset is constructed from an established and widely utilized dataset, the results of RE methods on PERLEX can be implicitly compared with those on the English dataset. Table 3 presents the statistics of PERLEX and the original corpus. Owing to translation, some of the statistics differ between the two corpora. Sentences in PERLEX are longer than those in the SemEval-2010-Task-8 corpus. Moreover, because the Persian translation of a phrase is in some cases wordier than the original English phrase, the total number of entity words in PERLEX is greater than the total number of entity words in the original corpus.

4. Experiments and Analyses of Relation Extraction

This section reports experimental settings and classification results of six different models: baseline, CNN, Att-BLSTM, BLSTM-LET, R-BERT, and BERTEM-MTB.

4.1. Experimental Setup

In PERLEX, we adopted the same nine relation classes as in the SemEval-2010-Task-8 dataset, described in Section 3. Each class has two variations specifying the placement order of the subject and object in the sentence. For instance, the Cause-Effect class has two variants: Cause-Effect (e1, e2) and Cause-Effect (e2, e1).

Generally, there are three ways to evaluate the classification results:
(1) taking into account both variations of each class (18 classes in total);
(2) using only one variation of each class and considering directionality;
(3) using only one variation of each class and ignoring directionality.

Moreover, there are two ways to measure the F1-score, namely, micro-averaging and macro-averaging. Additionally, the pairs of entities that do not fall into any of the nine main classes are labeled as “Other” in the dataset and do not participate in the evaluations. We adopt the official evaluation method of the SemEval-2010-Task-8 dataset, namely, (9 + 1)-way classification with macro-averaged F1-score, taking directionality into account. “(9 + 1)-way” means that the nine main classes plus “Other” are used in training and testing, but “Other” is ignored when calculating the F1-scores. In all non-BERT-based experiments, we use 300-dimensional word embeddings pretrained by Poostchi et al. [33] and hold out 10% of the training set as the development set.
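For clarity, the following Python sketch approximates this scoring scheme; the label-string format (e.g., "Cause-Effect(e1,e2)") is an assumption, and the code is not a re-implementation of the official SemEval scorer.

```python
# A sketch of the adopted (9 + 1)-way evaluation: macro-averaged F1 over the nine
# relation classes with directionality taken into account and "Other" excluded
# from the averaging (but still present among the predictions).
RELATIONS = ["Cause-Effect", "Component-Whole", "Content-Container",
             "Entity-Destination", "Entity-Origin", "Instrument-Agency",
             "Member-Collection", "Message-Topic", "Product-Producer"]

def official_macro_f1(y_true, y_pred, relations=RELATIONS):
    f1s = []
    for rel in relations:
        # A prediction counts as correct only if both the relation and its direction match.
        correct = sum(t == p and t.startswith(rel) for t, p in zip(y_true, y_pred))
        predicted = sum(p.startswith(rel) for p in y_pred)
        gold = sum(t.startswith(rel) for t in y_true)
        precision = correct / predicted if predicted else 0.0
        recall = correct / gold if gold else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```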

4.2. Overall Results

Figure 1 illustrates the official F1-scores for each model. As expected, the results are lower for the Persian language than for English. This gap is due to the many challenges of processing Persian, which is a free-word-order and more ambiguity-prone language. Persian has several specific characteristics that give rise to such challenges, including the following [34]:
(i) exceptions in word order;
(ii) exceptions in the agreement between tense and aspect of verbs;
(iii) being either verb-final or mostly head-initial;
(iv) being a derivational and generative language;
(v) short vowels being unwritten in most cases;
(vi) lack of definite articles for nouns;
(vii) no female/male distinction for pronouns;
(viii) semantic symmetry and omitted phrases;
(ix) no rule for uncountable nouns appearing in the singular form.

Additionally, there are many challenges in Persian language understanding [35]. As in English, the performance of BERTEM-MTB surpasses that of the other five methods in the Persian language. Moreover, BERTEM-MTB is the superior method in all nine classes in Persian. Note that the word embeddings and BERT models used in the Persian experiments are pretrained on smaller text corpora than those used for pretraining in English.

4.2.1. Baseline

As the baseline for the Persian language, we train a logistic regression classifier (LRC) with the L2R_LR solver on the following features: word IDs (unique IDs for each word in the dataset), part-of-speech (POS) tags for each pair of entities, the words between the two entities, the POS tags of the words between the two entities, the dependency relations and their directions between the two entities, and a bag of words.
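As an illustration, the following Python sketch shows how such features could be collected and fed to a logistic regression classifier; scikit-learn is used as a stand-in for the LIBLINEAR L2R_LR solver, and the field names in the input dictionary are hypothetical.

```python
# A minimal sketch of the baseline: per-mention features collected into a dictionary
# and classified with L2-regularized logistic regression.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def mention_features(sent):
    # sent: dict with "tokens", "pos", entity spans, and a precomputed dependency path
    s1, t1, s2, t2 = sent["e1_start"], sent["e1_end"], sent["e2_start"], sent["e2_end"]
    feats = {
        "e1": " ".join(sent["tokens"][s1:t1]),
        "e2": " ".join(sent["tokens"][s2:t2]),
        "e1_pos": "_".join(sent["pos"][s1:t1]),
        "e2_pos": "_".join(sent["pos"][s2:t2]),
        "between_words": " ".join(sent["tokens"][t1:s2]),
        "between_pos": "_".join(sent["pos"][t1:s2]),
        "dep_path": sent["dep_path"],  # dependency relations and directions between entities
    }
    for w in sent["tokens"]:           # bag-of-words features
        feats["bow=" + w] = 1.0
    return feats

clf = make_pipeline(DictVectorizer(), LogisticRegression(penalty="l2", max_iter=1000))
# clf.fit([mention_features(s) for s in train_sentences], train_labels)
```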

Based on the obtained results, the official F1-score of logistic regression on the PERLEX dataset is 57.42%. It should be noted that, for English, we report the results of the logistic regression baseline of LightRel [28].

4.2.2. CNN

For the CNN model, we use four different kernel lengths (2, 3, 4, and 5) and concatenate the outputs of these kernels. We set the number of kernels for each length to 128. We also use dropout [36] and L2 regularization to prevent overfitting. All of the hyperparameters used in this experiment are reported in Table 4. Based on the obtained results, the official F1-score of CNN on the PERLEX dataset is 69.28%.
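A minimal PyTorch sketch of such a CNN classifier, using only the hyperparameters stated here (kernel lengths 2–5 and 128 kernels per length) and simplifying everything else (e.g., position features and the exact class count are assumptions), is as follows.

```python
# A sketch of a CNN relation classifier: four convolution branches, max-pooling,
# dropout, and a softmax output layer; L2 regularization is applied as weight decay
# in the optimizer.
import torch
import torch.nn as nn

class CNNRelationClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, num_classes=19, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # initialized from pretrained vectors
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, 128, kernel_size=k) for k in (2, 3, 4, 5)]
        )
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(4 * 128, num_classes)       # 18 directed classes + "Other"

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))  # (batch, 512)
        return self.out(features)
```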

4.2.3. Att-BLSTM

We use one layer of bidirectional LSTM and set the hidden state size to 100. To prevent overfitting, we use L2 regularization, recurrent dropout, and regular dropout. All of the hyperparameters used in this experiment are reported in Table 5. Based on the obtained results, the official F1-score of Att-BLSTM on the PERLEX dataset is 69.61%.
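The following PyTorch sketch shows one possible wiring of this architecture with a hidden state size of 100; it concatenates the forward and backward LSTM outputs, and the remaining implementation details (recurrent dropout, L2 weight decay) are left to the optimizer and are assumptions here.

```python
# A sketch of a word-level attention BLSTM classifier in the spirit of Att-BLSTM.
import torch
import torch.nn as nn

class AttBLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=100, num_classes=19, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.att_vector = nn.Parameter(torch.randn(2 * hidden))  # learned attention query
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))           # (batch, seq_len, 2*hidden)
        scores = torch.tanh(h) @ self.att_vector          # (batch, seq_len)
        alpha = torch.softmax(scores, dim=1).unsqueeze(2)  # attention weights per word
        sentence = (alpha * h).sum(dim=1)                  # attention-weighted sentence vector
        return self.out(self.dropout(torch.tanh(sentence)))
```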

4.2.4. BLSTM-LET

We use four attention heads in the multi-head attention layer and set the layer size to 50 for each head. The hidden state size of the LSTM is set to 300. As in the previous model, recurrent and regular dropout, as well as L2 regularization, are used. All of the hyperparameters used in this experiment are reported in Table 6. Based on the obtained results, the official F1-score of BLSTM-LET on the PERLEX dataset is 70.79%.
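The sketch below only wires up the stated dimensions (four attention heads of size 50, i.e., a 200-dimensional self-attention layer, followed by a BLSTM with a hidden state of 300); the entity-aware attention and latent entity typing components of Lee et al. are omitted, so it is a structural illustration rather than the model itself.

```python
# A structural sketch of the stated BLSTM-LET dimensions in PyTorch.
import torch
import torch.nn as nn

emb_dim, att_dim, lstm_hidden, num_heads = 300, 4 * 50, 300, 4

proj = nn.Linear(emb_dim, att_dim)                         # project embeddings to 200 dims
self_attention = nn.MultiheadAttention(att_dim, num_heads, batch_first=True)
blstm = nn.LSTM(att_dim, lstm_hidden, batch_first=True, bidirectional=True)

x = torch.randn(8, 40, emb_dim)                            # dummy batch of 40-token sentences
h = proj(x)
h, _ = self_attention(h, h, h)                             # 4-head self-attention, size 50 per head
h, _ = blstm(h)                                            # (8, 40, 600) contextual representations
```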

4.2.5. R-BERT

We fine-tune the base BERT pretrained model for this method. The other hyperparameters can be seen in Table 7. Based on the obtained results, the official F1-score of R-BERT on the PERLEX dataset is 75.31%.

4.2.6. BERTEM-MTB

We fine-tune the base BERT pretrained model for this method. The other hyperparameters can be seen in Table 7. Based on the obtained results, the official F1-score of BERTEM-MTB on the PERLEX dataset is 77.66%.

4.3. Results per Class

The final results for individual classes are presented in Table 8. As can be seen, the F1 measure of the BERTEM-MTB model is higher than that of the other models in all classes. In almost all classes, the F1 value increases from its lowest point for the baseline through the neural models to the BERT-based models. However, the models behave differently on the Instrument-Agency class, where the baseline outperforms all models except BERTEM-MTB. This is because the baseline model uses the dependency relations and their directions between the two entities, whereas the other models do not use this information. Sentences containing the Instrument-Agency relation are very similar in terms of their dependency trees; consequently, the baseline model, which uses dependency-tree information, learns to detect this kind of relationship by observing a similar pattern. Note that some studies have targeted domain-specific texts [37, 38]. Although SemEval contains general-domain texts, Jackson et al. [39] evaluated BioBERT on this corpus. However, such models are specifically trained on biomedical and scientific texts in English and cannot be used directly on Persian texts. Moreover, to the best of our knowledge, no similar model is available for the Persian language.

5. Conclusion

In this paper, the relation extraction (RE) task in the Persian language was conducted for the first time. For this purpose, we first proposed a bilingual version of the SemEval-2010-Task-8 dataset, dubbed PERLEX. Then, having investigated the state-of-the-art language-agnostic RE methods in English, we adapted and customized several of them for the Persian language. Moreover, a logistic regression algorithm with syntactic and semantic features was employed as a baseline. Although we used the best open-source tools for POS tagging and dependency parsing in Persian for the baseline method, in the future, we will develop new tools for these tasks based on the state-of-the-art methods designed for other languages. The experimental results further confirmed the superiority of the BERT-based models over the baseline and the other deep learning models and showed performance comparable to that of similar state-of-the-art methods in English. However, owing to particular challenges in processing the Persian language, such as its free word order and more ambiguity-prone nature compared with English, the performance of the methods customized for Persian was lower than their performance in English.

As future work, more accurate Persian word embeddings can be developed and applied to improve the results of the non-BERT-based models. Moreover, by designing training steps tailored to the features of the Persian language, a novel BERT-based RE model can be proposed for Persian.

Data Availability

The PERLEX dataset used to support the findings of this study has been deposited in the following repository: http://farsbase.net/PERLEX.html.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

We acknowledge the active collaboration of Dr. Sayyed Ali Hossayni and Mr. Kamyar Darvishi, who kindly collaborated with us during this study.