Abstract

Extracting the relations between medical concepts is very valuable in the medical domain. Scientists need to extract relevant information and semantic relations between medical concepts, including protein and protein, gene and protein, drug and drug, and drug and disease. These relations can be extracted from the biomedical literature available in various databases. This study examines the extraction of semantic relations that can occur between diseases and drugs. The findings will help specialists make good decisions when administering a medication to a patient and will allow them to stay up to date in their field. The objective of this work is to identify different features related to drugs and diseases from medical texts by applying Natural Language Processing (NLP) techniques and the UMLS ontology. A Support Vector Machine classifier uses these features to extract valuable semantic relationships among text entities. The main contribution of this research is the combination of a suggested NLP technique, which takes advantage of the UMLS ontology and enables the extraction of correct and adequate features (frequency features, lexical features, morphological features, syntactic features, and semantic features), with Support Vector Machines using a polynomial kernel function. These features are used to identify the relations between drug and disease. The proposed approach was evaluated using a standard corpus extracted from MEDLINE. The approach considerably improves performance and outperforms similar works, especially in the F-score for the most important relation, “cure,” which is equal to 98.19%. The accuracy is better than those in all the existing works for all the relations.

1. Introduction

Biomedical information is abundantly available in journal articles and research studies in various databases, such as MEDLINE, PubMed, and Medscape. Scientists need to automatically extract relevant information, for instance, semantic relations between medical entities, from these databases. For example, scientists need to know which drug cures a given disease or which diseases are the side effects of a given drug. These relations can help specialists update their knowledge and improve their expertise in their field. These relations can be discovered from a variety of texts in biomedical literature.

Various methods have been applied to extract relations from the biomedical literature [1–5]. The relationship extraction studies have focused on specific types of relations, including interactions between protein and gene, protein and protein [6], drug and disease, and drug and drug [7].

Therefore, the objective of this study is to contribute to a better understanding of the drug-disease relation. This paper explores the extraction of drug-disease relations from biomedical texts. It proposes an approach for extracting semantic relations between biomedical entities (drug and disease) that exploits the specific features of these entities, which are discovered by using a suggested NLP technique and the UMLS ontology. These extracted features form the input to the Support Vector Machine (SVM) classifier, which classifies the relations between these entities.

2.1. Extraction of Relations between Medical Concepts

Many different biomedical text relation extraction strategies have been proposed to discover relationships, including protein and protein, gene and gene, gene and protein, gene and disease, gene and drug, and drug and drug.

The works about protein-protein relation extraction are generally based on the identification of protein features (lexical features) rather than similarity methods [8–10] or classification methods [11], which are applied to discover the interaction between pairs of proteins.

For gene-gene relation extraction, the researchers focused on the use of ontologies, such as Gene Ontology [12], or statistical models [13, 14].

To identify gene-protein relations, various works have proposed the use of machine learning and NLP techniques [15, 16, 17].

To discover gene-disease relations, classification models that support these relationships were built [18]. In other works, NLP tools and ontologies were exploited [19–21].

For gene-drug relation extraction, various works recommended text mining approaches supported by classification models [22].

In the domain of drug-drug relation extraction, various works proposed approaches based on NLP techniques [23, 24]. Recent remarkable contributions focused on neural networks [25–28].

To discover drug-side effect relation, dictionaries and ontologies were built from the Unified Medical Language System (UMLS) Metathesaurus [29].

2.2. Drug-Disease Relation Extraction

Discovering the relationship between drugs and diseases plays a crucial role in medical domain development. The huge medical literature sources allowed the automatic identification of significant relations hidden in free text. Various computational methods have been proposed to discover the relations between drugs and diseases.

Rosario and Hearst [30] proposed a method that distinguishes seven relations between two semantic entities, “treatment” and “disease.” Five graphical models and a neural network have been presented. Seven relations were detected, but only three relations, namely, cure, prevent, and side effect, were represented with accuracy levels of 92.6, 38.5, and 20, respectively.

Abacha and Zweigenbaum [31] suggested a hybrid approach associating a pattern-based method and a statistical-based learning method (linear SVM) to extract two relations between a disease and a treatment. F-scores were given as effectiveness measure, and they are 95 and 15.15 for cure and prevent, respectively.

Frunza et al. [32, 33] have applied a machine learning technique to extract diseases and treatments from medical papers. Six classification algorithms were used, including probabilistic models, adaptive learning models, decision-based models, and linear classifiers like SVM. Three data representation techniques were adopted to extract treatment relations as follows: Bag-of-Word, NLP, and medical concepts. The effectiveness measures of the three detected relations, namely, cure, prevent, and side effect, are 93.6, 76.5, and 50, respectively.

Suchitra and Sudah [34] used NLP and machine learning techniques to extract relations between drugs and treatments. Rule-based approaches, statistical models, and logic techniques were used for cooccurrence analysis. A Bloom filter was applied to remove unwanted data. Naive Bayes, SVM, inductive logic techniques, and statistical models were used. The obtained results had an overall F-score of 90.3 and an overall accuracy of 90 for the three extracted relations, namely, cure, prevent, and side effect.

Muzaffar et al. [35] used the Unified Medical Language System and ranking algorithms to rank verb phrases. The relations between drugs and treatments were classified using SVM and Naive Bayes techniques. Three relations were detected, namely, cure, prevent, and side effect. The F-scores were 98.05, 93.55, and 88.89 for cure, prevent, and side effect, respectively. The accuracies were 96.1, 97.4, and 96.4 for cure, prevent, and side effect relations, respectively.

Wang et al. [36] suggested a pattern-based relationship extraction method to extract two types of relations between drugs and diseases, namely, treatment (a drug treats/cures a disease) and inducement (the side effect of a drug). They created a drug and disease lexicon from the UMLS and used drug-disease pair seeds for the pattern-based method to extract the relations between drugs and diseases. The reported results showed an F-score of 90.49 for cure relation and an F-score of 87.56 for the side effect relation.

Some researchers proposed a relation extraction between three concepts, namely, drug, disease, and protein [37] or drug, disease, and gene [38]. Other researchers have focused on a particular disease or a particular drug when looking for relations, for example, the extraction of treatments for psoriasis [39], the association between diabetes and the treatments for diabetes [40], and the effect of estrogen replacement therapy on Alzheimer’s disease and Parkinson’s disease [41].

Table 1 shows a comparison between the most important works in the field of relation extraction between drugs and diseases.

The existing works based on drug-disease relationship did not take into account many important features about drugs and diseases. These features (frequency features, lexical features, morphological features, syntactic features, and semantic features) can be very useful for the detection of good and valuable relations.

To overcome this issue, we proposed a novel methodology that discovers the drug-disease association based on Natural Language Processing strategy with the help of the UMLS ontology and a machine learning technique, such as the SVM model, for automatic relations extraction from biomedical texts.

3. The Proposed Approach

The methodology adopted in this study was developed from studies and concerns related to relation extraction from medical literature, text mining, and machine learning. The proposed approach entailed three main components, namely, preprocessing, feature extraction, and relation extraction. The first component started with free-text sentences, performed a preprocessing task, and output a set of annotated words. The second component identified various features of the sentences, which later helped the relation extraction. The third component fed the output of the previous component into a machine learning classifier, thereby completing the identification of associations between drug and disease entities. The architecture of the proposed approach, named “DDRel,” is shown in Figure 1.

The steps outlined in Figure 1 are discussed in detail in the following subsections.

3.1. Preprocessing

Preprocessing, the first step of the approach, was based on Natural Language Processing (NLP) techniques. It eliminated noisy data and output all words in the medical texts related to the biomedical concepts (treatments and diseases). It included four major stages, namely, (i) sentence splitting, (ii) tokenization, (iii) part-of-speech tagging, and (iv) semantic annotation.

3.1.1. Sentence Splitting

This step divided texts into smaller units and assigned an identifier to each unit. Texts were segmented into sentences using the punctuation markers “.”, “?”, and “!”. In this step, the ANNIE English Sentence Splitter, a cascade of finite-state transducers, was used to split the text into sentences, as shown in Figures 2(a) and 2(b).
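The splitting rule described above can be illustrated with a minimal Python sketch. It is not the ANNIE splitter: it assumes a simple regular expression that breaks after ".", "?", or "!" when followed by whitespace, and the function name `split_sentences` and the sample text are hypothetical.

```python
import re

def split_sentences(text):
    """Split text after '.', '?' or '!' followed by whitespace.

    Returns a list of (identifier, sentence) pairs, numbered from 1.
    This is a simplification; it does not handle abbreviations or decimals.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [(i, s) for i, s in enumerate((p for p in parts if p), start=1)]

# Hypothetical input: the first sentence is taken from the relation examples.
sentences = split_sentences("Statins for the prevention of stroke. Is it effective? Yes!")
for identifier, sentence in sentences:
    print(identifier, sentence)
```

A rule this simple breaks on abbreviations such as "e.g." or on decimal numbers, which is one reason to prefer a dedicated splitter in practice.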

3.1.2. Tokenization

After the sentence splitting, each sentence was segmented into tokens. Tokenization is the segmentation of sentences into a sequence of words using nonalphabetic characters, such as a line break, space, or punctuation characters. The result of tokenization was presented as an XML file that gathers tokens associated with the following: (i) the sentence identifier (id-sentence); (ii) the token identifier (id); (iii) the token length (length); (iv) the token orthography (orth); (v) the token kind (kind); and (vi) the token (string). The display of the XML file for the user is presented in Figure 3(a).
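As a rough illustration of the token record described above, the following Python sketch produces the six listed fields for each token. It is an assumption-laden stand-in, not the tokenizer used in this work: the regular expression, the helper names `orth_of` and `kind_of`, and the value sets for `orth` and `kind` are chosen here for illustration only.

```python
import re

def orth_of(token):
    """Orthography of an alphabetic token (value names are illustrative)."""
    if token.islower():
        return "lowercase"
    if token.isupper():
        return "allCaps"
    if token[0].isupper() and token[1:].islower():
        return "upperInitial"
    return "mixedCaps"

def kind_of(token):
    """Coarse token kind (value names are illustrative)."""
    if token.isalpha():
        return "word"
    if token.isdigit():
        return "number"
    return "punctuation"

def tokenize(sentence_id, sentence):
    """Split on nonalphabetic characters and describe each token."""
    strings = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", sentence)
    return [
        {
            "id-sentence": sentence_id,
            "id": i,
            "length": len(s),
            "orth": orth_of(s) if s.isalpha() else None,
            "kind": kind_of(s),
            "string": s,
        }
        for i, s in enumerate(strings, start=1)
    ]

tokens = tokenize(1, "Statins for the prevention of stroke.")
```

Each dictionary corresponds to one token element of the XML output; serializing it to XML is left out to keep the sketch short.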

3.1.3. Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the method of labeling words in a text according to their grammatical function, definition, and context, such as noun (NN), verb (VB), adjective (JJ), conjunction (CC), and proper noun (NNP). The algorithm of the ANNIE POS Tagger has been implemented. The output is an XML file in which each word is associated with its grammatical function. The display of the XML file for the user is presented in Figure 3(b).

3.1.4. Semantic Annotation

This step involved the extraction of named entities of drugs and diseases. It was difficult to extract drugs and diseases for many reasons. Each medical concept can be identified by several synonyms, different terms, and abbreviations. Moreover, simple dictionaries cannot be used for new drugs and diseases in our context.

The MetaMap system was configured to detect the concepts of the UMLS Metathesaurus hidden in the biomedical texts. The UMLS is a medical ontology that originated from the National Library of Medicine. The output of this step was the identification of concepts as Concept Id, Concept Name, Preferred Name, and Semantic Type. The most important information extracted from this step was the Semantic Type, which is defined in the UMLS. This knowledge helps determine whether a concept is a drug or a disease. Figure 4 shows the results of the semantic annotation.

3.2. Feature Extraction

Feature extraction was the second step of the proposed approach. It defined features as combinations of some characteristics and was inspired by Rosario and Hearst [30] in relation to the semantic type. The features for each word in a sentence were as follows: the semantic types, such as Word, Part of Speech (POS), and Phrase Constituent, belong to the same chunk as in the previous work; the MeSH mapping of the words; Domain Knowledge; and morphological features.

In this work, unlike that of Rosario and Hearst [30], the features were built for each sentence instead of each token.

Moreover, new kinds of features were created, and these were assumed to be more suitable for extracting drug-disease relations.

In this work, new features were proposed to extract drug-disease relations, including the following: (i) frequency features, (ii) lexical features, (iii) morphological features, (iv) syntactic features, and (v) semantic features.

3.2.1. Frequency Features

The frequency features represented the following:
(i) The number of named entities (NEs) present in the sentence (named entities are drugs and diseases)
(ii) The count of nouns indicating a drug that exist in the sentence
(iii) The count of nouns indicating diseases that exist in the sentence
(iv) The count of verbs establishing the relation among every two NEs in the sentence
(v) The word count among every two NEs in the sentence
(vi) The count of lemmas bonding each two NEs in the sentence
(vii) The Bag-of-Word: word groups in a sentence with the number of occurrences of each word
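Several of the frequency features listed above can be sketched in a few lines of Python. The sketch assumes that each token already carries a POS tag and an entity label (TREAT, DIS, or none); the tags below were assigned by hand for the example sentence used later in this section, and the function names and feature keys are hypothetical. It counts entities rather than individual nouns, which is a simplification of items (ii) and (iii).

```python
from collections import Counter
from itertools import groupby

# (word, POS tag, entity label); tags and labels are hand-assigned assumptions.
SENTENCE = [
    ("Preliminary", "JJ", None), ("evidence", "NN", None),
    ("suggests", "VBZ", None), ("that", "IN", None),
    ("interferons", "NNS", "TREAT"), ("beta", "NN", "TREAT"),
    ("may", "MD", None), ("also", "RB", None), ("induce", "VB", None),
    ("regression", "NN", None), ("of", "IN", None),
    ("metastatic", "JJ", "DIS"), ("renal", "JJ", "DIS"),
    ("cell", "NN", "DIS"), ("carcinoma", "NN", "DIS"),
]

def entity_spans(tagged):
    """Return (label, start, end) for each maximal run of tokens sharing a label.

    Adjacent entities with the same label would be merged; acceptable for a sketch.
    """
    spans, index = [], 0
    for label, run in groupby(tagged, key=lambda t: t[2]):
        length = len(list(run))
        if label is not None:
            spans.append((label, index, index + length))
        index += length
    return spans

def frequency_features(tagged):
    """Compute a subset of the frequency features for one sentence."""
    spans = entity_spans(tagged)
    features = {
        "ne_count": len(spans),
        "treat_count": sum(1 for s in spans if s[0] == "TREAT"),
        "dis_count": sum(1 for s in spans if s[0] == "DIS"),
        "bag_of_words": Counter(w.lower() for w, _, _ in tagged),
    }
    if len(spans) >= 2:
        # Tokens strictly between the first two named entities.
        between = tagged[spans[0][2]:spans[1][1]]
        features["words_between"] = len(between)
        features["verbs_between"] = sum(1 for _, pos, _ in between if pos.startswith("VB"))
    return features

features = frequency_features(SENTENCE)
```

Under these hand-assigned tags, the modal "may" (MD) is not counted as a verb, so only "induce" contributes to `verbs_between`.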

3.2.2. Lexical Features

Lexical features were as follows:
(i) Order of words present in NE
(ii) Order of words present in every two NEs
(iii) Sequence of “n” words preceding every NE
(iv) Sequence of “n” words after every NE

3.2.3. Morphological Features

In this step, morphological features were extracted and included the following:
(i) Lemmas order of the words among every two NEs
(ii) Lemmas order of the “n” words preceding every NE
(iii) Lemmas order of the “n” words after every NE

3.2.4. Syntactic Features

These features concern the POS of each NE and include the following:
(i) POS order of words among every two NEs
(ii) POS order of “n” words preceding each NE
(iii) POS order of “n” words after every NE
(iv) Verb sequence among every two NEs
(v) First verb preceding every NE
(vi) First verb after every NE
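The window-based lexical and syntactic features above (the "n" words or POS tags before and after an NE, and the nearest verbs) can be sketched as follows. This is an illustration under assumptions: the POS tags and entity spans are hand-assigned, `context_features` is a hypothetical helper, "first verb preceding" is read as the nearest verb to the left of the entity, and n = 2 is an arbitrary choice.

```python
# (word, POS tag) pairs for the example sentence; tags are hand-assigned assumptions.
TAGGED = [
    ("Preliminary", "JJ"), ("evidence", "NN"), ("suggests", "VBZ"), ("that", "IN"),
    ("interferons", "NNS"), ("beta", "NN"), ("may", "MD"), ("also", "RB"),
    ("induce", "VB"), ("regression", "NN"), ("of", "IN"), ("metastatic", "JJ"),
    ("renal", "JJ"), ("cell", "NN"), ("carcinoma", "NN"),
]
# Token index ranges (start, end) of the two named entities, end exclusive.
TREAT_SPAN = (4, 6)
DIS_SPAN = (11, 15)

def context_features(tagged, span, n=2):
    """Words, POS tags, and nearest verbs around one named entity."""
    start, end = span
    before = tagged[max(0, start - n):start]
    after = tagged[end:end + n]
    verbs_before = [w for w, pos in tagged[:start] if pos.startswith("VB")]
    verbs_after = [w for w, pos in tagged[end:] if pos.startswith("VB")]
    return {
        "words_before": [w for w, _ in before],
        "words_after": [w for w, _ in after],
        "pos_before": [p for _, p in before],
        "pos_after": [p for _, p in after],
        "first_verb_before": verbs_before[-1] if verbs_before else None,
        "first_verb_after": verbs_after[0] if verbs_after else None,
    }

treat_context = context_features(TAGGED, TREAT_SPAN)
dis_context = context_features(TAGGED, DIS_SPAN)
# POS order of the words between the two NEs.
pos_between = [pos for _, pos in TAGGED[TREAT_SPAN[1]:DIS_SPAN[0]]]
```

The morphological features follow the same pattern with lemmas in place of surface words, which would require a lemmatizer that is not included here.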

3.2.5. Semantic Features

The purpose of this step is to extract the semantic types of the combinations of words in the sentence. The values of these semantic types are DIS (DISease) and TREAT (TREATment).
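One way to picture this step is as a lookup from the UMLS Semantic Type returned by the semantic annotation to one of the two labels. The Python sketch below is purely illustrative: the source does not list which semantic types were mapped to DIS or TREAT, so the table `SEMANTIC_TYPE_TO_LABEL`, the helper `semantic_label`, and the concept records are assumptions.

```python
# Assumed mapping from UMLS semantic type names to the two labels.
# The selection of types is hypothetical and not taken from the source.
SEMANTIC_TYPE_TO_LABEL = {
    "Disease or Syndrome": "DIS",
    "Neoplastic Process": "DIS",
    "Pharmacologic Substance": "TREAT",
    "Therapeutic or Preventive Procedure": "TREAT",
}

def semantic_label(semantic_type):
    """Return 'DIS', 'TREAT', or None when the type is not of interest."""
    return SEMANTIC_TYPE_TO_LABEL.get(semantic_type)

# Hypothetical annotation output; semantic types assigned by hand for illustration.
concepts = [
    {"concept_name": "interferons beta", "semantic_type": "Pharmacologic Substance"},
    {"concept_name": "metastatic renal cell carcinoma", "semantic_type": "Neoplastic Process"},
    {"concept_name": "evidence", "semantic_type": "Idea or Concept"},
]
labels = [(c["concept_name"], semantic_label(c["semantic_type"])) for c in concepts]
```

Concepts whose semantic type is outside the table receive no label and are ignored when the entity pairs are formed.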

(1). Example of Feature Extraction. Consider the following sentence:

“Preliminary evidence suggests that interferons beta may also induce regression of metastatic renal cell carcinoma.”

The output of the feature extraction step from this sentence is provided in detail in Table 2.

The result of feature extraction is displayed for the user in Figure 5.

3.3. Relation Extraction

The relationships between drug and disease were extracted using a machine learning classifier. The relation extraction process is based on a classification process, which proceeded according to the following relation classes: CURE, PREVENT, SIDE EFFECT, NO CURE, and OTHER RELATION.
CURE: TREAT cures DIS. Example: Intravenous immune globulin for recurrent spontaneous abortion.
PREVENT: TREAT prevents DIS. Example: Statins for the prevention of stroke.
SIDE EFFECT: DIS is the result of TREAT. Example: Malignant mesodermal mixed tumor of the uterus due to radiation therapy.
NO CURE: DIS is not curable by TREAT. Example: The resistance of head lice to some insecticides was revealed.
OTHER RELATION: ONLY DIS (TREAT is not mentioned), ONLY TREAT (DIS is not mentioned), and VAGUE (very unclear relationship).

This classification helped extract relations between entities and was performed by exploiting the extracted features and outputs of the previous step. The traditional machine learning classification techniques performed poorly when the classified data were immense. Therefore, this approach used an SVM, which scaled up relatively well to high-dimensional data [3].

SVM is a well-known supervised learning algorithm. The input of this algorithm is the set of features extracted in the previous step. The algorithm uses these features to find a hyperplane that separates the feature space into classes with a maximum margin. By maximizing the margin, the SVM algorithm attempts to achieve maximum separation between classes and thereby minimize misclassification errors.

In this paper, a supervised SVM classifier was used to classify the drug-disease relations from biomedical databases. The objective of the SVM was to discriminate between classes of relations. The SVM was used with a polynomial kernel function, which is well suited to our context.
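For readers unfamiliar with the polynomial kernel, the following sketch shows its common form, K(x, y) = (gamma * <x, y> + coef0)^degree, in plain Python. The parameter names follow a widespread convention, and the default values are assumptions; the source does not report the degree or the other kernel parameters that were used.

```python
def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=1.0):
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree.

    The default parameter values are illustrative assumptions.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

# Two small hypothetical feature vectors.
value = polynomial_kernel([1.0, 2.0], [3.0, 4.0])
print(value)  # (1*3 + 2*4 + 1) ** 2 = 144.0
```

The kernel lets the classifier separate classes with a polynomial decision surface in the original feature space while still solving a linear maximum-margin problem in the implicit feature space.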

The first step of relation extraction was to provide the classifier with a training set. The training set was composed of feature vectors. It consisted of labeled data in which each sentence was assigned one of the following relation classes: CURE, PREVENT, SIDE EFFECT, NO CURE, and OTHER RELATION. The SVM used the training set to build a model that predicts the target relation class.

The second step was the prediction. To predict the relation class for each sentence in the data file, SVM applies the model on feature vectors, already created in the preprocessing step and semantic annotation. These vectors gather all the features related to each sentence in this data file (one vector for each sentence). Figure 6 shows the results of the relation extraction. By clicking on drug-disease relations extraction, the list of drugs and a list of relations are displayed. Alternatively, when choosing a drug and a type of relation (prevent, cure…), the diseases that have such a relationship with this drug are displayed.

4. Results and Discussion

4.1. Experiment Setup

To validate the proposed approach, a system was implemented. Screenshots are presented in Figures 2–6. For the experiments, we used the standard corpus obtained from MEDLINE 2001. This corpus was annotated with types of semantic relationships between treatment (TREAT) and a disease (DIS). These relationships were CURE, PREVENT, SIDE EFFECT, and NO CURE. This corpus was validated using MEDLINE 2001 Database of biomedical papers [30]. The corpus was used to guarantee the validity of the comparison of the results.

4.2. Results

For the evaluation, performance measures were derived from a confusion matrix (Table 3), whose cells contain the following counts: False Positives (FP), False Negatives (FN), True Positives (TP), and True Negatives (TN). Each row of the matrix recorded the instances in an actual class, and each column recorded the instances in a predicted class.

The confusion matrix for the implemented system is for multiclass classification as shown in Table 4.

For the class CURE, there are 785 TP instances, because they belong to the class CURE and are predicted as CURE. The number of FN instances is 25 = 10 + 5 + 10, because they belong to the class CURE but are not predicted as such. The number of FP instances is 4 = 2 + 1 + 1, because they are predicted as CURE but belong to other classes. The number of TN instances is 82 = 57 + 25 + 0, because they neither belong to the class CURE nor are predicted as CURE.
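The one-vs-rest reading of a multiclass confusion matrix used above can be sketched in Python as follows. The 3 x 3 matrix `TOY_MATRIX` is hypothetical and is not Table 4; rows are actual classes and columns are predicted classes, matching the convention stated earlier, and the function name is an assumption.

```python
def one_vs_rest_counts(matrix, k):
    """Return (TP, FN, FP, TN) for class k.

    matrix[i][j] is the number of instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # actual k, predicted as another class
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually another class
    tn = total - tp - fn - fp                 # neither actual nor predicted k
    return tp, fn, fp, tn

# Hypothetical confusion matrix for three classes (not the values of Table 4).
TOY_MATRIX = [
    [8, 1, 1],
    [2, 5, 0],
    [0, 1, 6],
]
counts = one_vs_rest_counts(TOY_MATRIX, 0)
print(counts)  # (8, 2, 2, 12)
```

Applying the same function to each class index in turn yields the per-class counts from which the measures are computed.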

From the above matrix (Table 4), recall, precision, f-score [42], accuracy [43], and specificity [44, 45] were computed as follows:
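In terms of the counts TP, FN, FP, and TN, these measures have the following standard definitions. They are given here in their usual textbook form as a reference sketch and are not quoted from the original equations.

```latex
\begin{align}
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{F-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP}
\end{align}
```

With the CURE counts given above (TP = 785, FN = 25, FP = 4), these definitions give a recall of 785/810 ≈ 96.91%, a precision of 785/789 ≈ 99.49%, and an F-score of 1570/1599 ≈ 98.19%, which agree with the recall and F-score reported for this class.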

The obtained results were compared with similar works, for example, those of Rosario and Hearst [30], Abacha and Zweigenbaum [31], Suchitra and Sudah [34], Muzaffar et al. [35], Wang et al. [36], and Yu et al. [36] (Tables 5–9), because they worked on the extraction of the same semantic relations. These works were validated using the same standard corpus obtained from MEDLINE 2001.

For recall and precision, only the results of Abacha and Zweigenbaum [31] and Wang et al. [36] were available. The recall and the precision of the class NO CURE for Abacha and Zweigenbaum [31] were not available (NA). Also, the recall and the precision of the classes PREVENT and NO CURE for Wang et al. [36] were not available, because this work was interested only in two relations, namely, CURE and SIDE EFFECT.

The recall in Table 5 shows that Abacha and Zweigenbaum [31] had a better recall (100%) compared with Wang et al. [36] (89.8%) and with the proposed approach (96.91%) for the extraction of the CURE relation. For the rest of the relations, the proposed approach performed better. Also, for the precision measures in Table 6, the proposed approach performed better than those in the works of Abacha and Zweigenbaum [31] and Wang et al. [36].

Table 7 reports the significant results of the previous works. For the proposed approach, an F-score of 98.19% was reported for the extraction of the CURE relation, 85.71% for PREVENT, 79.37% for SIDE EFFECT, and 0% for NO CURE. For the CURE relation, the F-score of the proposed approach exceeded those of all similar works. The performance was largely diminished in the case of NO CURE for all works because of the lack of training examples.

The F-score measure was not reported in the work of Rosario and Hearst [30]. Also, the F-score measure of the class PREVENT was not available for Wang et al. [36], because this work was interested only in two relations. Moreover, the F-score measure of the class SIDE EFFECT was not available for Abacha and Zweigenbaum [31].

The accuracy measure was not available for the works of Abacha and Zweigenbaum [31], Frunza et al. [33], and Wang et al. [36]. Nevertheless, the results demonstrated in Table 8 show that the proposed approach achieved a higher accuracy compared with all similar works for all the relations. The accuracy of NO CURE was not reported in any work except in the proposed approach.

Table 9 represents the specificity measure of the implemented system. This measure is not available in the other works.

The results computed in this study were promising and showed that the combination of the used techniques outperforms the majority of the previous approaches using the same corpus. A possible reason is the appropriate combination of the suggested NLP technique and the UMLS ontology, which detect relevant features (frequency features, lexical features, morphological features, syntactic features, and semantic features) for drug and disease, with a machine learning method (SVM). The proposed approach seems to be suitable when dealing with semantic relations in natural language texts.

The novel idea presented in the study is the integration of a novel NLP approach reinforced by the UMLS ontology and a machine learning method that performed better in a multidimensional context.

5. Conclusion and Future Work

We proposed a novel computational approach for relation extraction between drugs and diseases from a biomedical corpus of texts. This approach combined a suggested NLP technique supported by the UMLS ontology for generating features. These features were conveyed to a machine learning model, that is, the Support Vector Machine, with polynomial kernel function for disease-drug relation extraction.

This study significantly contributed to the existing literature on relation extraction between drugs and diseases from the medical literature. The main contribution of this work is the identification of specific features (lexical, semantic…) related to medical concepts (drug and disease). This finding is a confirmation that, in the field of text mining, these features are relevant for the discovery of interesting relationships between concepts.

The experimental results showed an improvement in performance compared with other similar works.

The upcoming research will focus first on further improvements of the proposed approach. More investigations on the features of medical concepts will be conducted. Then, the next direction will focus on updating the method to assist the professionals in finding relevant and authentic information in extracting semantic relations between other medical entities.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the support of Taif University Researchers Supporting Project (no. TURSP-2020/292), Taif University, Taif, Saudi Arabia.