Extraction of Protein-Protein Interaction from Scientific Articles by Predicting Dominant Keywords
Table 1
Features obtained directly from sentences.
Features
Definitions/remarks
Values
Examples
Keywords
Words representing the relationship between two proteins
One of 180 word stems obtained by stemming 642 words, such as “interact”, “bind”, “active”, and “depend”, that are observed frequently in sentences describing PPI
Distance between protein pair and keyword: three types
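The word distance, defined as the number of words appearing between two of the following three words: the keyword mentioned above and the two protein names constituting the protein pair; Type 1 is the distance between the first protein name and the keyword, Type 2 is the distance between the keyword and the second protein name, and Type 3 is the distance between the two protein names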
Integer value
In sentence “P1 is driven by P2”, if “driven” is the keyword, Type 1, Type 2, and Type 3 are 1, 1, and 3, respectively
Position of keyword: three types
The word order of protein pair and keyword
“Infix” (the order of the sentence is [protein-keyword-protein]), “prefix” (the order of the sentence is [keyword-protein-protein]), or “postfix” (the order of the sentence is [protein-protein-keyword])
In sentence “P1 is driven by P2”, the feature value is “infix”
Position of protein names
The position of the protein name counted in words from the beginning of the sentence, with the first word at position 1; Positions 1 and 2 are defined for the first and second protein names of the pair, respectively
Integer value
In sentence “P1 is driven by P2”, Position 1 is 1 and Position 2 is 5
Comma between keyword and protein pair: four types
Because the topic often changes before and after a comma, we record whether any comma appears between the keyword and the protein pair
“yy”, “nn”, “yn”, or “ny” (each letter indicates whether a comma is observed between one pair among three words: the keyword and the two protein names constituting the protein pair; e.g., “yy” means commas are observed between both pairs)
In sentence “P1 is driven by P2”, the feature value is “nn”
Negative words
Whether any negative word such as “not”, “unable”, or “incapable” appears between the protein names or between the keyword and a protein name
“True” or “false”
In sentence “P1 is not driven by P2”, the feature value is “true”
Conjunctive words
Whether one of the following 16 kinds of words representing conjunctive relations appears: “where”, “when”, “what”, “why”, “how”, “as”, “though”, “although”, “because”, “so”, “therefore”, “hence”, “since”, “wherein”, “whereas”, and “whereby”
“True” or “false”
In sentence “P1 is not driven by P2”, the feature value is “false”
“Which”
Whether “which” appears; since “which” also represents a conjunctive relation but occurs more frequently than the 16 words mentioned above, we treat it as a separate feature
“True” or “false”
“But”
Whether “but” appears; like “which”, “but” frequently represents a conjunctive relation; unlike “which”, however, it introduces negation into the context
“True” or “false”
Words representing assumptions or conditions
Whether “if” or “whether” appears between the protein names or between the keyword and a protein name
“True” or “false”
Preposition of keyword
The preposition following the keyword, provided that the word distance between the keyword and the preposition is within 3; if several prepositions qualify, the one nearest to the keyword is used
One of the prepositions
In sentence “P1 is driven by P2”, the feature value is “by”
Multiple occurrences of keywords
Whether there is more than one keyword in a sentence
“True” or “false”
In sentence “P1 is driven by P2”, the feature value is “false”
Second keywords: seven kinds
For each of seven particular words (“bind”, “interact”, “regulate”, “induce”, “stimulate”, “associate”, and “known”), whether the word appears between the protein names when it has not been selected as the keyword; compared with other keywords, these seven words can be regarded as particularly important in PPI information, and this feature prevents them from being overlooked
“True” or “false” for each of the seven words (if some of these seven words appear in the sentence and are not selected as a keyword, we use “true” as a feature value for them)
In sentence “P1 binds P2”, since “bind” is already selected as the keyword, the feature value of the second keyword (bind) is “false”; since none of the other six words appears in the sentence, the feature value for each of them is also “false”
Parallel expression of protein pair
Whether the protein names constituting the protein pair are adjacent in the word order of the sentence (they are also considered adjacent if only “—”, “/”, “and”, “or”, or “(” appears between them); if protein names are expressed in parallel in a sentence, an interaction between them is unlikely, and checking adjacency is an easy way to detect such a parallel expression
“True” or “false”
In sentence “Protein binds P1 or P2”, the feature value is “true”
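The positional features in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `sentence_features`, its argument names, and the regular-expression tokenizer are hypothetical, and each protein name and the keyword are assumed to occur exactly once in the sentence. The comma feature is assumed here to check the keyword against the first and then the second protein name, and the negative-word set contains only the three examples listed in the table.

```python
import re

# Example negative words named in Table 1; the full list is not given there.
NEGATIVE_WORDS = {"not", "unable", "incapable"}
# Tokens allowed between "adjacent" protein names ("\u2014" is the long dash
# from Table 1); the ASCII hyphen is an added assumption.
CONNECTORS = {"\u2014", "-", "/", "and", "or", "("}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def word_distance(i, j):
    """Number of words strictly between word indices i and j."""
    return abs(i - j) - 1


def sentence_features(sentence, p1, p2, keyword):
    """Compute a subset of the Table 1 features (hypothetical helper)."""
    tokens = tokenize(sentence)
    # Punctuation is kept for the comma feature but not counted as a word.
    words = [t for t in tokens if re.match(r"\w", t)]
    w1, w2, wk = words.index(p1), words.index(p2), words.index(keyword)
    t1, t2, tk = tokens.index(p1), tokens.index(p2), tokens.index(keyword)

    # Position of keyword relative to the protein pair.
    if wk < min(w1, w2):
        position = "prefix"
    elif wk > max(w1, w2):
        position = "postfix"
    else:
        position = "infix"

    def comma_between(a, b):
        lo, hi = sorted((a, b))
        return "y" if "," in tokens[lo + 1:hi] else "n"

    # The span from the leftmost to the rightmost of the three words covers
    # every protein-protein and keyword-protein interval.
    lo, hi = min(w1, w2, wk), max(w1, w2, wk)
    negative = any(w.lower() in NEGATIVE_WORDS for w in words[lo + 1:hi])

    # Parallel expression: nothing, or a single connector, between the names.
    a, b = sorted((t1, t2))
    between = tokens[a + 1:b]
    parallel = len(between) == 0 or (
        len(between) == 1 and between[0].lower() in CONNECTORS
    )

    return {
        "distances": (word_distance(w1, wk), word_distance(wk, w2),
                      word_distance(w1, w2)),          # Types 1, 2, 3
        "keyword_position": position,
        "protein_positions": (w1 + 1, w2 + 1),         # 1-based word positions
        "comma": comma_between(tk, t1) + comma_between(tk, t2),
        "negative": negative,
        "parallel": parallel,
    }
```

For the sentence “P1 is driven by P2” with keyword “driven”, this sketch yields distances (1, 1, 3), keyword position “infix”, protein positions (1, 5), and comma value “nn”, matching the examples in the table; for “Protein binds P1 or P2” the parallel-expression value is true.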