Research Article

Extraction of Protein-Protein Interaction from Scientific Articles by Predicting Dominant Keywords

Table 1

Features obtained directly from sentences.

Columns: Features; Definitions/remarks; Values; Examples

Feature: Keywords
Definition/remarks: Words representing the relationship between two proteins.
Values: One of 180 word stems, obtained by stemming 642 words observed frequently in sentences describing PPI, such as “interact”, “bind”, “active”, and “depend”.

Feature: Distance between protein pair and keyword (three types)
Definition/remarks: The word distance, defined as the number of words appearing between two of the three terms involved: the keyword mentioned above and the two protein names constituting the protein pair. Types 1, 2, and 3 each measure the distance between a different pair of these terms.
Values: Integer value.
Example: In the sentence “P1 is driven by P2”, if “driven” is the keyword, Type 1, Type 2, and Type 3 are 1, 1, and 3, respectively (one word separates the keyword from each protein name, and three words separate the two protein names).

Feature: Position of keyword (three types)
Definition/remarks: The word order of the protein pair and the keyword.
Values: “Infix” (the keyword appears between the two protein names), “prefix” (the keyword precedes both protein names), or “postfix” (the keyword follows both protein names).
Example: In the sentence “P1 is driven by P2”, the feature value is “infix”.

Feature: Position of protein names
Definition/remarks: One plus the word distance between the first word of the sentence and the protein name; Positions 1 and 2 are defined for the two protein names constituting the protein pair.
Values: Integer value.
Example: In the sentence “P1 is driven by P2”, Position 1 is 1 and Position 2 is 5.

Feature: Comma between keyword and protein pair (four types)
Definition/remarks: Since the topic often changes before and after a comma, we record whether any comma appears between the keyword and the protein pair.
Values: “yy”, “nn”, “yn”, or “ny”. Each letter records whether a comma is observed (“y”) or not (“n”) between one pair of the three terms involved: the keyword and the two protein names constituting the protein pair. For example, “yy” means commas are observed between both pairs.
Example: In the sentence “P1 is driven by P2”, the feature value is “nn”.

Feature: Negative words
Definition/remarks: Whether any negative word such as “not”, “unable”, or “incapable” appears between the protein names, or between the keyword and a protein name.
Values: “True” or “false”.
Example: In the sentence “P1 is not driven by P2”, the feature value is “true”.

Feature: Conjunctive words
Definition/remarks: Whether one of the following 16 words representing conjunctive relations appears: “where”, “when”, “what”, “why”, “how”, “as”, “though”, “although”, “because”, “so”, “therefore”, “hence”, “since”, “wherein”, “whereas”, and “whereby”.
Values: “True” or “false”.
Example: In the sentence “P1 is not driven by P2”, the feature value is “false”.

Feature: “Which”
Definition/remarks: Whether “which” appears; “which” also represents a conjunctive relation but occurs more frequently than the 16 words listed above, so it is treated as a separate feature.
Values: “True” or “false”.

Feature: “But”
Definition/remarks: Whether “but” appears; like “which”, “but” frequently represents a conjunctive relation; however, “but” also introduces negation into the context.
Values: “True” or “false”.

Feature: Words representing assumptions or conditions
Definition/remarks: Whether “if” or “whether” appears between the protein names, or between the keyword and a protein name.
Values: “True” or “false”.

Feature: Preposition of keyword
Definition/remarks: The preposition following the keyword, provided that the word distance between the keyword and the preposition is within 3; if there are several such prepositions, the one nearest to the keyword is used.
Values: One of the prepositions.
Example: In the sentence “P1 is driven by P2”, the feature value is “by”.

Feature: Multiple occurrences of keywords
Definition/remarks: Whether there is more than one keyword in a sentence.
Values: “True” or “false”.
Example: In the sentence “P1 is driven by P2”, the feature value is “false”.

Feature: Second keywords (seven kinds)
Definition/remarks: For each of seven particular words (“bind”, “interact”, “regulate”, “induce”, “stimulate”, “associate”, and “known”), when that word is not selected as the keyword, whether it appears between the protein names. Compared with other keywords, these seven words can be regarded as particularly important for PPI information, and this feature prevents them from being overlooked as keywords.
Values: “True” or “false” for each of the seven words (if some of these seven words appear in the sentence and are not selected as a keyword, we use “true” as the feature value for them).
Example: In the sentence “P1 binds P2”, since “bind” is already selected as the keyword, the feature value of the second keyword (bind) is “false”; since none of the other six words appears in the sentence, the feature value of each of them is also “false”.

Feature: Parallel expression of protein pair
Definition/remarks: Whether the protein names constituting the protein pair are adjacent (they are also considered adjacent if “—”, “/”, “and”, “or”, or “(” appears between them). If protein names are expressed in parallel in a sentence, the sentence is unlikely to describe an interaction between them; a parallel expression can be detected easily by checking whether the protein names are adjacent in the word order of the sentence.
Values: “True” or “false”.
Example: In the sentence “Protein binds P1 or P2”, the feature value is “true”.
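Several of the features in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes whitespace tokenization and that the token indices of the two protein names and the keyword are already known. The function name sentence_features, the dictionary keys, and the preposition list are hypothetical. The sketch labels the three distances by term pair rather than by type number, takes a protein position as the 1-based word index (which matches the example in the table), and reads "within 3" as at most three intervening words.

```python
NEGATIVE_WORDS = {"not", "unable", "incapable"}   # the examples named in Table 1
PREPOSITIONS = {"by", "with", "to", "of", "in", "on", "for", "from"}  # hypothetical list
PARALLEL_SEPARATORS = {"—", "/", "and", "or", "("}  # separators listed in Table 1


def words_between(i, j):
    """Number of tokens strictly between token indices i and j."""
    return abs(i - j) - 1


def sentence_features(tokens, p1, p2, kw):
    """Compute a subset of the Table 1 features.

    tokens: list of words; p1 < p2: indices of the two protein names;
    kw: index of the keyword (assumed distinct from p1 and p2).
    """
    lowered = [t.lower() for t in tokens]

    # Position of keyword relative to the protein pair.
    if kw < p1:
        position = "prefix"
    elif kw > p2:
        position = "postfix"
    else:
        position = "infix"

    # Negative word between the protein names, or between the keyword and
    # a protein name: scan the span covered by the three terms, skipping
    # the terms themselves.
    lo, hi = min(p1, kw), max(p2, kw)
    negative = any(
        w in NEGATIVE_WORDS
        for i, w in enumerate(lowered)
        if lo < i < hi and i not in (p1, p2, kw)
    )

    # Nearest preposition following the keyword, with at most three
    # intervening words (an assumed reading of "within 3").
    preposition = None
    for i in range(kw + 1, len(tokens)):
        if words_between(kw, i) > 3:
            break
        if lowered[i] in PREPOSITIONS:
            preposition = lowered[i]
            break

    # Parallel expression: the protein names are adjacent, or only
    # separator tokens appear between them.
    parallel = all(w in PARALLEL_SEPARATORS for w in lowered[p1 + 1:p2])

    return {
        "distance_p1_kw": words_between(p1, kw),
        "distance_kw_p2": words_between(kw, p2),
        "distance_p1_p2": words_between(p1, p2),
        "keyword_position": position,
        "position_1": p1 + 1,
        "position_2": p2 + 1,
        "negative_words": negative,
        "preposition": preposition,
        "parallel": parallel,
    }


tokens = "P1 is driven by P2".split()
print(sentence_features(tokens, p1=0, p2=4, kw=2))
```

For the sentence “P1 is driven by P2” with “driven” as the keyword, the sketch yields distances of 1, 1, and 3, the keyword position “infix”, protein positions 1 and 5, and the preposition “by”, in agreement with the examples in Table 1. The comma feature is omitted because the sketch does not assume which term pairs its two letters refer to.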