Special Issue: Machine Learning in Mobile Computing: Methods and Applications
Open Relation Extraction in Patent Claims with a Hybrid Network
Research on relation extraction from patent documents, a high-priority topic in natural language processing (NLP) in recent years, is of great significance to a series of downstream patent applications, such as patent content mining, patent retrieval, and patent knowledge base construction. Because patent claims contain lengthy sentences, cross-domain technical terms, and complex structures, it is extremely difficult to extract open triples with traditional NLP parsers. In this paper, we propose an Open Relation Extraction (ORE) approach that transforms relation extraction in patent claims into a sequence labeling problem and extracts non-predefined relation triples from patent claims with a hybrid neural network architecture based on a multihead attention mechanism. The hybrid framework combines Bi-LSTM and CNN to extract argument phrase features and relation phrase features simultaneously. The Bi-LSTM network captures long-distance dependency features, and the CNN captures local content features; a multihead attention mechanism is then applied to capture potential dependency relationships across the time steps of the RNN model. Results on our constructed open patent relation dataset show that our method outperforms both traditional machine learning classification algorithms and state-of-the-art neural network classification models in terms of Precision, Recall, and F1.
With the development of the economy, patent documents, being an extremely important knowledge carrier, record a large number of valuable inventions, creative ideas, and excellent design concepts. Automatically extracting non-predefined relation triples from patent claims, which set out a series of rights granted by a government for a limited period, is a vital basic research task for higher-level applications of patent document analysis, such as patent information retrieval [1, 2], patent classification, patent categorization, and patent knowledge graph construction.
However, relation extraction from patent documents is not an easy task. On the one hand, the specification requirements for patent writing lead to lengthy and complex sentences, which are difficult to parse with general-purpose NLP tools; on the other hand, traditional approaches (NLP-based linguistic methods, statistics-based machine learning methods, and multimethod hybrid methods) cannot capture temporal information or sentence-level global dependency features over long sentences.
In this paper, we propose a hybrid neural network model for open relation extraction that extracts relation triples from patent claims. The Bi-LSTM network obtains temporal information from the whole sentence, and CNN pooling captures local content information; at the same time, multihead attention is incorporated to extract content dependency features so as to better serve the sequence label classification problem. Our main contributions are summarized as follows:
(1) A hybrid neural network (Bi-LSTM+CNN+CRF) open relation extraction (ORE) model is proposed for the first time to extract non-predefined triples from patent documents
(2) A multihead attention technique is used to improve sequence label dependency classification
(3) We constructed an open patent relation corpus to support supervised approaches to the ORE task in patent analysis, comprising 1309 annotated claims with about 29850 sentences
(4) We systematically compare the performance of a series of existing neural models in the context of ORE in patent claims. In addition, a variety of experiments help readers better understand the reliability of our hybrid model
2. Related Work
For traditional semantic relation extraction from patent documents, there are mainly three kinds of methods: NLP-based linguistic methods, statistics-based machine learning methods, and multimethod hybrid methods.
On the one hand, in the early period of semantic relation extraction from patent documents, NLP-based linguistic methods were dominant, and most existing methods made use of linguistic analysis. Regular expression pattern matching techniques were proposed to parse, annotate, and extract target semantic information for knowledge sharing in the machine-readable format OWL; hyponymy lexical relations were extracted from patent documents using lexico-syntactic patterns, and knowledge was extracted from unstructured patent data in combination with a domain ontology. Data-intensive methods were combined with symbolic grammar formalisms in patent claim analysis to enhance analysis robustness. Conceptual graphs were extracted from patent claims to compare patent similarity or for analysis in any domain of interest. A patent processing system named PATExpert was designed for summarizing patent claims, in which deep syntactic dependency analysis operates on deep-syntactic structures of the claims to improve their readability. Gabriela et al. proposed extracting verbal content relations from patent claims using deep syntactic structures. Fantoni et al. proposed a method for automatically detecting and extracting information about the functions, physical behaviours, and states of a system from patent text with a large knowledge base and a series of NLP tools. Lee et al. proposed a hierarchical keyword vector for representing the dependency relationships among claim elements and a tree matching algorithm for comparing claim elements of patents to assess patent infringement risks. Taeyeoun Roh et al. proposed a series of rules to structure and layer technological information in patent claims through NLP tools.
On the other hand, statistics-based machine learning has frequently been applied to patent analysis in recent years. Gabriela proposed a two-stage method consisting of rule-based claim paragraph segmentation and conditional random field (CRF)-based segmentation of lengthy sentences, which helps automatically detect division phrases for forming meaningful shorter sentences. Wang et al. present an approach to extracting principle knowledge from process patents, classifying it with a contraction matrix. Okamoto et al. proposed an information-based technique to grasp the patent claim structure through entity mention extraction and relation extraction with the DeepDive platform, which uses Markov logic network-based inference and distant supervision-based labeling to extract relations from unstructured text. Deng et al. proposed constructing a knowledge graph to facilitate technology transfer, where a common knowledge base can reveal the technical details of technical documents and assist with the identification of suitable technologies.
In addition, with the rise of deep learning technology, especially its wide application in natural language processing, hybrid techniques that combine the above methods have emerged for patent mining tasks such as patent information extraction, patent relation extraction, and construction of patent semantic knowledge bases. Yang and Soo proposed a method to convert a patent claim into a formally defined conceptual graph with hybrid techniques involving part-of-speech tags, conceptual graphs, domain ontology, and dependency trees. Korobkin et al. proposed a hybrid methodology of LDA-based statistical and semantic text analysis to extract physical knowledge in the form of physical effects and their practical applications. Carvalho et al. present a hybrid method of extracting semantic information from patent claims by using semantic annotations on phrasal structures, abstracting domain ontology information, and outputting ontology-friendly structures to achieve generalization. Lv et al. proposed a hybrid method of patent terminology relation extraction that combines an attention mechanism with a Bi-LSTM model to construct a patent knowledge graph.
Unlike traditional relation extraction, where the categories of relationships are defined in advance, open relation extraction (ORE) extracts non-predefined triples from unstructured text. ORE was first defined by Banko et al., who proposed extracting non-predefined relations from the web, attracting extensive attention and follow-up research in various fields. Del Corro and Gemulla then proposed a dependency parsing-based clause IE framework to detect and extract “useful” pieces of information from clauses. Neural networks have also been incorporated into ORE [29, 30] with end-to-end sequence models or encoder-decoder models.
Our work is similar to Lv et al. and [29–31], but we are the first to combine Bi-LSTM and an attention mechanism with open relation extraction to extract non-predefined relationships from patent documents in the form of Subject-Relation-Object triples. Because we believe that NLP-based parsing tools cannot capture the long-distance dependency relationships of lengthy patent sentences, we expect attention applied at different stages to improve end-to-end sequence labeling classification. We propose a hybrid neural network framework for extracting open relations from patent claims with multihead attention. Although Bidirectional Encoder Representation from Transformers (BERT) [32–35], another neural network model based on the bidirectional transformer, performs excellently in a series of natural language processing tasks including sequence tagging, we leave it for future work.
3. Our Hybrid ORE Neural Framework
This paper proposes a supervised neural network for extracting open relations from patent claims without predefined relation categories, which enables a supervised machine learning approach to ORE in patent claims. We define the task as a sequence tagging problem, and we develop an end-to-end neural model that combines Bi-LSTM and CNN with multihead attention to classify the tags. In the first stage, to deal with the lengthy and complex sentence structure, a machine learning-based method is used to detect segmentation words or phrases and split each claim into meaningful short sentences. Word features and part-of-speech features are then fed into the Bi-LSTM network. Next, a multihead attention mechanism is applied to the Bi-LSTM features to help classify dependency relationship labels. Finally, a postprocessing step produces Argument1-relation-Argument2 triples. Our neural ORE architecture is shown in Figure 1.
3.1. Task Formulation
In this paper, we define our neural ORE model as extracting triples from patent claims, where a triple usually consists of a predicate and two arguments, each a contiguous span of the sentence. As shown in the following table, the formulation uses the more expressive BIEOS tagging scheme (indicated by the dashed lines), which captures dependency relationships in the content better than the BIO tagging scheme. Relation phrase labels are encoded with the label types Verb, Prep, or Noun, while arguments are represented with Arg labels, where Arg1 stands for the first argument and Arg2 for the second argument. Several examples are shown in Table 1.
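To make the tagging formulation concrete, the following is a minimal sketch of how a BIEOS-tagged sentence could be turned back into an Argument1-relation-Argument2 triple. The tag spelling (`B_Arg1`, `S_Verb`, and so on) follows the examples in this paper, but the helper names and the example sentence are assumptions made for illustration, not the authors' postprocessing code.

```python
from typing import List, Optional, Tuple

# Relation phrase label types named in the text; arguments use Arg1 / Arg2.
RELATION_LABELS = {"Verb", "Prep", "Noun"}


def spans_from_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (label, phrase) spans under a BIEOS scheme."""
    spans = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current_label, current_tokens = None, []
            continue
        prefix, label = tag.split("_", 1)
        if prefix == "S":
            # Single-token span.
            spans.append((label, token))
            current_label, current_tokens = None, []
        elif prefix == "B":
            current_label, current_tokens = label, [token]
        elif prefix in ("I", "E") and label == current_label:
            current_tokens.append(token)
            if prefix == "E":
                spans.append((label, " ".join(current_tokens)))
                current_label, current_tokens = None, []
        else:
            # Malformed sequence (I/E without a matching B): drop the open span.
            current_label, current_tokens = None, []
    return spans


def triple_from_tags(tokens: List[str], tags: List[str]) -> Optional[Tuple[str, str, str]]:
    """Return (arg1, relation, arg2), or None if any part is missing."""
    spans = spans_from_tags(tokens, tags)
    arg1 = next((p for label, p in spans if label == "Arg1"), None)
    rel = next((p for label, p in spans if label in RELATION_LABELS), None)
    arg2 = next((p for label, p in spans if label == "Arg2"), None)
    if arg1 is None or rel is None or arg2 is None:
        return None
    return (arg1, rel, arg2)


# Hypothetical example sentence, not taken from the dataset.
tokens = ["a", "flexible", "press", "plate", "supports", "the", "frame"]
tags = ["B_Arg1", "I_Arg1", "I_Arg1", "E_Arg1", "S_Verb", "B_Arg2", "E_Arg2"]
triple = triple_from_tags(tokens, tags)
print(triple)
```

The sketch keeps only the first span of each kind, which is enough for a short segmented sentence; a sentence carrying several triples would need a pairing rule that the text does not specify.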
3.2. Feature Embedding
Word embedding transforms a word token into a real-valued vector that represents syntactic and semantic information from the content. Given a sentence consisting of $n$ words $S = \{w_1, w_2, \ldots, w_n\}$, every word $w_i$ is converted into a real-valued vector $e_i$ by looking up the embedding matrix $W^{word} \in \mathbb{R}^{d \times |V|}$, where $V$ stands for the whole vocabulary and $d$ represents the size of the word embedding. We use GloVe as our word embedding model. Part-of-speech embedding transforms the POS tag of each word in the sentence into a one-hot vector $p_i$, whose tag set comes from the annotated Brown corpus with 36 types. Finally, the concatenation of the word embedding and the POS embedding is the input feature of our neural model.
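The concatenation of word and POS features can be sketched as follows. This is an illustration under stated assumptions: a random matrix stands in for pretrained GloVe vectors, the two-tag POS inventory is a toy, and `embed_sentence` and the index dictionaries are hypothetical names.

```python
import numpy as np


def one_hot(index: int, size: int) -> np.ndarray:
    """Return a one-hot vector of the given size."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


def embed_sentence(words, pos_tags, word_index, embedding_matrix, pos_index):
    """Concatenate each word's embedding with the one-hot vector of its POS tag."""
    rows = []
    for word, pos in zip(words, pos_tags):
        # Unknown words fall back to row 0 (an assumption of this sketch).
        word_vec = embedding_matrix[word_index.get(word, 0)]
        pos_vec = one_hot(pos_index[pos], len(pos_index))
        rows.append(np.concatenate([word_vec, pos_vec]))
    return np.stack(rows)


rng = np.random.default_rng(0)
word_index = {"<unk>": 0, "a": 1, "press": 2, "plate": 3}
pos_index = {"DT": 0, "NN": 1}  # toy tag set; the paper uses a larger inventory
# Random stand-in for a pretrained 300-dimensional embedding table.
embedding_matrix = rng.normal(size=(len(word_index), 300)).astype(np.float32)

features = embed_sentence(
    ["a", "press", "plate"], ["DT", "NN", "NN"],
    word_index, embedding_matrix, pos_index,
)
print(features.shape)  # (3, 302): 300 word dimensions + 2 POS dimensions
```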
3.3. Bi-LSTM Network
As deep learning and natural language processing have become more closely combined, the long short-term memory (LSTM) network, first proposed by Hochreiter and Schmidhuber in 1997 to address the vanishing gradient problem, has proven effective at capturing long-distance relationships in different NLP subtasks. The transfer diagram of adjacent units in an LSTM neural network is shown in Figure 2.
The core design philosophy of LSTM is an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current input. The calculation involves the four parts described next.
A typical LSTM network consists of four parts: one forget gate $f_t$, one input gate $i_t$, one current cell state $c_t$, and one output gate $o_t$. Through the iterative calculation of these four parts, cell units decide whether to take in the inputs, forget the memory stored before, and output the state generated later. A bidirectional LSTM network combines a forward LSTM network and a backward LSTM network, where the hidden layer of the latter runs in the opposite direction to that of the former, so that it can capture future information as well as past information. The Bi-LSTM model is therefore able to exploit information from both the past and the future, which makes it more suitable for sequence tagging tasks. In this paper, we use the Bi-LSTM model to obtain semantic and syntactic information from the sentence, and we obtain the combined hidden representation by an element-wise sum of the forward hidden state $\overrightarrow{h_t}$ and the backward hidden state $\overleftarrow{h_t}$ from the two subnetworks: $h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$.
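For reference, the four parts named above can be written in the widely used LSTM formulation without peephole connections. This is a sketch of the standard equations, assumed rather than confirmed to match the exact variant used in the paper; $x_t$ is the input at step $t$, $h_{t-1}$ the previous hidden state, $\sigma$ the logistic sigmoid, $\odot$ the element-wise product, and the $W$, $U$, and $b$ terms are learned parameters whose names are chosen here for illustration.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(current cell state)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The forward and backward networks each apply these updates, reading the sentence in opposite directions, before their hidden states are summed element-wise.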
3.4. Multihead Attention
The attention mechanism has become a predominant concept in the neural network literature in recent years and has been studied within the artificial intelligence (AI) community in a large number of applications, such as speech recognition, computer vision, natural language processing, and statistical learning. In this paper, we adopt multihead attention, which has shown excellent performance in many tasks, such as reading comprehension (Cheng et al., 2016), textual entailment (Parikh et al., 2016), and automatic text summarization (Paulus et al., 2017). The essence of multihead attention is to perform self-attention several times in parallel, which enables a sequence-to-sequence neural model to obtain features from different representation subspaces, so that the model can capture more context information from sentences. The relevant attention equations are as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right)$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O}$$

where $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix of the multihead attention mechanism, and in the above equations, $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$. For each head, we compute the attention weights with the scaled dot-product attention above, and finally, we concatenate the heads to form the output of the attention layer.
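A minimal NumPy sketch of multihead self-attention over a sequence of hidden states may help make the computation concrete. It assumes the scaled dot-product formulation of Vaswani et al., uses one full-width projection matrix per role that is then split into heads (equivalent to separate per-head matrices), and draws random weights in place of trained ones; the function names and toy dimensions are illustrative only.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(H, W_q, W_k, W_v, W_o, num_heads):
    """H: (seq_len, d_model); each W_*: (d_model, d_model).

    Returns the attended sequence (seq_len, d_model) and the attention
    weights (num_heads, seq_len, seq_len).
    """
    seq_len, d_model = H.shape
    d_k = d_model // num_heads

    def split_heads(X):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return X.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q = split_heads(H @ W_q)
    K = split_heads(H @ W_k)
    V = split_heads(H @ W_v)

    # Scaled dot-product attention, computed independently for every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (num_heads, seq_len, d_k)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights


rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 4  # toy sizes; 4 heads as in the paper
H = rng.normal(size=(seq_len, d_model))  # stand-in for Bi-LSTM outputs
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(H, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)  # (5, 8) (4, 5, 5)
```

Each row of `weights[i]` sums to one and describes how strongly one time step attends to every other step in head `i`.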
3.5. CNN Network
A convolutional neural network (CNN) is a good means of capturing salient local features from a whole sequence, because its flexible convolutional structure can learn local semantic patterns in multidimensional feature extraction. Convolution can be thought of as the product of a weight vector and a window of the sequence vector, with the weight matrix acting as the filter for the convolution. Given several convolution window lengths, the different outputs are fed to a max-pooling layer, which yields a feature vector of fixed length.
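The way several window lengths plus max-pooling produce a fixed-length vector can be sketched in NumPy as follows. The ReLU activation, the window lengths (2, 3, 4), the filter count, and the random weights are assumptions of this sketch; the paper does not state these details in the text.

```python
import numpy as np


def conv1d_max_pool(X, filters, bias):
    """Slide filters over the sequence, then max-pool over positions.

    X: (seq_len, d); filters: (num_filters, window, d); bias: (num_filters,).
    Returns a vector of length num_filters, whatever seq_len is.
    """
    num_filters, window, _ = filters.shape
    n_positions = X.shape[0] - window + 1
    feature_maps = np.empty((n_positions, num_filters))
    for t in range(n_positions):
        patch = X[t:t + window]  # (window, d)
        # Product of every filter with the current window of the sequence.
        feature_maps[t] = np.tensordot(filters, patch, axes=([1, 2], [0, 1])) + bias
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU (assumed activation)
    return feature_maps.max(axis=0)  # max over sequence positions


rng = np.random.default_rng(0)
d, num_filters, windows = 6, 4, (2, 3, 4)
filter_bank = [
    (rng.normal(size=(num_filters, w, d)), np.zeros(num_filters)) for w in windows
]


def encode(X):
    """Concatenate the pooled outputs of all window lengths."""
    return np.concatenate([conv1d_max_pool(X, f, b) for f, b in filter_bank])


short_vec = encode(rng.normal(size=(7, d)))
long_vec = encode(rng.normal(size=(12, d)))
print(short_vec.shape, long_vec.shape)  # (12,) (12,)
```

Sequences of 7 and 12 tokens both map to 12 features (3 window lengths times 4 filters), which is the fixed-length property the pooling layer provides.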
3.6. CRF Layer
The outputs of a softmax layer are independent of one another, even though the Bi-LSTM can learn semantic and syntactic information about the content. For some tasks, however, such as noun chunking and Named Entity Recognition (NER), the output labels constrain one another. Taking “a/B_Arg1 flexible/I_Arg1 press/I_Arg1 plate/E_Arg1” as an example, label B_Arg1 must come before I_Arg1, and label E_Arg1 must come after labels B_Arg1 and I_Arg1; any other order is illegal. The label decoding of the CRF layer is implemented with dynamic programming, and a model with the CRF layer is expected to outperform one without it on this sequence estimation problem.
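The effect of label constraints on decoding can be illustrated with a small Viterbi sketch. A trained CRF learns its transition scores from data; here, as a loudly stated simplification, legal BIEOS transitions get score 0 and illegal ones a large negative score, and the emission scores are made up so that per-token argmax produces an illegal sequence.

```python
import numpy as np

TAGS = ["O", "B_Arg1", "I_Arg1", "E_Arg1", "S_Arg1"]
NEG = -1e9  # stands in for "forbidden"


def allowed(prev: str, nxt: str) -> bool:
    """Hard-coded BIEOS legality rule (an assumption of this sketch)."""
    if prev[0] in "BI":
        # Inside a span: it must continue or end with the same label.
        return nxt[0] in "IE" and nxt[2:] == prev[2:]
    # Outside a span (after O, E, or S): only O, B, or S may follow.
    return nxt[0] in "OBS"


def build_transitions(tags):
    scores = np.full((len(tags), len(tags)), NEG)
    for i, prev in enumerate(tags):
        for j, nxt in enumerate(tags):
            if allowed(prev, nxt):
                scores[i, j] = 0.0
    return scores


def viterbi(emissions, transitions, start, end):
    """Dynamic programming search for the best-scoring tag path."""
    seq_len, n_tags = emissions.shape
    score = start + emissions[0]
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        candidates = score[:, None] + transitions  # rows: previous, cols: next
        back[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + emissions[t]
    best = int((score + end).argmax())
    path = [best]
    for t in range(seq_len - 1, 0, -1):
        best = int(back[t, best])
        path.append(best)
    return path[::-1]


transitions = build_transitions(TAGS)
start = np.array([0.0 if tag[0] in "OBS" else NEG for tag in TAGS])
end = np.array([0.0 if tag[0] in "OES" else NEG for tag in TAGS])

# Made-up emission scores for "a flexible press plate" (columns follow TAGS).
emissions = np.array([
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
])

greedy = [TAGS[i] for i in emissions.argmax(axis=1)]
decoded = [TAGS[i] for i in viterbi(emissions, transitions, start, end)]
print(greedy)   # ['B_Arg1', 'I_Arg1', 'E_Arg1', 'E_Arg1'] (illegal: E after E)
print(decoded)  # ['B_Arg1', 'I_Arg1', 'I_Arg1', 'E_Arg1']
```

Independent per-token decisions pick the locally best tag for "press" and break the span, while the constrained search trades a slightly lower local score for a legal sequence.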
We extracted 1309 claims from USPTO patent documents, and thirty undergraduates annotated the claims over about 2 months. The constructed dataset contains 29850 sentences, of which 60% are used for training, 20% for validation, and 20% for testing. For Argument1 and Argument2, we use the BIEOS labeling scheme, which is also used for relation phrase labels. There are three relation types in the labeled dataset, and each relation phrase is tagged either with the single tag “S” or with the multi-token tags “BE” or “BIE.” The statistics of all labels are shown in Table 1. Finally, we evaluate our patent ORE model on this dataset. The results are measured by Precision (P), Recall (R), and F1-score, which are defined in Table 2.
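As a reminder of how the three measures relate, the sketch below computes them over sets of extracted triples using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). Whether the paper scores at the triple level or the tag level, and how partial matches are treated, is not stated in the text, so exact-match triples are an assumption here, and the example triples are invented.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented triples for illustration only.
gold = [
    ("a press plate", "supports", "the frame"),
    ("the frame", "is mounted on", "a base"),
    ("a motor", "drives", "the shaft"),
]
predicted = [
    ("a press plate", "supports", "the frame"),
    ("a motor", "drives", "the shaft"),
    ("the shaft", "of", "a base"),
    ("a base", "supports", "a motor"),
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```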
4.2. Hyperparameters Setting
We implement our model in Python 3.5 with Keras on an NVIDIA Quadro P2000. The Adam optimizer is used, with the learning rate set to 0.01 and the batch size set to 50. For multihead attention, we set the number of attention heads to 4. We use GloVe as the word embedding model, with the dimension of word vectors set to 300. The part-of-speech embedding is a one-hot vector of size 26, and the relation label embedding size is set to 12. The dropout rate is set to 0.1, and L2 regularization is also employed in training, both to prevent overfitting. The maximum sentence length is set to 100. The detailed parameters of the framework are shown in Table 3.
4.3. Experiments and Discussion
In our model, the label embedding of the hybrid neural network consists of word embedding, part-of-speech embedding, and relation tagging embedding. These features are concatenated along the last dimension for each word, so that more information is incorporated into the embedding layer. The attention mechanism used in our model is multihead attention, and the attention layer is followed by a CNN layer. From the series of experiments in Table 4, we conclude that the hybrid neural network models perform better than a traditional neural network model such as Bi-LSTM (compare model 1 and model 2 in Table 4), and that neural network models with label embedding perform better than those without it (compare model 3 and model 1), in terms of Precision, Recall, and F1 score. In comparison with the other neural models, our model with multihead attention also performs best.
4.4. Conclusion and Future Work
In this paper, we propose a neural model for open relation extraction from patents. Instead of employing feature engineering, we use a hybrid Bi-LSTM+CNN+CRF neural model with a multihead attention mechanism. The hybrid model outperforms each of the single models on our self-constructed patent sequence tagging dataset. In the future, we will consider incorporating a Transformer-based model, such as Bidirectional Encoder Representation from Transformers (BERT), into our model, and we will also consider patent-domain word embeddings, which we think could further improve performance.
The dataset used to support the findings of this study has not been made available because it also forms part of an ongoing study.
Conflicts of Interest
The author declares that they have no conflicts of interest.
C. J. Fall, A. Törcsvári, K. Benzineb, and G. Karetka, “Automated categorization in the international patent classification,” ACM SIGIR Forum, vol. 37, no. 1, pp. 10–25, 2003.
X. Lv, X. Lv, X. You, Z. Dong, and J. Han, “Relation extraction toward patent domain based on keyword strategy and attention+BiLSTM model,” in Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2019, vol. 292 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pp. 408–416, Springer, 2019.
S.-Y. L. Shi-Yao, S.-N. Lin, C.-F. Lee, S.-L. Cheng, and V.-W. Soo, “Automatic extraction of semantic relations from patent claims,” International Journal of Electronic Business Management, vol. 6, no. 1, pp. 45–54, 2008.
L. Andersson, M. L. J. Pallotti, F. Piroi, A. Hanbury, and A. Rauber, in Proceedings of the First International Workshop on Patent Mining and Its Applications (IPAMIN), Hildesheim, Germany, 2014.
N. Bouayad-Agha, G. Casamayor, G. Ferraro, S. Mille, V. Vidal, and L. Wanner, Improving the Comprehension of Legal Documentation: The Case of Patent Claims, ICAIL, Barcelona, Spain., 2009.
G. Ferraro and L. Wanner, “Labeling semantically motivated clusters of verbal relations,” Procesamiento del Lenguaje Natural, vol. 49, pp. 129–138, 2012.
G. Ferraro, H. Suominen, and J. Nualart, “Segmentation of patent claims for improving their readability,” in Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), pp. 66–73, Gothenburg, Sweden, 2014.
J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré, “Incremental knowledge base construction using DeepDive,” Proceedings of the VLDB Endowment, vol. 8, no. 11, 2015.
M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09, pp. 1003–1011, Singapore, 2009.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, “Open information extraction from the web,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 355–366, Hyderabad, India, 2007.
L. Del Corro and R. Gemulla, ClausIE: Clause-Based Open Information Extraction, ACM, Rio de Janeiro, Brazil, 2013.
G. Stanovsky, J. Michael, L. Zettlemoyer, and I. Dagan, “Supervised open information extraction,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 885–895, Melbourne, Australia ACL, 2018.
A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng, “Improving word representations via global context and multiple word prototypes,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 873–882, Minneapolis, MIN, USA, 2012.
D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao, “Relation classification via convolutional deep neural network,” in 25th International Conference on Computational Linguistics COLING 2014, pp. 2335–2344, Long Beach City, CA, USA, 2014.
P. Zhou, W. Shi, J. Tian et al., “Attention-based bidirectional long short-term memory networks for relation classification,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 207–212, Berlin, Germany, 2016.
J. Cheng, L. Dong, and M. Lapata, “Long short-term memory-networks for machine reading,” EMNLP, vol. 2016, pp. 551–561, 2016.
D. Zhang and W. Dong, “Relation classification: CNN or RNN? NLPCC-ICCPOL 2016,” LNAI, vol. 10102, pp. 665–675, 2016.