Research Article

PQPS: Prior-Art Query-Based Patent Summarizer Using RBM and Bi-LSTM

Table 2

Patent features for RBM-EPS.

Feature with description

Title and search query similarity: the patent title usually reflects the innovation or main theme of the patent. This similarity feature helps to retain the sentences that are relevant to the patent title and to the search query provided by the patent analyst. To compute it, both co-occurrence-based [40] and similarity-based features are incorporated.
The feature is defined in terms of the set of words in the title, the set of words in the search query, the set of words in the sentence, the title length, and the search query length; the two similarity terms are computed using cosine similarity.
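A minimal Python sketch of how such scores could be computed for one sentence is given here. It assumes whitespace tokenization, a co-occurrence term taken as the share of title (or query) words that also occur in the sentence, and cosine similarity over raw term counts. The function names and the exact form of the overlap term are illustrative assumptions, not the table's own definitions.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def overlap_ratio(sentence_tokens, reference_tokens):
    """Share of distinct reference words that also occur in the sentence.

    Assumed form of the co-occurrence term (hypothetical).
    """
    reference = set(reference_tokens)
    if not reference:
        return 0.0
    return len(reference & set(sentence_tokens)) / len(reference)


def title_query_features(sentence, title, query):
    """Return four illustrative scores for one sentence."""
    s, t, q = (text.lower().split() for text in (sentence, title, query))
    return {
        "title_overlap": overlap_ratio(s, t),
        "query_overlap": overlap_ratio(s, q),
        "title_cosine": cosine_similarity(s, t),
        "query_cosine": cosine_similarity(s, q),
    }
```

How the four scores are combined into a single feature value is left open here, since the table does not state it in recoverable form.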

Sentence field position: the first and last sentences in a paragraph or section provide meaningful and prominent information to the reader [68], and they are therefore likely to be part of the summary.

Term frequency-inverse field frequency (TF-IFF): this measure identifies sentences that contain important noun phrases of the form (JJNN.+IN)?JJNN.∗+ in a particular field of the document. Here JJ denotes an adjective, NN denotes a noun, and IN denotes a preposition or subordinating conjunction. The measure is defined in terms of the term frequency of term t in the field, the inverse field frequency, the number of occurrences of the term in the field, the size of the field, the number of fields in the document, and the field frequency.
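The following Python sketch shows one plausible reading of TF-IFF, by analogy with TF-IDF: term frequency is taken as the number of occurrences of the term in the field divided by the field size, and inverse field frequency as the logarithm of the number of fields divided by the number of fields containing the term. Both forms are assumptions. The function name `tf_iff` and the `fields` dictionary are hypothetical, and single tokens stand in for the extracted noun phrases.

```python
import math


def tf_iff(term, field_name, fields):
    """TF-IFF of `term` in one field (assumed TF-IDF-style form).

    `fields` maps a field name (e.g. "abstract", "claims") to its token list.
    """
    tokens = fields[field_name]
    if not tokens:
        return 0.0
    # Term frequency: occurrences in the field, normalized by field size.
    tf = tokens.count(term) / len(tokens)
    # Field frequency: number of fields in which the term occurs.
    field_frequency = sum(1 for toks in fields.values() if term in toks)
    if field_frequency == 0:
        return 0.0
    # Inverse field frequency: log of (number of fields / field frequency).
    iff = math.log(len(fields) / field_frequency)
    return tf * iff
```

Under this reading, a term that occurs in every field receives a score of 0, so the measure favors terms that are specific to a field.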

Term frequency-inverse concept frequency (TF-ICF): this is similar to TF-IFF, but it measures the importance of noun phrases with respect to the concepts in the smart device ontology hierarchy rather than per field. A term in the corpus may not belong to any concept; in that case, its ICF value is set to 1.
The measure is defined in terms of the total number of concepts in the smart device ontology and the concept frequency with respect to the term.
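A short Python sketch of the ICF part, including the fallback value of 1 for terms outside the ontology, could look as follows. The logarithmic form is an assumption made by analogy with inverse document frequency. The `concepts` mapping and both function names are hypothetical, and the term frequency is passed in precomputed because its exact definition is not recoverable from the table.

```python
import math


def icf(term, concepts):
    """Inverse concept frequency of `term`.

    `concepts` maps a concept name to the set of terms attached to it.
    Returns 1.0 when the term belongs to no concept.
    """
    concept_frequency = sum(1 for terms in concepts.values() if term in terms)
    if concept_frequency == 0:
        return 1.0  # fallback value for terms outside the ontology
    return math.log(len(concepts) / concept_frequency)


def tf_icf(term_frequency, term, concepts):
    """Combine a precomputed term frequency with the ICF value."""
    return term_frequency * icf(term, concepts)
```

With the fallback in place, a term that is absent from the ontology keeps its plain term frequency as its score.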

Sentence length: sentences in patents are generally long. This feature is included to exclude very short and very long sentences from the summary; it favors sentences whose length is close to the mean sentence length in the document.
The score is defined in terms of the number of terms in the sentence, the mean sentence length over all sentences in the document, and the standard deviation of the sentence lengths.
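The exact scoring formula is not recoverable from the table, so the Python sketch below uses a purely illustrative one: a sentence scores 1 at the mean length and falls linearly to 0 at one standard deviation away. The function name `length_scores` and this particular shape are assumptions, chosen only to show how the three quantities can interact.

```python
import statistics


def length_scores(sentences):
    """Score each sentence by how close its length is to the document mean.

    Illustrative form: 1 - |length - mean| / std, clipped at 0.
    """
    if not sentences:
        return []
    lengths = [len(s.split()) for s in sentences]
    mean = statistics.mean(lengths)
    std = statistics.pstdev(lengths)  # population standard deviation
    if std == 0:
        # All sentences have the same length, so none is penalized.
        return [1.0] * len(lengths)
    return [max(0.0, 1.0 - abs(n - mean) / std) for n in lengths]
```

A smoother alternative, such as a Gaussian weighting around the mean, would serve the same purpose of penalizing both extremes.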

Sentence centrality: this measure retains sentences that are similar to the other sentences in the document. It is computed over unigrams and bigrams, and the average of the two scores is taken as the feature score.
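One common way to compute such a centrality score is the n-gram overlap between a sentence and the rest of the document, divided by the size of their union. The Python sketch below applies that definition to unigrams and bigrams and averages the two values. The overlap formula and the function names are assumptions, since the table does not give the formula.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def centrality_scores(sentences):
    """Average of unigram and bigram overlap with all other sentences."""
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        per_n = []
        for n in (1, 2):
            own = ngrams(tokens, n)
            # Pool the n-grams of every other sentence in the document.
            others = set()
            for j, other in enumerate(tokenized):
                if j != i:
                    others |= ngrams(other, n)
            union = own | others
            per_n.append(len(own & others) / len(union) if union else 0.0)
        scores.append(sum(per_n) / len(per_n))
    return scores
```

For example, in the three-sentence document "the device has a screen", "the device has a battery", "cats sleep", the first two sentences each score 0.5 and the third scores 0.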

Thematic words: thematic words are words related to the topic of the document, and they occur with higher frequency. Sentences containing these words are considered informative. The top 20 most frequent phrases are chosen as thematic words, and the sentence score is calculated accordingly.
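A minimal Python sketch of this feature follows. It assumes that single whitespace-separated tokens stand in for phrases and that the sentence score is the fraction of the sentence's tokens that are thematic. The normalization and the function name `thematic_scores` are assumptions, and stop-word removal is omitted for brevity.

```python
from collections import Counter


def thematic_scores(sentences, top_k=20):
    """Fraction of each sentence's tokens that are among the top_k frequent tokens."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(tok for toks in tokenized for tok in toks)
    # The top_k most frequent tokens in the document act as thematic words.
    thematic = {word for word, _ in counts.most_common(top_k)}
    return [
        sum(1 for tok in toks if tok in thematic) / len(toks) if toks else 0.0
        for toks in tokenized
    ]
```

In practice, stop words would need to be filtered out before counting, since they would otherwise dominate the most frequent tokens.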