Abstract

Keyword extraction refers to the process of selecting the most significant, relevant, and descriptive terms present inside a single document as keywords. Keyword extraction has major applications in the information retrieval domain, such as the analysis, summarization, indexing, and search of documents. In this paper, we present a novel supervised technique for the extraction of keywords from medium-sized documents, namely, Corpus-based Contextual Semantic Smoothing (CCSS). CCSS extends the concept of Contextual Semantic Smoothing (CSS), which considers term usage patterns in similar texts to improve term relevance information. We introduce four more features beyond CSS as our novel contributions in this work. We systematically compare the performance of CCSS with other techniques over the INSPEC dataset, where CCSS outperforms all state-of-the-art keyphrase extraction techniques presented in the literature.

1. Introduction

Keyword extraction can be defined as the process of selecting the most significant, relevant, and descriptive terms present inside a single document as keywords, where “terms” refer to distinct n-grams of any size. Keywords represent distinguished and specialized concepts and convey the informational content of a document. Keyword extraction has major applications in the information retrieval domain, such as summarization [1, 2], indexing [3], search [4], tagging [5, 6], contextual advertising [7, 8], and personalized recommendation [9].

Documents can generally be classified into long-, medium-, and short-sized documents, where webpages, news articles, and research papers represent long-sized documents, research papers’ abstracts, emails, and question-and-answer conversations characterize medium-sized documents, while microposts and Short Message Service (SMS) texts denote short-sized documents. Each type of document possesses unique characteristics and challenges that need to be dealt with before any keyword extraction technique can be successfully applied to it. Long-sized documents involve a large vocabulary, medium-sized documents suffer from a lack of context, while short-sized documents pose challenges related to a low signal-to-noise ratio, extensive preprocessing, and multivaried text composition [10].

Replacing author-assigned keywords in research papers’ abstracts, topic identification of emails, and topic recommendation for question-and-answer conversations are a few significant applications of keyword extraction from medium-sized documents in the real world.

A research paper abstract can provide a user with a summary of the respective research article when the user lacks access to the latter. Hence, keywords extracted from research abstracts can stand in for the ones extracted from the respective research articles. Also, research papers contain keywords that are manually tagged by their respective authors. Manually tagged keywords carry a bias that helps the respective research papers appear in top results when users search with those index terms. This can be observed in the examples of ACM (https://www.acm.org/) and IEEE (https://www.ieee.org/), which are leading research organizations in the domains of Computer Science and Engineering, respectively, and hence together cover a majority of authors in these domains. For ACM, authors need to provide Computing Classification System (CCS) (https://dl.acm.org/ccs) Concepts that are defined by ACM, as well as keywords that are defined by the authors. For IEEE, authors need to provide their own defined keywords as index terms. With automatic selection, keywords or index terms should exclude the associated bias in the search process up to a certain level (see Section 4.2).

Keyword extraction has been performed in the literature on all types of documents (long-sized [11], medium-sized [12], and short-sized [13]), utilizing various techniques. Keyword extraction techniques developed so far have been either supervised [14] or unsupervised [15]. Unsupervised techniques can be used on multiple document collections without the need for costly and time-consuming prior labeling. On the other hand, although supervised techniques require periodic training on human-labeled document collections, they can be more accurate [16, 17].

In this paper, we present a novel supervised technique for the extraction of keywords from medium-sized documents, namely, Corpus-based Contextual Semantic Smoothing (CCSS). CCSS extends the concept of Contextual Semantic Smoothing (CSS) [10], which considers term usage patterns in similar texts to improve term relevance information for short-sized documents. In fact, CSS performs smoothing of the TFIDF matrix using a semantic feature, namely, the Phi coefficient, while taking the corpus context into consideration. We introduce four more features beyond CSS as our novel contributions in this work in order to handle further challenges associated with medium-sized documents.

2. Related Work

PageRank is a graph-based unsupervised language-independent ranking algorithm, presented by Page et al. [18], which uses link information to iteratively assign global importance scores to webpages. PageRank is based upon the principle: “A vertex is important if there are other important vertices pointing to it,” which can be regarded as voting or recommendation among vertices. In PageRank for keyword extraction, the ranking score of a candidate keyword is computed by summing up the ranking scores of all unigrams within the keyword [19–21]. Then, candidate keywords are ranked in descending order of their ranking scores, and the top candidates are selected as keywords.
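To make the scoring step concrete, the following is a minimal sketch of PageRank-based keyword ranking over a word cooccurrence graph; the window size, graph construction, and toy input are illustrative choices rather than those of any particular cited method.

```python
import itertools
import networkx as nx

def pagerank_keywords(tokens, window=2, top_n=5):
    """Score unigrams with PageRank over a word cooccurrence graph."""
    graph = nx.Graph()
    for i in range(len(tokens) - window + 1):
        for u, v in itertools.combinations(tokens[i:i + window], 2):
            if u != v:
                graph.add_edge(u, v)
    scores = nx.pagerank(graph, alpha=0.85)  # damping factor as in [18]
    # A candidate keyphrase is scored by summing its unigrams' scores.
    def phrase_score(phrase):
        return sum(scores.get(w, 0.0) for w in phrase.split())
    return sorted(set(tokens), key=phrase_score, reverse=True)[:top_n]

tokens = "keyword extraction ranks candidate keyword terms in a document".split()
print(pagerank_keywords(tokens))
```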

Various methods have been proposed in the literature to infer the latent topics of words and documents. These methods are known as latent topic models; they derive latent topics from a large-scale document collection according to word occurrence information. Latent Dirichlet Allocation (LDA), developed by Blei et al. [22], is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It is representative of the latent topic models, embeds unsupervised learning, is feasible for inference, and can reduce the risk of overfitting.
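As a brief illustration, the following is a minimal sketch of inferring per-document topic distributions with LDA via scikit-learn; the toy corpus and the number of topics are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "keyword extraction from research paper abstracts",
    "graph based ranking of words in documents",
    "topic models explain word occurrence in documents",
]
X = CountVectorizer().fit_transform(docs)  # word occurrence counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-document topic distributions
```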

Topical PageRank (TPR), proposed by Liu et al. [23], is based upon PageRank [18] and measures the importance of a word with respect to different topics. Given the topic distribution of a document, ranking scores of words are calculated with respect to those topics, and the top-ranked words for each topic are extracted as its keywords, thus resulting in a good coverage of the document’s major topics. TPR combines the advantages of both LDA and TFIDF/PageRank by utilizing both external topic information (like LDA) and internal document structure (like TFIDF/PageRank).
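The topic-biased ranking at the heart of TPR can be sketched with a personalized PageRank, where the preference vector is set from a topic’s word distribution so that random jumps favor topical words; the graph and topic weights below are hypothetical.

```python
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([("topic", "model"), ("model", "word"),
                      ("word", "document"), ("topic", "word")])
# Hypothetical P(word | topic) values, e.g., taken from an LDA model.
topic_word = {"topic": 0.5, "model": 0.3, "word": 0.15, "document": 0.05}
scores = nx.pagerank(graph, alpha=0.85, personalization=topic_word)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```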

Liu et al. [24] devised an unsupervised technique for keyword extraction, which first finds exemplar terms in a document by leveraging clustering and semantic relatedness, which guarantees that the document is semantically covered by these exemplar terms (the centroids of the clusters). Then, keywords are extracted from the document using these exemplar terms. The technique incorporates term cooccurrence information and considers only Noun Phrases as keyword candidates.

Tsatsaronis et al. [25] designed SemanticRank, which is again based upon PageRank [18] but ranks both keywords and sentences in a document based on their respective relevance to it. The technique constructs a semantic graph using terms as nodes and their implicit links as edges, utilizing the Omiotis similarity measure, with WordNet and Wikipedia as knowledge bases, alongside statistical information.

3. Corpus-Based Contextual Semantic Smoothing for Medium-Sized Documents

Given a collection of medium-sized documents and domain-specific information (stopword and standardization lists), a keyword extraction technique outputs the top keywords from a document $d$. We divide our methodology into two phases, namely, keyword extraction (unigrams) and keyphrase extraction ($n$-grams, where $n \geq 2$). First, all experiments were conducted to optimize the process of keyword extraction, and then the parameters were revisited to optimize the process of keyphrase extraction.

Corpus-based Contextual Semantic Smoothing (CCSS, see Figure 1) extends the concept of Contextual Semantic Smoothing (CSS) [10], which considers term usage patterns in similar texts to improve term relevance information. In fact, CSS performs smoothing of the TFIDF matrix using a semantic feature, namely, the Phi coefficient, while taking the corpus context into consideration.

3.1. Parts of Speech Tagging

In the literature, different combinations of Parts of Speech (POS) have been employed in order to filter unlikely keywords from a document, as presented in Table 1.

As the first feature, we experimented with various combinations of POS (including some mentioned in Table 1) and selected the combination that considers all POS except Modal Verbs as candidate keywords in a document. Modal Verbs are auxiliary verbs, such as “can” or “will,” which are used to express modality. This combination of POS has not been used in the literature before, as is evident from Table 1. Experiments related to this feature are presented in Section 5.1.
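To make this filter concrete, the following is a minimal sketch using NLTK’s default POS tagger; the tagger choice and whitespace tokenization are illustrative assumptions, not part of the original pipeline.

```python
import nltk

# The perceptron tagger resource name may vary across NLTK versions;
# "averaged_perceptron_tagger" works for most releases.
nltk.download("averaged_perceptron_tagger", quiet=True)

def candidates_without_modals(text):
    # Penn Treebank tags modal verbs as "MD"; all other POS are kept.
    tagged = nltk.pos_tag(text.split())
    return [word for word, tag in tagged if tag != "MD"]

print(candidates_without_modals("Keywords can summarize a document"))
# "can" (tagged MD) is dropped; all other tokens remain candidates.
```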

3.2. Labeling Corpus

As the second feature, we utilized a corpus consisting only of labels. We state our hypothesis as follows: “A term should be considered as a candidate keyword in a document if it has been assigned as a label at least once in the labeling corpus.” We acquired the INSPEC (https://www.theiet.org/publishing/inspec/) and ACM (https://doc.novay.nl/dsweb/Get/Document-115737/ACM-URLs.txt) collections to combine all of their labels into a single corpus. Both the INSPEC and ACM collections consist of English abstracts from scientific papers. Further details about the corpora are provided in Section 4.1. We experimented with various frequencies of terms assigned as labels in the labeling corpus and finally found our hypothesis to be true. In the literature, corpora have been utilized as a feature [1, 28, 39–44]; however, both a labeling corpus in general and this combination of corpora in particular have not been used earlier. Experiments related to this feature are presented in Section 5.2.
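As an illustration of this filter, the following is a minimal sketch, assuming `label_corpus` is an iterable of indexer- or author-assigned keyword strings drawn from the INSPEC and ACM collections; the data shown here are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the combined INSPEC + ACM label corpus.
label_corpus = ["information retrieval", "keyword extraction",
                "keyword extraction", "topic models"]
label_freq = Counter(term for label in label_corpus
                     for term in label.split())

def is_candidate(term, min_freq=1):
    # Keep a term only if it was assigned as a label at least min_freq times.
    return label_freq[term] >= min_freq

print(is_candidate("keyword"), is_candidate("banana"))  # True False
```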

3.3. Ratio Metric

As the third feature, we introduced a novel metric for each term’s eligibility for being a candidate keyword:

$$\mathrm{ratio}(t) = \frac{f_{D}(t)}{f_{L}(t)} < x,$$

where $f_{D}(t)$ represents the frequency of term $t$ in the source document under consideration, $f_{L}(t)$ represents the frequency of $t$ in the labeling corpus under consideration, and $x$ represents a threshold value under which the ratio of $f_{D}(t)$ and $f_{L}(t)$ should remain in order for $t$ to be considered as a candidate keyword. The motivation behind developing this metric was to retain as candidate keywords only those terms for which $\mathrm{ratio}(t) < x$. Experiments related to this feature are presented in Section 5.3.
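To illustrate, the following is a minimal sketch of applying the ratio metric as a candidate filter, using the symbols defined above; the frequency values are hypothetical.

```python
def passes_ratio(f_d, f_l, x=5):
    """Candidate iff the ratio f_D(t) / f_L(t) stays below threshold x."""
    return f_l > 0 and f_d / f_l < x

print(passes_ratio(f_d=2, f_l=1))  # True:  2 / 1 < 5
print(passes_ratio(f_d=7, f_l=1))  # False: 7 / 1 >= 5
```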

3.4. Keyphrase Extraction

Once we had identified the significant keywords in the first phase, we moved on to forming significant keyphrases in the second phase, through four different combinations of the two phases.

First, we considered the simplest way, where all adjacently located keywords in $d$ were utilized to form keyphrases.

Second, for all adjacently located keywords in each $d$, we selected the Top-$n$ keyphrases from them as significant keyphrases, where the integer $n$ is tuned experimentally, in order to take into consideration the varying sizes of documents.

Third, similar to selecting the Top-$n$ keyphrases in each $d$ as significant keyphrases, we revisited and improved the keyword extraction process by selecting the Top-$m$ keywords in each $d$ as its significant keywords, where the integer $m$ is tuned experimentally, and then selecting all adjacently located keywords in each $d$ as significant keyphrases.

Fourth, we first selected the Top-$m$ keywords in each $d$ as its significant keywords (with the same value of $m$ as resulted from the third combination of the two phases), and then selected the Top-$n$ keyphrases in each $d$ as significant keyphrases. In the literature, keywords have been selected using the Top-$m$ metric; however, the process of selecting the Top-$n$ keyphrases after the Top-$m$ keywords have been selected has not been proposed earlier. Experiments related to the combinations of the two phases are presented in Section 5.4.
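The following is a minimal sketch of the fourth combination, assuming precomputed keyword scores; the scores and the values of $m$ and $n$ are hypothetical.

```python
def top_keyphrases(tokens, scores, m=3, n=2):
    """Top-m keywords -> merge adjacent keywords -> Top-n keyphrases."""
    keywords = set(sorted(scores, key=scores.get, reverse=True)[:m])
    phrases, run = [], []
    for tok in tokens:
        if tok in keywords:
            run.append(tok)              # extend a run of adjacent keywords
        elif run:
            phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    # Rank keyphrases by the summed scores of their unigrams; keep Top-n.
    phrases.sort(key=lambda p: sum(scores[w] for w in p.split()),
                 reverse=True)
    return phrases[:n]

tokens = "semantic smoothing improves keyword extraction results".split()
scores = {"semantic": 0.9, "smoothing": 0.8, "keyword": 0.7,
          "extraction": 0.4, "improves": 0.2, "results": 0.1}
print(top_keyphrases(tokens, scores))  # ['semantic smoothing', 'keyword']
```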

4. Data Analysis and Experimental Setup

4.1. Data Analysis

The INSPEC dataset contains English abstracts of journal papers from the disciplines of Computers and Control and Information Technology, published from 1998 to 2002, and is a collection of 2,000 documents. The keywords assigned by a professional indexer may or may not be present in the abstracts; however, the indexers had access to the full-length documents when assigning the keywords. The abstracts in this dataset contain two sections, Title and Abstract, while in this work our focus is on the Abstract section only. All experiments presented in Section 5 have been conducted over this dataset.

The ACM dataset contains English abstracts of journal, conference, and workshop papers published by ACM in four domains of Computer Science: Distributed Systems, Information Search and Retrieval, Learning, and Social and Behavioral Sciences. This dataset has only been used to create a labeling corpus (see Section 3.2).

4.2. Experimental Setup

The following evaluation metrics will be employed in the experiments:

(i) Precision is the fraction of relevant instances among the retrieved instances:

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

where $TP$ and $FP$ denote true positives and false positives, respectively.

(ii) Recall is the fraction of relevant instances that were retrieved:

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$

where $FN$ denotes false negatives.

(iii) F-measure is the harmonic mean of Precision and Recall:

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
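For concreteness, a minimal sketch of computing these metrics for an extracted keyword set against a gold-standard set follows; the sets are toy data.

```python
def prf(extracted, gold):
    """Precision, Recall, and F-measure for keyword sets."""
    tp = len(extracted & gold)                  # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

extracted = {"semantic smoothing", "keyword", "corpus"}
gold = {"semantic smoothing", "keyword extraction"}
print(prf(extracted, gold))  # (0.333..., 0.5, 0.4)
```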

5. Experimental Results and Discussion

We conduct the experiments in the same sequence as presented in Section 3.

5.1. POS Tagging

As discussed in Section 3.1, Table 2 presents the different combinations of POS experimented with for the task of keyword extraction.

Here, NO = Nouns, AD = Adjectives, F = Foreign Words, I = Irrelevant Terms, V = Verbs, NU = Numbers, G = Genitive Markers, AG = Agents, and MV = Modal Verbs.

Foreign Words include non-English words, and Irrelevant Terms are represented by a union of all prepositions, conjunctions, determiners, possessive pronouns, particles, adverbs, and interjections [45], while Genitive Markers show ownership, measurement, association, or source, e.g., “boy’s” and “of the boy.”

Although the combination of POS selected for our methodology ranks fourth among the different ones experimented with, we avoided, for obvious reasons, those combinations that included either foreign words or irrelevant terms.

5.2. Labeling Corpus

As discussed in Section 3.2, Table 3 displays the various frequencies of terms assigned as labels in the labeling corpus, which were experimented with for the task of keyword extraction.

5.3. Ratio Metric

As discussed in Section 3.3, we experimented with different threshold values for $x$ for the task of keyword extraction and found $x = 5$ to be the optimal value in terms of F-measure, as mentioned in Table 4.

All results related to the different stages of the keyword extraction process are summarized in Table 5, as discussed in Sections 3.1–3.3.

Here, F1, F2, and F3 represent the POS Tagging, Labeling Corpus, and Ratio Metric features, respectively.

5.4. Keyphrase Extraction

As discussed in Section 3, the optimal values yielded for the first three features for the process of keyword extraction were then revisited to yield the optimal values for the process of keyphrase extraction. Although the same optimal values were yielded for the first two features, the Ratio Metric feature produced an optimal value at $x = 8$, as mentioned in Table 6 and also reflected in Tables 4 and 5.

This is the simplest combination of the keyword extraction and keyphrase extraction processes, where all adjacently located keywords in $d$ were utilized to form keyphrases.

As discussed in Section 3.4, for our second combination of the keyword extraction and keyphrase extraction processes, we experimented with different values for $n$ and found $n = 55$ to be the optimal value in terms of F-measure, as mentioned in Table 7.

As discussed in Section 3.4, for our third combination of the keyword extraction and keyphrase extraction processes, we experimented with different values for $m$ and found $m = 59$ to be the optimal value in terms of F-measure, as mentioned in Table 8.

As discussed in Section 3.4, for our fourth combination of the keyword extraction and keyphrase extraction processes, we experimented with different values for $n$ and found $n = 55$ to be the optimal value in terms of F-measure, as mentioned in Table 9.

All results related to different combinations of keyword extraction and keyphrase extraction processes are summarized in Table 10, as discussed in Section 3.4.

5.5. CCSS vs. State-of-the-Art Techniques

We systematically compared the performance of CCSS with the other techniques presented in the literature, when implemented over the INSPEC dataset; this analysis is presented in Table 11. It is clear that CCSS outperforms all state-of-the-art keyphrase extraction techniques presented in the literature.

6. Conclusion and Future Work

In this paper, we have presented a novel supervised technique for the extraction of keywords from medium-sized documents, namely, Corpus-based Contextual Semantic Smoothing (CCSS). CCSS extends the concept of Contextual Semantic Smoothing (CSS), which considers term usage patterns in similar texts to improve term relevance information. We introduced four more features beyond CSS as our novel contributions in this work. We systematically compared the performance of CCSS with other techniques over the INSPEC dataset, where CCSS clearly outperformed all state-of-the-art keyphrase extraction techniques presented in the literature.

Our future work includes utilizing CCSS in the applications of indexing and search, summarization, and multilingual summarization of medium-sized documents. We are also currently compiling a literature review of all keyword extraction-based applications, including and beyond the abovementioned ones.

Data Availability

Previously reported INSPEC and ACM datasets were used to support this study and are available at https://www.theiet.org/publishing/inspec/ and https://www.innovalor.nl/, respectively. The datasets used in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors gratefully acknowledge both Mohammad Ali Jinnah University (MAJU), Karachi, Pakistan, and the Deanship of Research, Islamic University of Madinah, Madinah, Kingdom of Saudi Arabia, for the support provided for this research. This research was funded by the Deanship of Research, Islamic University of Madinah, Madinah, Kingdom of Saudi Arabia.