Abstract

Keyword extraction refers to the process of selecting the most significant, relevant, and descriptive terms present inside a single document as keywords. Keyword extraction has major applications in the information retrieval domain, such as the analysis, summarization, indexing, and search of documents. In this paper, we present a novel supervised technique for the extraction of keywords from medium-sized documents, namely, Corpus-based Contextual Semantic Smoothing (CCSS). CCSS extends the concept of Contextual Semantic Smoothing (CSS), which considers term usage patterns in similar texts to improve term relevance information. We introduce four more features beyond CSS as our novel contributions in this work. We systematically compare the performance of CCSS with other techniques over the INSPEC dataset, where CCSS outperforms all state-of-the-art keyphrase extraction techniques presented in the literature.

1. Introduction

Keyword extraction can be defined as the process of selecting the most significant, relevant, and descriptive terms present inside a single document as keywords, where “terms” refer to distinct n-grams of any size. Keywords represent distinguished and specialized concepts and convey the informational content of a document. Keyword extraction has major applications in the information retrieval domain, such as summarization [1, 2], indexing [3], search [4], tagging [5, 6], contextual advertising [7, 8], and personalized recommendation [9].

Documents can generally be classified into long-, medium-, and short-sized documents, where webpages, news articles, and research papers represent long-sized documents, research papers’ abstracts, emails, and question-and-answer conversations characterize medium-sized documents, while microposts and Short Message Service (SMS) texts denote short-sized documents. Each type of document possesses unique characteristics and challenges that need to be dealt with before any keyword extraction technique can be successfully applied to it. Long-sized documents involve a large vocabulary, medium-sized documents suffer from a lack of context, while short-sized documents pose challenges related to a low signal-to-noise ratio, extensive preprocessing, and multivaried text composition [10].

Replacing author-assigned keywords in research papers’ abstracts, topic identification of emails, and topic recommendation for question-and-answer conversations are a few significant applications of keyword extraction from medium-sized documents in the real world.

A research paper abstract can provide a user with a summary of the respective research article when the user lacks access to the latter. Hence, keywords extracted from research abstracts can stand in for the ones extracted from the respective research articles. Also, research papers contain keywords that are manually tagged by their respective authors. Manually tagged keywords carry a bias that helps the respective research papers appear in top results when users search with those index terms. This can be observed in the examples of ACM (https://www.acm.org/) and IEEE (https://www.ieee.org/), which are leading research organizations in the domains of Computer Science and Engineering, respectively, and hence together cover a majority of authors in these domains. For ACM, authors need to provide Computing Classification System (CCS) (https://dl.acm.org/ccs) Concepts that are defined by ACM, as well as keywords that are defined by the authors. For IEEE, authors need to provide their own defined keywords as index terms. With automatic selection, keywords or index terms should exclude the associated bias in the search process up to a certain level (see Section 4.2).

Keyword extraction has been performed in the literature on all types of documents (long-sized [11], medium-sized [12], and short-sized [13]), utilizing various techniques. Keyword extraction techniques developed so far have been either supervised [14] or unsupervised [15]. Unsupervised techniques can be used on multiple document collections without the need for costly and time-consuming prior labeling. On the other hand, although supervised techniques require periodic training on human-labeled document collections, they can be more accurate [16, 17].

In this paper, we present a novel supervised technique for the extraction of keywords from medium-sized documents, namely, Corpus-based Contextual Semantic Smoothing (CCSS). CCSS extends the concept of Contextual Semantic Smoothing (CSS) [10], which considers term usage patterns in similar texts to improve term relevance information for short-sized documents. In fact, CSS performs smoothing of the TFIDF matrix using a semantic feature, namely, the Phi coefficient, while taking the corpus context into consideration. We introduce four more features beyond CSS as our novel contributions in this work in order to handle further challenges associated with medium-sized documents.

2. Related Work

PageRank is a graph-based unsupervised language-independent ranking algorithm, presented by Page et al. [18], which uses link information to iteratively assign global importance scores to webpages. PageRank is based upon the principle: “A vertex is important if there are other important vertices pointing to it,” which can be regarded as voting or recommendation among vertices. In PageRank for keyword extraction, the ranking score of a candidate keyword is computed by summing up the ranking scores of all unigrams within the keyword [19–21]. Then, candidate keywords are ranked in descending order of their ranking scores, and the top candidates are selected as keywords.
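To make the scoring step concrete, the following is a minimal sketch of PageRank-based keyword ranking over a word cooccurrence graph; the window size, graph construction, and toy input are illustrative choices rather than those of any particular cited method.

```python
import itertools
import networkx as nx

def pagerank_keywords(tokens, window=2, top_n=5):
    """Score unigrams with PageRank over a word cooccurrence graph."""
    graph = nx.Graph()
    for i in range(len(tokens) - window + 1):
        for u, v in itertools.combinations(tokens[i:i + window], 2):
            if u != v:
                graph.add_edge(u, v)
    scores = nx.pagerank(graph, alpha=0.85)  # damping factor as in [18]
    # A candidate keyphrase is scored by summing its unigrams' scores.
    def phrase_score(phrase):
        return sum(scores.get(w, 0.0) for w in phrase.split())
    return sorted(set(tokens), key=phrase_score, reverse=True)[:top_n]

tokens = "keyword extraction ranks candidate keyword terms in a document".split()
print(pagerank_keywords(tokens))
```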

Various methods have been proposed in the literature to infer the latent topics of words and documents. These methods are known as latent topic models; they derive latent topics from a large-scale document collection according to word occurrence information. Latent Dirichlet Allocation (LDA), developed by Blei et al. [22], is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It is representative of the latent topic models, embeds unsupervised learning, is feasible for inference, and can reduce the risk of overfitting.
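As a brief illustration, the following is a minimal sketch of inferring per-document topic distributions with LDA via scikit-learn; the toy corpus and the number of topics are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "keyword extraction from research paper abstracts",
    "graph based ranking of words in documents",
    "topic models explain word occurrence in documents",
]
X = CountVectorizer().fit_transform(docs)  # word occurrence counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-document topic distributions
```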

Topical PageRank (TPR), proposed by Liu et al. [23], is based upon PageRank [18] and measures the importance of a word with respect to different topics. Given the topic distribution of a document, ranking scores of words are calculated with respect to those topics, and the top-ranked words for each topic are extracted as its keywords, thus resulting in a good coverage of the document’s major topics. TPR combines the advantages of both LDA and TFIDF/PageRank by utilizing both external topic information (like LDA) and internal document structure (like TFIDF/PageRank).
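The topic-biased ranking at the heart of TPR can be sketched with a personalized PageRank, where the preference vector is set from a topic’s word distribution so that random jumps favor topical words; the graph and topic weights below are hypothetical.

```python
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([("topic", "model"), ("model", "word"),
                      ("word", "document"), ("topic", "word")])
# Hypothetical P(word | topic) values, e.g., taken from an LDA model.
topic_word = {"topic": 0.5, "model": 0.3, "word": 0.15, "document": 0.05}
scores = nx.pagerank(graph, alpha=0.85, personalization=topic_word)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```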

Liu et al. [24] devised an unsupervised technique for keyword extraction, which first finds exemplar terms in a document by leveraging clustering and semantic relatedness, which guarantees that the document is semantically covered by these exemplar terms (the centroids of the clusters). Then, keywords are extracted from the document using these exemplar terms. The technique incorporates term cooccurrence information and considers only Noun Phrases as keyword candidates.

Tsatsaronis et al. [25] designed SemanticRank, which is again based upon PageRank [18] but ranks both keywords and sentences in a document based on their respective relevance to it. The technique constructs a semantic graph using terms as nodes and their implicit links as edges, utilizing the Omiotis similarity measure, with WordNet and Wikipedia as knowledge bases, alongside statistical information.

3. Corpus-Based Contextual Semantic Smoothing for Medium-Sized Documents

Given a collection of medium-sized documents and domain-specific information (stopword and standardization lists), a keyword extraction technique outputs the top keywords from a document $d$. We divide our methodology into two phases, namely, keyword extraction (unigrams) and keyphrase extraction ($n$-grams, where $n \geq 2$). First, all experiments were conducted to optimize the process of keyword extraction, and then the parameters were revisited to optimize the process of keyphrase extraction.

Corpus-based Contextual Semantic Smoothing (CCSS, see Figure 1) extends the concept of Contextual Semantic Smoothing (CSS) [10], which considers term usage patterns in similar texts to improve term relevance information. In fact, CSS performs smoothing of the TFIDF matrix using a semantic feature, namely, the Phi coefficient, while taking the corpus context into consideration.

3.1. Parts of Speech Tagging

In the literature, different combinations of Parts of Speech (POS) have been employed in order to filter unlikely keywords from a document, as presented in Table 1.

As the first feature, we experimented with various combinations of POS (including some mentioned in Table 1) and selected the combination that considers all POS except Modal Verbs as candidate keywords in a document. Modal Verbs are auxiliary verbs, such as “can” or “will,” which are used to express modality. This combination of POS has not been used in the literature before, as is evident from Table 1. Experiments related to this feature are presented in Section 5.1.
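To make this filter concrete, the following is a minimal sketch using NLTK’s default POS tagger; the tagger choice and whitespace tokenization are illustrative assumptions, not part of the original pipeline.

```python
import nltk

# The perceptron tagger resource name may vary across NLTK versions;
# "averaged_perceptron_tagger" works for most releases.
nltk.download("averaged_perceptron_tagger", quiet=True)

def candidates_without_modals(text):
    # Penn Treebank tags modal verbs as "MD"; all other POS are kept.
    tagged = nltk.pos_tag(text.split())
    return [word for word, tag in tagged if tag != "MD"]

print(candidates_without_modals("Keywords can summarize a document"))
# "can" (tagged MD) is dropped; all other tokens remain candidates.
```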

3.2. Labeling Corpus

As the second feature, we utilized a corpus consisting only of labels. We state our hypothesis as follows: “A term should be considered as a candidate keyword in a document if it has been assigned as a label at least once in the labeling corpus.” We acquired the INSPEC (https://www.theiet.org/publishing/inspec/) and ACM (https://doc.novay.nl/dsweb/Get/Document-115737/ACM-URLs.txt) collections to combine all of their labels into a single corpus. Both the INSPEC and ACM collections consist of English abstracts from scientific papers. Further details about the corpora are provided in Section 4.1. We experimented with various frequencies of terms assigned as labels in the labeling corpus and finally found our hypothesis to be true. In the literature, corpora have been utilized as a feature [1, 28, 39–44]; however, both a labeling corpus in general and this combination of corpora in particular have not been used earlier. Experiments related to this feature are presented in Section 5.2.
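As an illustration of this filter, the following is a minimal sketch, assuming `label_corpus` is an iterable of indexer- or author-assigned keyword strings drawn from the INSPEC and ACM collections; the data shown here are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the combined INSPEC + ACM label corpus.
label_corpus = ["information retrieval", "keyword extraction",
                "keyword extraction", "topic models"]
label_freq = Counter(term for label in label_corpus
                     for term in label.split())

def is_candidate(term, min_freq=1):
    # Keep a term only if it was assigned as a label at least min_freq times.
    return label_freq[term] >= min_freq

print(is_candidate("keyword"), is_candidate("banana"))  # True False
```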

3.3. Ratio Metric

As the third feature, we introduced a novel metric for each term’s eligibility for being a candidate keyword:

$$\mathrm{ratio}(t) = \frac{f_{D}(t)}{f_{L}(t)} < x,$$

where $f_{D}(t)$ represents the frequency of term $t$ in the source document under consideration, $f_{L}(t)$ represents the frequency of $t$ in the labeling corpus under consideration, and $x$ represents a threshold value under which the ratio of $f_{D}(t)$ and $f_{L}(t)$ should remain in order for $t$ to be considered as a candidate keyword. The motivation behind developing this metric was to retain as candidate keywords only those terms for which $\mathrm{ratio}(t) < x$. Experiments related to this feature are presented in Section 5.3.
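To illustrate, the following is a minimal sketch of applying the ratio metric as a candidate filter, using the symbols defined above; the frequency values are hypothetical.

```python
def passes_ratio(f_d, f_l, x=5):
    """Candidate iff the ratio f_D(t) / f_L(t) stays below threshold x."""
    return f_l > 0 and f_d / f_l < x

print(passes_ratio(f_d=2, f_l=1))  # True:  2 / 1 < 5
print(passes_ratio(f_d=7, f_l=1))  # False: 7 / 1 >= 5
```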

3.4. Keyphrase Extraction

Once we had identified the significant keywords in the first phase, we moved on to forming significant keyphrases in the second phase, through four different combinations of the two phases.

First, we considered the simplest way, where all adjacently located keywords in $d$ were utilized to form keyphrases.

Second, for all adjacently located keywords in each $d$, we selected the Top-$n$ keyphrases from them as significant keyphrases, where the integer $n$ is tuned experimentally, in order to take into consideration the varying sizes of documents.

Third, similar to selecting the Top-$n$ keyphrases in each $d$ as significant keyphrases, we revisited and improved the keyword extraction process by selecting the Top-$m$ keywords in each $d$ as its significant keywords, where the integer $m$ is tuned experimentally, and then selecting all adjacently located keywords in each $d$ as significant keyphrases.

Fourth, we first selected the Top-$m$ keywords in each $d$ as its significant keywords (with the same value of $m$ as resulted from the third combination of the two phases), and then selected the Top-$n$ keyphrases in each $d$ as significant keyphrases. In the literature, keywords have been selected using the Top-$m$ metric; however, the process of selecting the Top-$n$ keyphrases after the Top-$m$ keywords have been selected has not been proposed earlier. Experiments related to the combinations of the two phases are presented in Section 5.4.
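The following is a minimal sketch of the fourth combination, assuming precomputed keyword scores; the scores and the values of $m$ and $n$ are hypothetical.

```python
def top_keyphrases(tokens, scores, m=3, n=2):
    """Top-m keywords -> merge adjacent keywords -> Top-n keyphrases."""
    keywords = set(sorted(scores, key=scores.get, reverse=True)[:m])
    phrases, run = [], []
    for tok in tokens:
        if tok in keywords:
            run.append(tok)              # extend a run of adjacent keywords
        elif run:
            phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    # Rank keyphrases by the summed scores of their unigrams; keep Top-n.
    phrases.sort(key=lambda p: sum(scores[w] for w in p.split()),
                 reverse=True)
    return phrases[:n]

tokens = "semantic smoothing improves keyword extraction results".split()
scores = {"semantic": 0.9, "smoothing": 0.8, "keyword": 0.7,
          "extraction": 0.4, "improves": 0.2, "results": 0.1}
print(top_keyphrases(tokens, scores))  # ['semantic smoothing', 'keyword']
```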

4. Data Analysis and Experimental Setup

4.1. Data Analysis

The INSPEC dataset contains English abstracts of journal papers from the disciplines of Computers and Control and Information Technology, published from 1998 to 2002, and is a collection of 2,000 documents. The keywords assigned by a professional indexer may or may not be present in the abstracts; however, the indexers had access to the full-length documents when assigning the keywords. The abstracts in this dataset contain two sections, Title and Abstract, while in this work our focus is on the Abstract section only. All experiments presented in Section 5 have been conducted over this dataset.

The ACM dataset contains English abstracts of journal, conference, and workshop papers published by ACM in four domains of Computer Science: Distributed Systems, Information Search and Retrieval, Learning, and Social and Behavioral Sciences. This dataset has only been used to create a labeling corpus (see Section 3.2).

4.2. Experimental Setup

The following evaluation metrics will be employed in the experiments:

(i) Precision is the fraction of relevant instances among the retrieved instances:

$$\mathrm{Precision} = \frac{TP}{TP + FP},$$

where $TP$ and $FP$ denote true positives and false positives, respectively.

(ii) Recall is the fraction of relevant instances that were retrieved:

$$\mathrm{Recall} = \frac{TP}{TP + FN},$$

where $FN$ denotes false negatives.

(iii) F-measure is the harmonic mean of Precision and Recall:

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
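For concreteness, a minimal sketch of computing these metrics for an extracted keyword set against a gold-standard set follows; the sets are toy data.

```python
def prf(extracted, gold):
    """Precision, Recall, and F-measure for keyword sets."""
    tp = len(extracted & gold)                  # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

extracted = {"semantic smoothing", "keyword", "corpus"}
gold = {"semantic smoothing", "keyword extraction"}
print(prf(extracted, gold))  # (0.333..., 0.5, 0.4)
```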

5. Experimental Results and Discussion

We conduct the experiments in the same sequence as presented in Section 3.

5.1. POS Tagging

As discussed in Section 3.1, Table 2 presents the different combinations of POS experimented with for the task of keyword extraction.

Here, NO = Nouns, AD = Adjectives, F = Foreign Words, I = Irrelevant Terms, V = Verbs, NU = Numbers, G = Genitive Markers, AG = Agents, and MV = Modal Verbs.

Foreign Words include non-English words, and Irrelevant Terms are represented by a union of all prepositions, conjunctions, determiners, possessive pronouns, particles, adverbs, and interjections [45], while Genitive Markers show ownership, measurement, association, or source, e.g., “boy’s” and “of the boy.”

Although the combination of POS selected for our methodology ranks fourth among the different ones experimented with, we avoided, for obvious reasons, those combinations that included either foreign words or irrelevant terms.

5.2. Labeling Corpus

As discussed in Section 3.2, Table 3 displays the various frequencies of terms assigned as labels in the labeling corpus, which were experimented with for the task of keyword extraction.

5.3. Ratio Metric

As discussed in Section 3.3, we experimented with different threshold values for $x$ for the task of keyword extraction and found $x = 5$ to be the optimal value in terms of F-measure, as mentioned in Table 4.

All results related to the different stages of the keyword extraction process are summarized in Table 5, as discussed in Sections 3.1–3.3.

Here, F1, F2, and F3 represent the POS Tagging, Labeling Corpus, and Ratio Metric features, respectively.

5.4. Keyphrase Extraction

As discussed in Section 3, the optimal values yielded for the first three features for the process of keyword extraction were then revisited to yield the optimal values for the process of keyphrase extraction. Although the same optimal values were yielded for the first two features, the Ratio Metric feature produced an optimal value at $x = 8$, as mentioned in Table 6 and also reflected in Tables 4 and 5.

This is the simplest combination of the keyword extraction and keyphrase extraction processes, where all adjacently located keywords in $d$ were utilized to form keyphrases.

As discussed in Section 3.4, for our second combination of the keyword extraction and keyphrase extraction processes, we experimented with different values for $n$ and found $n = 55$ to be the optimal value in terms of F-measure, as mentioned in Table 7.

As discussed in Section 3.4, for our third combination of the keyword extraction and keyphrase extraction processes, we experimented with different values for $m$ and found $m = 59$ to be the optimal value in terms of F-measure, as mentioned in Table 8.

As discussed in Section 3.4, for our fourth combination of the keyword extraction and keyphrase extraction processes, we experimented with different values for $n$ and found $n = 55$ to be the optimal value in terms of F-measure, as mentioned in Table 9.

All results related to different combinations of keyword extraction and keyphrase extraction processes are summarized in Table 10, as discussed in Section 3.4.

5.5. CCSS vs. State-of-the-Art Techniques

We systematically compared the performance of CCSS with the other techniques presented in the literature, when implemented over the INSPEC dataset; this analysis is presented in Table 11. It is clear that CCSS outperforms all state-of-the-art keyphrase extraction techniques presented in the literature.

6. Conclusion and Future Work

In this paper, we have presented a novel supervised technique for the extraction of keywords from medium-sized documents, namely, Corpus-based Contextual Semantic Smoothing (CCSS). CCSS extends the concept of Contextual Semantic Smoothing (CSS), which considers term usage patterns in similar texts to improve term relevance information. We introduced four more features beyond CSS as our novel contributions in this work. We systematically compared the performance of CCSS with other techniques over the INSPEC dataset, where CCSS clearly outperformed all state-of-the-art keyphrase extraction techniques presented in the literature.

Our future work includes utilizing CCSS in the applications of indexing and search, summarization, and multilingual summarization of medium-sized documents. We are also currently compiling a literature review of all keyword extraction-based applications, including and beyond the abovementioned ones.

Data Availability

Previously reported INSPEC and ACM datasets were used to support this study and are available at https://www.theiet.org/publishing/inspec/ and https://www.innovalor.nl/, respectively. The datasets used in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors gratefully acknowledge both Mohammad Ali Jinnah University (MAJU), Karachi, Pakistan, and the Deanship of Research, Islamic University of Madinah, Madinah, Kingdom of Saudi Arabia, for the support provided for this research. This research was funded by the Deanship of Research, Islamic University of Madinah, Madinah, Kingdom of Saudi Arabia.