Abstract

A huge share of the data on the web comes from discussion forums, which contain millions of threads. Discussion threads are a valuable source of knowledge for Internet users, as they hold information about numerous topics. A discussion thread on a single topic can comprise a huge number of reply posts, which makes it hard for forum users to scan all the replies and determine the most relevant ones. It is equally hard for forum users to manually summarize the bulk of reply posts in order to get the gist of the discussion. Thus, automatically extracting the most relevant replies from a discussion thread and combining them into a summary is a challenging task. Motivated by this, this study proposes a sentence embedding based clustering approach for discussion thread summarization. The proposed approach works as follows: First, a word2vec model is employed to represent the reply sentences in a discussion thread as sentence embeddings (sentence vectors). Next, the K-medoid clustering algorithm groups semantically similar reply sentences in order to reduce overlap among them. Finally, several quality text features are used to rank the reply sentences within each cluster, and the highest-ranked sentences are picked from all clusters to form the thread summary. Two standard forum datasets are used to assess the effectiveness of the suggested approach. Empirical results confirm that the proposed sentence embedding based clustering approach outperforms other summarization methods in terms of mean precision, recall, and F-measure.

1. Introduction

The content shared by Internet users on online forum platforms is a valuable repository of information. As information and communication technologies (ICT) advance at a rapid pace, a vast amount of data has become available online. Many users rely on web services to share their knowledge about specific subjects, and this knowledge exists on the web in the form of discussion forums, blogs, and other user-generated content [1].

Discussion forums are also known as web forums, message boards, and bulletin boards. They have become very popular because these platforms give users easy access to share information and to discuss issues and topics of common interest. A huge share of the data on the web comes from discussion forums, which contain millions of threads. These threads are a valuable source of knowledge for Internet users, as they hold information about various topics. The threads, also called discussion threads, are important both for those who post and for “lurkers,” users who only read the replies. A discussion thread on a single topic can comprise a huge number of individual posts, which makes it hard for forum users to locate the most significant information in the thread.

A discussion thread is initiated when a user posts an initial post or question, and other users reply to it, leading to an active discussion. As more users join the discussion, the number of replies within the thread grows, and it becomes difficult for users to read them all [2].

In such a situation, a forum user would favor a short summary of the ongoing discussion in order to grasp the idea of a long thread in a short time. Since it is hard for forum users to manually summarize the bulk of replies in a discussion thread, an automatic solution for discussion thread summarization is needed. Motivated by this, this study proposes a sentence embedding based clustering approach for discussion thread summarization. The approach retrieves the most significant replies from the discussion thread to constitute a summary. Summarizing forum threads helps forum users swiftly comprehend the main topic of a discussion thread.

A few research studies have applied both extractive and abstractive summarization techniques to related tasks, such as e-mail thread summarization [3], summarization of written and spoken chats/conversations and meeting recordings [4], and Twitter topic summarization [5].

Our approach to discussion thread summarization differs from prior approaches in a few aspects. It is fully automatic and domain-independent, does not rely on any dictionary or external resource, and involves no human intervention to generate a summary. It is a generic approach, applicable to English discussion forums of any domain.

The suggested approach works as follows: First, we employ a word2vec model to transform the collection of reply sentences into sentence embeddings (sentence vectors), obtained by averaging the word embeddings of the words in each reply sentence. Next, we apply the K-medoid clustering algorithm to group semantically similar reply sentences, which reduces overlap and yields distinct reply sentences in the summary. Finally, we use several quality text features to rank the reply sentences within each cluster; the highest-ranked reply sentences from the different clusters are then combined to form the summary of the discussion thread.

Our key contributions are as follows:
(i) We develop a sentence embedding based clustering approach, integrated with quality text features, for discussion thread summarization.
(ii) We measure the efficacy of the suggested approach on two standard discussion forum datasets using ROUGE-N (N = 1, 2) evaluation metrics.

The paper is structured as follows: Related work is reviewed in Section 2. Section 3 presents the proposed summarization approach. The empirical results, followed by a discussion, are described in Section 4. The conclusion, together with future work, is given in Section 5.

2. Related Work

First, we discuss prior work on extractive summarization methods; we then review previous research efforts on discussion thread summarization.

The methods used for summarizing text can be divided into two classes: extractive summarization (ES) and abstractive summarization (AS). ES aims to retrieve the most significant text units from the source document and combine them into a condensed form of the document. AS, on the other hand, is a more challenging task: it requires deep semantic representations and advanced natural language processing (NLP) techniques to produce a shorter, novel summary text.

In extractive summarization, identification of the most significant textual units can be considered as a ranking problem, a classification problem, or a selection problem [6].

In the selection approach [7, 8], textual units are chosen one by one in decreasing order of importance, while both the importance and the redundancy of previously chosen units are taken into account. The classification approach [8, 9] treats each text unit independently and classifies it as either salient or nonsalient. The ranking approach presented in [10–17] assigns a salience score to each textual unit; the units are then arranged in decreasing order of salience, and the most significant ones are chosen according to a threshold or a predefined cutoff (such as a fixed number of text units or words).

For most document categories, the summarization units are typically sentences [18]. In meeting conversation summarization, the units are usually utterances [19–22], while in discussion thread summarization the typical units are posts [23, 24].

There are many application domains where ES techniques are used, for instance, summarization of websites, patent records, and news stories [25, 26]. Tseng et al. [27] proposed a feature-based summarization method for patents that uses features such as cue phrases and sentence location to identify relevant sentences. Trappey et al. [28] combined an ontology tree structure with TF-IDF to extract keywords for selecting salient sentences from patent documents; a clustering technique then groups the salient sentences to form a summary. Vazhenin et al. [29] supplied Google’s search engine with an extended WordNet-based query to find relevant webpages, and the summary is created by choosing the sentences from those webpages that contain the most relevant keywords. Kallimani et al. [30] introduced a statistical method for news summarization: features such as term frequency, sentence length, the news article’s title, proper nouns, and the article’s first sentence are used to rank sentences in news documents, and the highest-ranked sentences are combined into a summary. A pattern-based summarization method was introduced in [31] to score sentences in news documents by adding the weights of the patterns they cover.

Graph-based approaches [32] have recently gained traction and have been used effectively for extractive summarization. These approaches apply either the PageRank (PR) algorithm [33] or one of its variants to rank graph nodes that represent text units such as sentences or passages. The connectivity-graph idea presented in [34] assumes that nodes with more connections to other nodes carry more salient information. The LexPageRank technique [35], based on the idea of eigenvector centrality, builds a connectivity matrix of sentences and uses a PR-like algorithm to rank sentences for the summary. Another PR variant [36] has also been used to rank significant sentences. van Oortmerssen [24] examined subtopic features for multiple documents and embedded them in a graph-based ranking procedure. The affinity-graph approach [37] for extractive summarization uses a PR-like algorithm to calculate sentence relevance scores while taking information richness into consideration. A graph-based summarization model presented in [38] thoroughly analyzes the document set and examines its global impact at the sentence level. A summarization approach based on a weighted graph model [39] merges clustering and ranking methods to select relevant sentences. Nguyen-Hoang et al. [40] employed a graph-based PR algorithm for summarizing Vietnamese documents. An extractive approach based on event graphs [41] used handcrafted rules to generate multidocument summaries.

Recently, deep learning (DL) and reinforcement learning (RL) approaches [42–45] have attracted the attention of researchers, and their capabilities have been exploited to enhance text summarization. However, DL/RL networks require training on large amounts of human-crafted summaries, which are not easily available.

Discussion thread summarization (DTS) has been an exciting area of research, and several research efforts have been made in this direction over the last decade [24, 46, 47]. The goal of DTS is to select the most relevant reply posts in a discussion thread and merge them into a concise thread summary. Most of the existing work has focused on summarizing comment threads on news websites [48–53]. Previous studies have also applied abstractive summarization (AS) techniques to related tasks, such as e-mail thread summarization [3], summarization of written and spoken chats/conversations and meeting recordings [4], and Twitter topic summarization [5].

In this study, we propose an extractive approach for discussion thread summarization: the individual replies of forum users are extracted as-is and are not rephrased when the summary is formed.

Our work is related to the works presented in [54, 55]. The authors of [54] proposed a technique that combines topic modeling with clustering to produce a summary from forum posts. The technique was assessed on the DUC 2007 standard dataset (for multidocument summarization) and on private discussion forum data, and its results were compared with MEAD, a centroid-based summarization approach [56]. The technique outperformed MEAD on DUC 2007; however, its performance was not consistently better on the forum data [54]. Bhatia et al. [55] treated discussion thread summarization (DTS) as a post-classification task, in which the job is to classify a given forum post as either relevant to the summary or not; the classification was performed in a supervised manner using several features, and the method was assessed on two standard forum datasets, Ubuntu and New York City (NYC). Ren [48] introduced a forum summarization technique that models the structure of forum replies in a discussion thread. The next section presents the proposed methodology of our approach.

3. Proposed Methodology

The research framework of our proposed study is illustrated in Figure 1. It is composed of six phases: (1) preprocessing, (2) reply sentence embedding, (3) semantic clustering of replies, (4) text features extraction, (5) ranking of reply sentences, and (6) summary generation.

3.1. Preprocessing

The most significant procedure in computational linguistics and text summarization is data preprocessing. As the proposed work concerns discussion thread summarization, preprocessing of the thread documents is needed to speed up the subsequent computational steps. The preprocessing phase includes four steps, discussed as follows (a code sketch of the full pipeline follows this list):

(a) Sentence segmentation: this step splits the text into sentences by detecting boundaries within the text. Generally, a question mark (?), an exclamation mark (!), or a full stop/period (.) indicates a sentence boundary [57]. Consider the following snippet of a thread document: “I like Ubuntu. It is amongst the best flavors of Linux.” After segmenting it, we get two sentences:
Segment 1: “I like Ubuntu.”
Segment 2: “It is amongst the best flavors of Linux.”

(b) Tokenization: the procedure of segmenting sentences into distinct tokens (words). Whitespace characters such as tabs and blanks, together with punctuation symbols such as the comma, semicolon, period, and colon, serve as the main cues for dividing the text into words.

(c) Stop word elimination: stop words occur in the thread document with high frequency; they include prepositions, articles, conjunctions, and other frequent words such as “an,” “the,” “a,” and “I.” These words convey little or no meaning in a forum thread document, so eliminating them helps boost system performance. This work uses the stop word list proposed by Buckley [58].

(d) Word stemming: one of the significant tasks of the preprocessing phase, stemming reduces derived terms to their root term so that similar notions are captured. This study uses the well-known Porter stemming algorithm [58], which removes suffixes from derived words. For instance, the stemmer converts the words “playing,” “played,” and “plays” to their stem “play” by removing the suffixes -ing, -ed, and -s.
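To make the pipeline concrete, the following is a minimal sketch of the four steps in Python using NLTK. The toolkit choice is our assumption (the paper names none), and NLTK's English stop list stands in for the Buckley list [58].

```python
# Minimal preprocessing sketch (assumption: NLTK; the paper names no toolkit).
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))  # stand-in for the Buckley list
stemmer = PorterStemmer()

def preprocess(thread_text):
    """Return one stemmed, stopword-free token list per sentence."""
    processed = []
    for sentence in sent_tokenize(thread_text):             # (a) segmentation
        tokens = word_tokenize(sentence.lower())             # (b) tokenization
        tokens = [t for t in tokens
                  if t.isalpha() and t not in STOP_WORDS]    # (c) stop words
        processed.append([stemmer.stem(t) for t in tokens])  # (d) stemming
    return processed

print(preprocess("I like Ubuntu. It is amongst the best flavors of Linux."))
```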

3.2. Reply Sentence Embedding

The purpose of this step is to obtain sentence embeddings from the thread reply sentences. A sentence embedding is a numerical representation of text that machine learning algorithms can use efficiently. It is a rich representation that preserves both the semantic and the syntactic information in a sentence and leads to improved performance in almost every NLP task.

We use a word2vec model to obtain sentence embeddings from the collection of reply sentences by extracting a word vector (word embedding) for each individual word/token in the reply sentences. This study uses the pretrained word2vec model released by Google [59, 60] to provide the word vectors. The model is based on a neural network architecture, employs the continuous bag-of-words (CBOW) scheme to learn distributed vector representations of words, and was trained on roughly one hundred billion words from the Google News dataset. We keep the default word vector length of 300 dimensions.

Finally, the sentence embeddings (sentence vectors) are obtained by averaging all word vectors that are present in the word2vec vocabulary, ignoring missing words. Each reply sentence is thus expressed as a numeric vector. We also store the sentence text along with its numeric representation, for later use when sentences are selected for summaries based on the different text features.
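A minimal sketch of this step with gensim, assuming the standard GoogleNews-vectors-negative300 release; out-of-vocabulary words are skipped, exactly as described above.

```python
# Sketch: averaged word2vec vectors as sentence embeddings (gensim assumed).
import numpy as np
from gensim.models import KeyedVectors

# Standard Google News release; the file name/path is an assumption.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

def sentence_vector(tokens, kv):
    """Average the vectors of in-vocabulary tokens (300 dimensions)."""
    vectors = [kv[t] for t in tokens if t in kv]
    if not vectors:                          # sentence with no known words
        return np.zeros(kv.vector_size)
    return np.mean(vectors, axis=0)

vec = sentence_vector(["install", "ubuntu", "alongside", "windows"], kv)
```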

3.3. Semantic Clustering of Reply Sentences

Different users in a discussion thread often post replies to the initial question post that are worded differently but carry the same meaning, so it is a good idea to cluster semantically similar replies in a thread. In this study, we employ an unsupervised machine learning algorithm, K-medoid, to group semantically similar reply sentences; this reduces overlapping reply sentences and yields distinct reply sentences in the summary.

The K-medoid algorithm [61] is a partitioning clustering approach that separates a dataset of n data points (here, reply sentences) into K predefined, distinct, nonoverlapping groups called clusters, with each data point assigned to exactly one cluster. K-medoid clustering is more stable and less susceptible to outliers and noise than the K-means algorithm, since it uses medoids (actual data points) as cluster centers instead of the averages of points used in K-means. Moreover, it is fast, converges in a fixed number of steps, and is simple to implement. In this approach, each cluster is characterized by one of its data points, called the cluster medoid.

K-medoid is also known as partitioning around medoids (PAM). The term medoid refers to the point in a cluster whose average dissimilarity to all other points in the cluster is minimal. The key objective of K-medoid clustering is to minimize the sum of dissimilarities between the data points in a cluster and the respective cluster medoid (cluster center). The cost of the K-medoid algorithm [61] is given as

$$C = \sum_{j=1}^{K} \sum_{p_i \in C_j} d(p_i, c_j),$$

where $c_j$ is the medoid of cluster $C_j$ and $d(p_i, c_j)$ is the (Manhattan) dissimilarity between data point $p_i$ and its medoid.

The pseudocode of the K-medoid clustering algorithm is as follows:
(1) Initialization: select k medoids by randomly choosing k of the n data points.
(2) Assign each data point to the nearest medoid using the Manhattan distance.
(3) While the cost decreases, for every medoid c and for every non-medoid data point p:
(a) Swap c and p, reassign each data point to its nearest medoid, and recompute the cost.
(b) If the total cost is greater than in the previous step, undo the swap.

In this work, we selected k = 10 as the optimal number of clusters, determined using the silhouette method. The number of sentences in the final summary depends on this number of clusters: with ten clusters, we choose the top-scoring, information-rich representative sentences from the ten clusters, based on the different text features, to yield the summary, as sketched below.
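The sketch below mirrors this step with the KMedoids implementation from scikit-learn-extra, scanning candidate k values with the silhouette score; the library choice and the scan range are our assumptions, not the paper's code.

```python
# K-medoid (PAM-style) clustering over the sentence-vector matrix X, with
# silhouette-based selection of k.
import numpy as np
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score

# X: (n_sentences, 300) matrix built with sentence_vector() from Section 3.2.
best_k, best_score = 2, -1.0
for k in range(2, 16):                                  # candidate k values
    labels = KMedoids(n_clusters=k, metric="manhattan",
                      random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, metric="manhattan")
    if score > best_score:
        best_k, best_score = k, score

model = KMedoids(n_clusters=best_k, metric="manhattan", random_state=0).fit(X)
labels = model.labels_              # cluster id per reply sentence
medoids = model.medoid_indices_     # index of the medoid sentence per cluster
```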

3.4. Quality Text Features Extraction

Text features play an important role in choosing salient content for the summary. To score the reply sentences in the different clusters, we extract eight quality text features from the reply sentences in each cluster formed in the previous step. The feature values are normalized between 0 and 1. The features used in this work are briefly discussed below.

3.4.1. Semantic Distance between Thread Reply and Thread Centroid

This feature computes the semantic similarity between a thread reply sentence and the thread centroid. Reply sentences that are semantically similar to the thread centroid are believed to be appropriate for the final summary. We employ the TF-IDF technique to determine the thread centroid, which represents the most important terms in a thread. Once the centroid of a discussion thread is computed, we determine the semantic distance between the centroid and the thread replies using the word mover's distance (WMD), a word embedding based measure that here operates over Google's pretrained word2vec vectors.
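A hedged sketch of this feature: scikit-learn's TfidfVectorizer stands in for the TF-IDF step, and gensim's wmdistance (over the pretrained vectors from Section 3.2) computes the WMD. The helper name and the choice of ten centroid terms are our assumptions.

```python
# Feature sketch: WMD between each reply and the TF-IDF thread centroid.
from sklearn.feature_extraction.text import TfidfVectorizer

def thread_centroid_terms(reply_texts, top_n=10):
    """Top-n TF-IDF terms over the thread, used as the thread 'centroid'."""
    tfidf = TfidfVectorizer(stop_words="english")
    matrix = tfidf.fit_transform(reply_texts)
    weights = matrix.sum(axis=0).A1          # aggregate weight per term
    terms = tfidf.get_feature_names_out()
    return [terms[i] for i in weights.argsort()[::-1][:top_n]]

# reply_texts: list of raw reply sentence strings for one thread.
centroid = thread_centroid_terms(reply_texts)
# Smaller WMD = semantically closer to the centroid (kv from Section 3.2).
wmd_to_centroid = [kv.wmdistance(r.lower().split(), centroid)
                   for r in reply_texts]
```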

3.4.2. Cosine Similarity between Reply and Thread Centroid

Reply sentences closely related to the thread centroid are assumed to be important for inclusion in summary. This feature finds the cosine similarity between vector representations of thread centroid and thread reply sentences.

3.4.3. Unique Words Count in a Reply

This feature counts the unique words in a thread reply sentence. Reply sentences containing more unique words are considered more appropriate for the summary.

3.4.4. Common or Overlapping Words between Thread Reply and Initial Post

A thread reply sentence is important for the summary if it shares words with the initial post. This feature counts the common (overlapping) words between a thread reply and the initial post using the Jaccard similarity.

3.4.5. Semantic Distance between Thread Reply and Thread Title

A thread reply sentence that is semantically similar to the thread title is considered to be significant for summary. We used WMD to determine the semantic distance between thread title and thread reply sentence.

3.4.6. Semantic Distance between Thread Reply and Initial Post

A thread reply sentence that is semantically similar to the initial post is considered to be salient for summary. WMD is employed to get the semantic gap/distance between thread reply sentence and initial post.

3.4.7. Reply Sentence Length

It determines the length of the reply sentence by finding the number of words in it.

3.4.8. Number of Verbs and Nouns

This feature determines the number of verbs and nouns in a thread reply.
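For concreteness, the sketch below computes the simpler lexical features from Sections 3.4.3, 3.4.4, 3.4.7, and 3.4.8; NLTK's POS tagger is an assumption, since the paper does not name one.

```python
# Lexical feature sketch (NLTK POS tagging is an assumption).
import nltk
from nltk import word_tokenize, pos_tag
nltk.download("averaged_perceptron_tagger", quiet=True)

def lexical_features(reply, initial_post):
    reply_tokens = word_tokenize(reply.lower())
    reply_set = set(reply_tokens)
    post_set = set(word_tokenize(initial_post.lower()))
    union = reply_set | post_set
    tags = [tag for _, tag in pos_tag(reply_tokens)]
    return {
        "unique_words": len(reply_set),                                # 3.4.3
        "overlap_jaccard": len(reply_set & post_set) / (len(union) or 1),  # 3.4.4
        "length": len(reply_tokens),                                   # 3.4.7
        "verbs_nouns": sum(t.startswith(("VB", "NN")) for t in tags),  # 3.4.8
    }
```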

3.5. Ranking of Reply Sentences

The objective of this step is to select the best-scoring reply sentences from the various clusters based on the quality text features discussed in Section 3.4. Each reply sentence in a cluster is represented by an 8-dimensional vector, where each dimension holds one feature score.

Once the feature scores for the reply sentences in the different clusters have been calculated, they are summed to obtain a ranking score for each reply sentence:

$$Score(S_i) = \sum_{j=1}^{8} f_j(S_i),$$

where $Score(S_i)$ denotes the ranking score of reply sentence $S_i$ and $f_j(S_i)$ denotes the sentence's score on the j-th text feature. Once the scores are obtained, the reply sentences in each cluster are ranked by score, and the top-scoring sentences are picked from all clusters. In this study, the number of clusters is 10, so we choose ten representative reply sentences, one from each cluster, for the final summary.
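A minimal sketch of the ranking and per-cluster selection, assuming an already normalized (n_sentences x 8) feature matrix and the cluster labels from Section 3.3:

```python
# Rank sentences by summed feature scores; keep the best one per cluster.
import numpy as np

def rank_and_select(features, labels, sentences):
    """features: (n, 8) normalized scores; labels: cluster id per sentence."""
    scores = features.sum(axis=1)              # Score(S_i) = sum_j f_j(S_i)
    summary = []
    for cluster in np.unique(labels):
        members = np.where(labels == cluster)[0]
        summary.append(sentences[members[np.argmax(scores[members])]])
    return summary
```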

3.6. Summary Generation

In this step, the reply sentence with the maximum ranking score is chosen from each cluster as a representative sentence, and these representatives together form the extractive summary of the discussion thread. The ranking score of each reply sentence within a cluster is obtained from the quality text features discussed in the previous section.

4. Experimental Settings

4.1. Datasets for Evaluation

Our sentence embedding based clustering approach for discussion thread summarization and other state-of-the-art clustering techniques are evaluated on two publicly available discussion forum datasets: the technical discussion forum for the Ubuntu Linux distribution (http://ubuntuforums.org) and the nontechnical discussion forum for New York City (NYC) on TripAdvisor (https://www.tripadvisor.com.my/ShowForum-g28953-i4-New_York.html). One hundred discussion threads were randomly chosen from each dataset. Each thread has an initial post (the question) and associated replies as candidate answers. There are 756 replies in total in the Ubuntu dataset and 788 in the NYC dataset.

For each discussion thread in both the Ubuntu and NYC datasets, there are also associated gold summaries created by two human annotators, referred to as Annotator-1 and Annotator-2. The effectiveness of the proposed method is assessed using the ROUGE-N (N = 1, 2) evaluation metrics.

4.2. Experimental Steps

Given the corpus of discussion threads, the preprocessing techniques are first applied to divide the corpus into sentences, split the sentences into tokens (words), and eliminate stop words. Porter stemming is then applied to the remaining words to reduce them to their root forms. Next, the pretrained word2vec model transforms the collection of reply sentences into sentence vectors: the word embedding of each word in a reply sentence is extracted (with the default vector length of 300) and the word vectors are averaged. We then apply the K-medoid clustering algorithm to group semantically similar reply sentences, using the 10 optimal clusters chosen in this work. The quality text features are used to rank the reply sentences within each cluster, and the highest-ranked sentences are picked from all clusters. The top ten representative reply sentences, one from each of the 10 clusters, form the final extractive summary.

In order to assess the performance of the proposed sentence embedding based clustering approach for discussion thread summarization, we set up two comparison models for the summarization task. The first model is fuzzy c-means clustering (FCM) [61], which divides a finite collection of n data points into K clusters by linking each data point to all clusters through a real-valued vector of membership indexes. Unlike traditional clustering, in fuzzy clustering each data point can belong to more than one cluster at the same time.

The second model is K-means (KM) clustering [62], which divides the dataset into K predefined, distinct, nonoverlapping clusters, with each data point allocated to a single cluster. The data points in our case are the reply sentences (represented as sentence vectors). The key objective of the KM algorithm is to minimize the sum of semantic distances between the data points and their respective cluster centroids.
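For reference, the sketch below runs both baselines on the sentence-vector matrix X from Section 3.3; scikit-fuzzy and scikit-learn are our implementation choices, not necessarily those used in the paper.

```python
# Baseline sketches: fuzzy c-means (soft assignment) and K-means (hard).
import skfuzzy as fuzz
from sklearn.cluster import KMeans

# Fuzzy c-means: each sentence gets a membership degree in every cluster.
# skfuzzy expects the data transposed to (n_features, n_samples).
cntr, u, *_ = fuzz.cluster.cmeans(X.T, c=10, m=2.0, error=1e-4, maxiter=300)
fcm_labels = u.argmax(axis=0)          # hard labels from soft memberships

# K-means: each sentence is assigned to exactly one cluster.
km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
```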

This research uses the ROUGE-N (N = 1, 2) metrics to compare the performance of the proposed summarization method with that of the comparison models.

The ROUGE evaluation metric has been applied effectively to extractive summarization [63]. ROUGE-N measures the n-gram overlap between the system-generated summary and the human annotator (reference) summary and is determined as

$$\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Ref}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{Ref}\}} \sum_{gram_n \in S} Count(gram_n)},$$

where $Count_{match}(gram_n)$ is the maximum number of n-grams co-occurring in both the system (machine) summary and the human (annotator) summaries, and $n$ is the n-gram length.

The different measures for a system summary [63] are determined as follows:

$$P = \frac{|\text{overlapping n-grams}|}{|\text{n-grams in system summary}|}, \quad R = \frac{|\text{overlapping n-grams}|}{|\text{n-grams in reference summary}|}, \quad F = \frac{2PR}{P+R}.$$
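As an illustration, ROUGE-1/2 precision, recall, and F-measure can be computed with Google's rouge-score package; this is a convenient stand-in, not necessarily the toolkit used in the paper.

```python
# ROUGE-1/2 example with the rouge-score package (a stand-in toolkit).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
scores = scorer.score(
    "upgrade the graphics driver and then reboot the machine",  # reference
    "upgrade the driver and reboot")                            # system summary
for name, s in scores.items():
    print(name, f"P={s.precision:.2f} R={s.recall:.2f} F={s.fmeasure:.2f}")
```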

Tables 1 and 2 present the comparative results of the proposed summarization approach and the other approaches, measured with ROUGE-N (N = 1, 2). These results are obtained on the subset of 100 randomly chosen discussion threads from the Ubuntu dataset. Considering the ROUGE-1 results in Table 1, the proposed sentence embedding based clustering approach outperforms the other clustering techniques in terms of average precision, recall, and F-measure; among the baselines, FCM clustering generates better summarization results than KM clustering.

Likewise, for the ROUGE-2 results in Table 2, the proposed summarization method again outperforms the other clustering-based techniques on all measures, and FCM clustering continues to produce better summarization results than KM clustering.

For Ubuntu dataset, Figures 2 and 3 illustrate the summarization outcomes of the suggested approach and other summarization approaches, calculated using ROUGE-N (N = 1, 2) metrics.

Similarly, for NYC dataset, Tables 3 and 4 show the outcomes of comparative assessment of the proposed summarization method and other comparison models using ROUGE-N (N = 1, 2) measures. These results are also obtained from a subset of 100 randomly chosen discussion threads from NYC dataset.

Considering the ROUGE-1 results in Table 3, the proposed clustering technique gives superior summarization results compared to the other clustering techniques in terms of average recall, precision, and F-measure. Among the baselines, FCM clustering yields better results than the KM algorithm in terms of average precision, but lower average recall and average F-measure.

The ROUGE-2 results in Table 4 show that the proposed sentence embedding based clustering approach remains stable and performs better than the other clustering techniques. However, on the NYC dataset the KM algorithm performs better than FCM. For the NYC dataset, Figures 4 and 5 depict the summarization results of the proposed method and the other summarization models, determined using the ROUGE-N (N = 1, 2) metrics.

4.3. Discussion

This section illustrates the thread summarization outcomes of our proposed sentence embedding based clustering approach and other comparison summarization approaches in the context of Ubuntu and NYC discussion forums datasets. The efficacy of the suggested approach and other summarization approaches is measured in terms of average recall, precision, and F-measure achieved with ROUGE-N (N = 1, 2) metrics.

Referring to the ROUGE-1 results in Tables 1 and 3, the proposed approach gives better summarization results than the FCM and KM clustering algorithms in terms of precision, recall, and F-measure. Among the baselines, FCM gives better summarization results than KM on the Ubuntu dataset, whereas KM performs better than FCM in terms of average recall and average F-measure on the NYC dataset.

It can be observed from the ROUGE-2 results in Tables 2 and 4 that the proposed approach shows stable performance and gives better summarization results than the other clustering techniques, while the ROUGE-2 results of the K-means algorithm are better than those of fuzzy c-means. The experimental results confirm that the proposed sentence embedding based clustering approach performs consistently better on both the Ubuntu and NYC datasets than the other clustering techniques, and that combining sentence embedding based clustering with the ranking procedure based on quality text features boosts the summarization results.

We also conducted statistical t-tests to determine whether the improvement of the proposed approach over the other summarization models is significant. A paired-samples t-test, comparing two sets of results on the same test set, yielded significance values of 0.032, 0.027, and 0.025 for average precision, recall, and F-measure, respectively. Since these values fall below the usual 0.05 threshold, the results of the proposed approach differ significantly from those of the compared summarization models.

5. Conclusion and Future Work

Discussion thread summarization is a daunting task, and this work sets a viable direction for it. We introduced a sentence embedding based clustering approach that builds semantic representations of thread replies, groups semantically similar replies into clusters, and then creates an extractive summary by selecting the top-ranked reply from each cluster based on various quality text features. The summary gives the gist of an enormous number of thread replies. The proposed approach is fully automatic, generic, and suitable for discussion forums of different domains. The experimental findings confirm that the proposed method produces better results than the other summarization methods. In the future, we plan to use deep learning models to create extractive/abstractive summaries of discussion threads, and to extend our methodology to other domains in order to inspect its usefulness.

Data Availability

The data are publicly available at https://ubuntuforums.org and https://www.tripadvisor.com.my/ShowForum-g28953-i4-New_York.html.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by Islamia College, Peshawar, and Higher Education Commission (HEC) of Pakistan.