Abstract

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in the literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.

1. Introduction

Topic modelling is a text mining technique used to uncover latent topics in large collections of documents. The latent Dirichlet allocation (LDA) [1] model is the state-of-the-art topic model, with a proven history of success on long documents such as news articles and e-books. Owing to the increasing popularity of microblogging websites, social media platforms, and online shopping (which involves product reviews), significantly shorter texts have become increasingly relevant. Such sources of text potentially hold valuable information that is useful in many applications, such as event tracking [2], interest profiling [3], and product recommendation [4].

Traditional topic models infer topics based on word co-occurrence relationships in documents [5]. In order to extract meaningful topics, a topic model must successfully infer these relationships from a corpus. By definition, short text contains few words and consequently tends to contain less co-occurrence information than long text. For instance, some platforms, such as Twitter, impose a character limit on each post, which severely constrains the amount of information one can share in a single post. This has created a need to reconsider topic model assumptions in order to overcome this challenge. One topic model which has shown better performance on short text than LDA is the Dirichlet-multinomial mixture model (DMM) [6, 7]. LDA is sometimes referred to as an admixture model [8] as it assumes that each document contains several topics. In contrast, DMM is inherently a mixture model; it assumes that each document contains only a single topic, which is a seemingly more sensible assumption for short text. This simple assumption is likely the reason for the better performance which has been observed on some short text datasets [9–11].

The conjugate Dirichlet prior allows for convenient hierarchical Bayesian modelling of count data using a multinomial distribution [12], and, over the years, most topic models have been built under the assumption that documents are sampled from a multinomial distribution. Another natural choice of distribution for count data is the Poisson distribution. However, it has received significantly less attention, as some researchers have found that it does not fit natural text [13]. Nevertheless, other researchers have found that the family of Poisson distributions produces good results in the categorisation of text and other count data [14, 15], which motivates our investigation into the Poisson distribution as a viable option for topic modelling of short text.

The contributions of this work are as follows:
(1) We propose a new topic model for short text, the Gamma-Poisson mixture (GPM) topic model, which has not been applied in the literature before. This model is based on the Poisson distribution, and we show that it is able to produce topics with improved coherence scores when compared to GSDMM (the collapsed Gibbs sampler version of DMM) [6].
(2) We derive a collapsed Gibbs sampler for the estimation of the model parameters. This estimation procedure enables the model to estimate the number of topics automatically.
(3) We perform extensive experiments in Python on three short text corpora and report on the characteristics of the new model.
(4) We have also made the development version of the GPM model available as a Python package at https://github.com/jrmazarura/GPM.

2. Related Work

Conventional topic models take advantage of word co-occurrence information in documents to infer the latent topics. However, due to their length, this kind of information is limited in short texts, which poses a challenge when applying traditional topic models. It is for this reason that short texts are often described as being sparse. In order to overcome the challenges associated with topic modelling of short text, some researchers have proposed pooling or aggregating short texts to create longer documents prior to applying traditional topic models [3, 16–19]. Others have successfully proposed modifications to conventional topic models, such as LDA or DMM. These modifications include incorporating auxiliary information from external corpora [20, 21] and inducing sparsity into the models [22–24]. Lastly, another popular approach is the derivation of completely new models [5, 25]. In light of the success of DMM on short text, the new model that we propose is a modification of DMM (the interested reader is referred to the following recent review papers for further reading on short text topic modelling: [26, 27]).

In the context of topic modelling, the multinomial distribution is most commonly used to model the words in a document. In contrast, significantly fewer topic models are constructed based on the Poisson distribution. Yet, in other text mining fields, such as in text classification [14, 15] and information retrieval [28], some researchers were able to obtain improved results with the Poisson distribution in comparison to the multinomial distribution in their applications. This serves as further motivation for considering the Poisson distribution as a basis for our topic model.

The Gamma-Poisson (GaP) model [29] and the Poisson decomposition model [30] are both examples of topic models that assume word counts follow a Poisson distribution. Other Poisson-based topic models are presented in [12, 31]. None of these models was specifically designed for short text, and the authors only tested their models on long documents. Our model is distinctly different from these in that it assumes each document contains a single topic, whereas these models assume that each document contains multiple topics. The Poisson-based Dirichlet multinomial mixture model (PDMM) [11] is another DMM-based topic model, formed by incorporating a Poisson distribution in the generative process so as to allow each document to contain one, two, or three topics. In order to allow PDMM to also take advantage of semantic relations between words, Li et al. [11] extended PDMM by incorporating word embeddings through a generalized Pólya urn. Despite PDMM being termed "Poisson-based," it still models word counts with a multinomial distribution.

Lastly, unlike the multinomial distribution, the Poisson distribution does not assume that occurrences of the same word are independent of each other [32]. Furthermore, as the Poisson distribution only has a rate parameter, the need to estimate the total number of trials, which is a nontrivial task, is evaded [14]. In light of these properties, we believe that a Poisson-based topic model could yield favourable results. In the next section, we investigate the characteristics of word frequencies in our datasets to further motivate the case for the Poisson distribution.

3. Empirical Analysis of Word Occurrences in Short Text

In contrast to the amount of literature available on multinomial-based topic models, there is significantly less research on topic models that are based on the Poisson distribution. This is likely due to the work of Church and Gale [13], which demonstrated that the Poisson distribution is not a good fit for observed word frequencies in real-world texts. They proposed a 2-Poisson mixture as a more suitable alternative. In order to demonstrate that the Poisson distribution fails to model word frequency, Church and Gale [13] selected a word from their corpus, "said," and plotted the graph shown in Figure 1.

Figure 1 shows the number of documents in which the word “said” was used 0 times, 1 time, 2 times, …, or 32 times. The curve shows the predicted number of documents from a Poisson distribution calculated using the maximum likelihood estimate of the parameter. It is clear that the Poisson does not provide a good fit; thus, they proposed a mixture of Poisson distributions or a negative binomial distribution as better alternatives. However, it must be noted that the documents under consideration were long and different results may be observed when the same graph is plotted for a word in a short text corpus. To demonstrate this, we selected the word “jet” from topic 1 in the Pascal Flickr corpus (discussed in the Dataset section) and obtained the results in Figure 2.

The length of the documents considered in Figure 1 was approximately 2 000 words per document, whereas the average length of a document in the Pascal Flickr corpus was merely 5 words, with a minimum and maximum length of 1 and 19 words, respectively. Considering this, it is highly unlikely that large frequencies would be observed. From Figure 2, the maximum frequency of the word "jet" is 1, and as we no longer have the heavy tail, the predicted values from the Poisson distribution (solid line) are close to the observed values. Similar results were observed with many of the other words in the corpus. Thus, we did not deem it necessary to model each word count with a 2-Poisson mixture as proposed by Church and Gale [13].
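This check is easy to reproduce. The following Python sketch (our own illustration, not code from the paper; it assumes the corpus is available as a list of tokenised documents) tabulates how many documents contain a chosen word 0, 1, 2, … times and compares this with the counts predicted by a Poisson distribution fitted by maximum likelihood:

from collections import Counter
import numpy as np
from scipy.stats import poisson

def poisson_fit_check(docs, word):
    # docs: list of lists of tokens; word: the term whose fit we want to inspect
    counts = np.array([doc.count(word) for doc in docs])
    lam = counts.mean()                                   # MLE of the Poisson rate
    observed = Counter(counts.tolist())
    for x in range(int(counts.max()) + 1):
        predicted = poisson.pmf(x, lam) * len(docs)       # expected number of documents
        print(f"{word!r} occurs {x} time(s): observed {observed.get(x, 0)}, "
              f"Poisson-predicted {predicted:.1f}")

# toy example with a tiny corpus
docs = [["jet", "plane", "sky"], ["dog", "grass"], ["jet", "jet", "runway"]]
poisson_fit_check(docs, "jet")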

Another possible discrete distribution that may be considered is the negative binomial distribution. It is able to relax the assumption made by the Poisson distribution that the mean and variance of the data are equal. The negative binomial is the preferred choice when the observed data displays overdispersion; that is, when the variance exceeds the mean. Further investigation of the means and variances of the words in topic 1 of the Pascal Flickr dataset, as well as several other short text corpora, did not yield significant evidence to warrant the use of the negative binomial distribution.
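The corresponding overdispersion check is equally simple: for each word, compare the sample mean and variance of its per-document counts. The sketch below (again our own illustration, not the authors' code) flags words whose variance noticeably exceeds their mean; an empty result suggests that the equidispersed Poisson assumption is reasonable:

import numpy as np

def overdispersion_report(docs, vocabulary, tolerance=1.5):
    # flag words whose count variance exceeds `tolerance` times their mean
    flagged = []
    for word in vocabulary:
        counts = np.array([doc.count(word) for doc in docs], dtype=float)
        mean, var = counts.mean(), counts.var()
        if mean > 0 and var > tolerance * mean:
            flagged.append((word, mean, var))
    return flagged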

Church and Gale [13] also identified a phenomenon referred to as burstiness. A word is said to be bursty or contagious if, after its first mention, it is likely to be observed again in the same document. In order to address word burstiness in the context of document classification, some authors have proposed the use of the Dirichlet compound multinomial [33], whereas others suggest using contagious distributions, such as the negative binomial distribution, to model word frequencies [14]. However, Figure 2 does not appear to display evidence of burstiness, nor did we observe evidence of significant burstiness in the short text corpora we studied.

In conclusion, we can see that a simple Poisson distribution is a viable option for modelling word frequencies in short text. In addition, it is also an attractive choice as it has a conjugate prior, whereas the 2-Poisson mixture and negative binomial suggested by Church and Gale [13] do not. It is for these reasons that we consider topic modelling using the Poisson distribution. We now introduce a new topic model, the Gamma-Poisson mixture (GPM) topic model.

4. The Gamma-Poisson Mixture Topic Model for Short Text

Table 1 shows a summary of the notation that will be used throughout this paper.

The Gamma-Poisson mixture topic model is a hierarchical Bayesian model for topic modelling of short text. For simplicity, it assumes that the frequencies of words in a document are independent of each other and that the corpus is a mixture of documents belonging to different topics. Mixture models are amongst the simplest of latent variable models. Considering the success of GSDMM on short text [9–11] (GSDMM is the collapsed Gibbs sampler version of DMM; the authors in [6] used the abbreviation GSDMM, for Gibbs Sampler DMM; as we use this version in this paper, from here onwards we refer to GSDMM rather than DMM), our GPM topic model makes similar assumptions: (1) documents are generated by a mixture model, and (2) each document belongs to exactly one topic (cluster). This embodies the following probabilistic generative process for a document, $d$:
(1) A topic, $z_d = k$, is randomly selected according to the mixing weights $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$.
(2) A document $d$ is then randomly selected from the conditional distribution $p(d \mid z_d = k)$.

Consequently, the likelihood of a document is given by

$$p(d) = \sum_{k=1}^{K} p(d \mid z_d = k)\, p(z_d = k), \tag{1}$$

where $K$ denotes the total number of topics in the corpus. Like GSDMM, GPM makes the Naïve Bayes assumption: given the topic, the frequencies of the words in the document are independent of each other. Thus, under GPM, the conditional probability of a document given a topic is given by

$$p(d \mid z_d = k) = \prod_{v=1}^{V} p(n_{dv} \mid \lambda_{kv}) = \prod_{v=1}^{V} \frac{\lambda_{kv}^{\,n_{dv}} e^{-\lambda_{kv}}}{n_{dv}!}, \tag{2}$$

where $n_{dv}$ denotes the frequency of word $v$ in document $d$, and $\lambda_{kv}$ denotes the expected frequency of word $v$ in topic $k$. The key difference between GPM and GSDMM arises at this point. The GPM assumes that the frequencies, $n_{dv}$, are modelled by independent Poisson distributions, as opposed to modelling the joint distribution of the counts with a multinomial distribution as in the GSDMM. In addition, owing to its conjugacy, a gamma prior with shape parameter $\alpha_{kv}$ and scale parameter $\beta_{kv}$ is imposed on each $\lambda_{kv}$. Similarly, owing to the Dirichlet distribution's conjugacy to the multinomial, GSDMM assumes a Dirichlet prior.
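As a concrete illustration of equation (2) (a minimal sketch in the notation above, not the authors' implementation), the log-probability of a document's count vector under a given topic is simply a sum of Poisson log-probabilities:

import numpy as np
from scipy.stats import poisson

def log_p_doc_given_topic(n_d, lam_k):
    # n_d: word frequencies of document d (length V)
    # lam_k: Poisson rates lambda_{kv} of topic k (length V)
    return poisson.logpmf(n_d, lam_k).sum()

# toy example with a 5-word vocabulary
n_d = np.array([2, 0, 1, 0, 0])
lam_k = np.array([1.2, 0.1, 0.6, 0.05, 0.3])
print(log_p_doc_given_topic(n_d, lam_k))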

Under the GPM, the mixing weights $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ represent the proportions of each of the $K$ topics in the corpus. The topic assignment of each document is modelled by a multinomial distribution. Thus, $p(z_d = k) = \theta_k$, where $0 \le \theta_k \le 1$ and $\sum_{k=1}^{K} \theta_k = 1$. Furthermore, a Dirichlet prior with parameters $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_K)$ is imposed on $\boldsymbol{\theta}$. As GPM is inherently a mixture model, this part of the model is the same as in GSDMM.

The generative process of GPM can be summarised in a graphical model as in Figure 3.

Figure 3 describes the statistical conditioning of variables on their parent variables. The only random variable that is observed is the corpus, whereas all others are latent variables. In the following section, we will discuss the estimation procedure for the GPM.

5. The Collapsed Gibbs Sampler

A typical Gibbs sampler [34] requires that each parameter be sampled in turn conditioned on all the other parameters. As the topics are only dependent on the topic assignment of each document, it is only necessary to sample the topic assignments. The conjugacy of the chosen priors introduces analytic tractability that makes it possible to easily integrate out the other parameters that would otherwise need to be sampled. Thus, owing to this simplification, this sampling scheme is referred to as a collapsed Gibbs sampler.

Other estimation techniques, such as the Expectation-Maximisation (EM) algorithm, could have also been used. However, it is the collapsed Gibbs sampler that gives our model the favourable property of being able to automatically select the number of latent topics. In practice, one popular way of selecting the number of topics is achieved via the use of nonparametric topic models [35]. Thus, although parametric in nature, our model displays this “nonparametric” behaviour to some extent.

In order to estimate the model parameters, the collapsed Gibbs sampler assigns each document to a single topic. This is achieved by sampling from the conditional probability of document $d$ belonging to topic $k$, $p(z_d = k \mid \mathbf{z}^{\neg d}, \mathbf{C})$. From the rules of conditional probability, it follows that

$$p(z_d = k \mid \mathbf{z}^{\neg d}, \mathbf{C}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) \propto \frac{p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})}{p(\mathbf{C}^{\neg d}, \mathbf{z}^{\neg d} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})}, \tag{3}$$

where the superscript $\neg d$ is used to denote that document $d$ is excluded, $\mathbf{C}$ denotes the corpus, and $\mathbf{z}$ denotes the vector of topic assignments. $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the hyperparameters of the gamma priors, whereas $\boldsymbol{\gamma}$ denotes the hyperparameters of the Dirichlet prior.

In order to sample a topic assignment for each document according to equation (3), only the joint distribution, $p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})$, is required. It is shown in the Appendix that it is given by

$$p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \gamma_k\right)}{\Gamma\!\left(D + \sum_{k=1}^{K} \gamma_k\right)} \prod_{k=1}^{K} \frac{\Gamma(m_k + \gamma_k)}{\Gamma(\gamma_k)} \prod_{k=1}^{K} \prod_{v=1}^{V} \frac{\Gamma(n_{kv} + \alpha_{kv})}{\Gamma(\alpha_{kv})\, \beta_{kv}^{\,\alpha_{kv}} \left(m_k + \frac{1}{\beta_{kv}}\right)^{n_{kv} + \alpha_{kv}} \prod_{d : z_d = k} n_{dv}!}, \tag{4}$$

where $D$ denotes the number of documents in the corpus, $m_k$ denotes the number of documents assigned to topic $k$, and $n_{kv} = \sum_{d : z_d = k} n_{dv}$ denotes the number of times word $v$ is observed in topic $k$.

By substituting equation (4) into equation (3), under the assumption that $\alpha_{kv} = \alpha$, $\beta_{kv} = \beta$, and $\gamma_k = \gamma$ for all $k$ and $v$, it follows that equation (3) can be expressed as

$$p(z_d = k \mid \mathbf{z}^{\neg d}, \mathbf{C}, \alpha, \beta, \gamma) \propto \frac{m_k^{\neg d} + \gamma}{D - 1 + K\gamma} \prod_{v=1}^{V} \frac{\Gamma\!\left(n_{kv}^{\neg d} + n_{dv} + \alpha\right)}{\Gamma\!\left(n_{kv}^{\neg d} + \alpha\right) n_{dv}!} \times \frac{\left(m_k^{\neg d} + \frac{1}{\beta}\right)^{n_k^{\neg d} + V\alpha}}{\left(m_k^{\neg d} + 1 + \frac{1}{\beta}\right)^{n_k^{\neg d} + L_d + V\alpha}}, \tag{5}$$

where $m_k^{\neg d}$ and $n_{kv}^{\neg d}$ are the corresponding counts with document $d$ excluded, $n_k^{\neg d} = \sum_{v=1}^{V} n_{kv}^{\neg d}$, and $L_d = \sum_{v=1}^{V} n_{dv}$ is the length of document $d$.

Thus, for each document, a topic is sampled repeatedly until convergence is achieved. The topics are then described by the following estimates:

$$\hat{\lambda}_{kv} = \frac{n_{kv} + \alpha}{m_k + \frac{1}{\beta}}, \tag{6}$$

where the top words that describe topic $k$ are the words with the highest estimated expected frequencies, $\hat{\lambda}_{kv}$. Full details of the derivation are given in the Appendix.

The collapsed Gibbs sampler for the GPM is summarised in Algorithm 1.

Input: Corpus, $\mathbf{C}$, and the initial number of topics, $K$
Output: Topic labels for each document, $\mathbf{z}$
Begin
 Initialise $m_k$, $n_k$, and $n_{kv}$ to 0 for each topic $k$
 for each document $d = 1, \ldots, D$,
  randomly sample a topic $k$ for $d$:
   $z_d \leftarrow k$
   $m_k \leftarrow m_k + 1$ and $n_k \leftarrow n_k + L_d$
  for each word frequency $n_{dv}$ in $d$
   $n_{kv} \leftarrow n_{kv} + n_{dv}$
 for $i = 1, \ldots, I$ iterations
  for each document $d = 1, \ldots, D$,
   record the current topic of document $d$: $k = z_d$
    $m_k \leftarrow m_k - 1$ and $n_k \leftarrow n_k - L_d$
   for each word frequency $n_{dv}$ in $d$
    $n_{kv} \leftarrow n_{kv} - n_{dv}$
   sample a new topic $k$ for $d$:
    $z_d \leftarrow k$ according to equation (5)
    $m_k \leftarrow m_k + 1$ and $n_k \leftarrow n_k + L_d$
   for each word frequency $n_{dv}$ in $d$
    $n_{kv} \leftarrow n_{kv} + n_{dv}$

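The listing above translates directly into a few lines of NumPy. The sketch below is our own reconstruction (variable names are ours, and the unnormalised sampling weights follow equation (5) as reconstructed above), so it should be read as an illustration rather than the released implementation in the GPM package:

import numpy as np
from scipy.special import gammaln

def gpm_gibbs(X, K=400, alpha=0.001, beta=0.001, gamma=0.1, iters=15, seed=0):
    # X: (D, V) array of (length-normalised) word frequencies
    # K: starting number of topics; empty topics are simply discarded at the end
    rng = np.random.default_rng(seed)
    D, V = X.shape
    doc_len = X.sum(axis=1)                              # L_d
    z = rng.integers(K, size=D)                          # random initial topic labels
    m = np.bincount(z, minlength=K).astype(float)        # m_k: documents per topic
    n_kv = np.zeros((K, V))                              # n_kv: word counts per topic
    for d in range(D):
        n_kv[z[d]] += X[d]
    n_k = n_kv.sum(axis=1)                               # n_k = sum_v n_kv
    for _ in range(iters):
        for d in range(D):
            k_old = z[d]                                 # remove document d from its topic
            m[k_old] -= 1; n_kv[k_old] -= X[d]; n_k[k_old] -= doc_len[d]
            # log of equation (5), up to terms that are constant in k
            log_w = (np.log(m + gamma)
                     + (gammaln(n_kv + X[d] + alpha) - gammaln(n_kv + alpha)).sum(axis=1)
                     + (n_k + V * alpha) * np.log(m + 1.0 / beta)
                     - (n_k + doc_len[d] + V * alpha) * np.log(m + 1.0 + 1.0 / beta))
            w = np.exp(log_w - log_w.max())
            k_new = rng.choice(K, p=w / w.sum())         # sample a new topic for d
            z[d] = k_new
            m[k_new] += 1; n_kv[k_new] += X[d]; n_k[k_new] += doc_len[d]
    lam_hat = (n_kv + alpha) / (m[:, None] + 1.0 / beta)  # equation (6)
    return z, lam_hat, m

After fitting, the number of non-empty topics is (m > 0).sum(), and the top words of topic k are the vocabulary indices with the largest values in lam_hat[k].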
Topic models are very powerful tools as they possess characteristics of both clustering and dimensionality reduction techniques: (1) a corpus is represented in a lower-dimensional form by a set of topics, and (2) similar to clustering, each document is associated with a single topic or with multiple topics, depending on the model. Our GPM topic model possesses both these qualities. The first property is captured by the $\lambda_{kv}$ parameters. The second is satisfied in equation (3). The advantage of topic models over traditional clustering algorithms is that "labels" are also produced, in the form of topics. Topic models are designed to not only provide data compression but to also produce interpretable topics.

In order to demonstrate the utility of our new model, we perform extensive experimentation on different datasets. Details are provided in the next section.

6. Experiments

6.1. Datasets

In order to test our model, we ran experiments on different datasets and compared the performance of GPM against that of GSDMM. The datasets have been summarised in Table 2. All statistics were collected from the datasets after basic preprocessing (removal of stop words, punctuation, special symbols and numbers).

The Tweet dataset [6] is a collection of tweets from the 2011 and 2012 Text REtrieval Conference. The most relevant tweets in 89 different categories were selected to create this collection. Each tweet is regarded as an individual document.

The Pascal Flickr dataset contains captions of images from Flickr and the Pattern Analysis, Statistical Modelling, and Computational Learning (PASCAL) Visual Object Classes Challenge [36]. The captions are divided into 20 different classes and altogether the corpus contains 4 821 captions which are each treated as individual documents.

The Search Snippets dataset [37] was created by first selecting 8 different domains: Business, Computers, Culture-Arts-Entertainment, Education-Science, Engineering, Health, Politics-Society, and Sports. For each domain, 11 to 118 related phrases were typed into the Google search engine, and then the snippets from the top 20 to 30 results were collected to create a corpus of 12 295 snippets.

Note, we will often refer to the original number of classes/categories for each dataset as the true number of topics/clusters or the true $K$.

All datasets can be obtained from https://github.com/qiang2100/STTM [26].

6.2. Experimental Design

All experiments were executed in Python 3.6 in Windows 10 on a computer with a 3.50 GHz quad core processor and 16 GB RAM. We used our own implementations of each model and have made our implementation of the GPM topic model publicly available as a Python package at https://github.com/jrmazarura/GPM. For the GSDMM, the parameter values were set to $\alpha = \beta = 0.1$, and the algorithm was run for 15 iterations, as in the original paper. For the GPM, the $\gamma$ parameter plays the same role as the $\alpha$ parameter in GSDMM; thus, it was also set to 0.1. For the Poisson distribution, we opted for a gamma prior with shape and scale parameters, $\alpha$ and $\beta$, both set to 0.001. This choice is motivated in the upcoming sections.

6.3. Document Length Normalisation

Since the Poisson distribution gives the probability of observing a given number of events in a fixed interval, it is necessary to normalise the lengths of the documents. This is achieved by replacing the word frequencies, $n_{dv}$, with

$$n_{dv}' = n_{dv} \times \frac{L}{L_d}, \tag{7}$$

where $L$ denotes a predefined length and $L_d$ is the length of document $d$ [15]. In all our experiments, we set $L = 20$ and rounded off each $n_{dv}'$ to the nearest integer. $L$ was chosen to be 20 as it provided a good balance between performance and runtime for our datasets.
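For a document-term matrix, this normalisation is a one-liner. The sketch below (our own illustration of equation (7), assuming the counts are stored as a NumPy array) rescales every document to the target length and rounds to the nearest integer:

import numpy as np

def normalise_lengths(X, target_len=20):
    # X: (D, V) array of raw word frequencies
    doc_len = X.sum(axis=1, keepdims=True)
    scaled = X * target_len / np.maximum(doc_len, 1)   # guard against empty documents
    return np.rint(scaled).astype(int)                 # round to the nearest integer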

6.4. Model Evaluation

In order to evaluate the performance of our model, we used the average of the UMass topic coherence [38] scores of the topics found. The coherence score for each topic, $k$, is given by

$$C\!\left(k; V^{(k)}\right) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D\!\left(v_m^{(k)}, v_l^{(k)}\right) + \epsilon}{D\!\left(v_l^{(k)}\right)}, \tag{8}$$

where $v_m^{(k)}$ denotes the $m$th word in topic $k$, $D\!\left(v_m^{(k)}, v_l^{(k)}\right)$ denotes the number of documents in which words $v_m^{(k)}$ and $v_l^{(k)}$ co-occur, $D\!\left(v_l^{(k)}\right)$ denotes the number of documents in which word $v_l^{(k)}$ occurs, and $V^{(k)} = \left(v_1^{(k)}, \ldots, v_M^{(k)}\right)$ is the list of the $M$ most probable words in topic $k$. $\epsilon$ is a smoothing parameter to prevent taking the logarithm of zero, and it is set to 1 as proposed in the original paper.
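The following sketch is a direct transcription of this score (our own code, not the evaluation script used in the experiments); top_words is the ordered list of a topic's top words, docs is the tokenised corpus, and every top word is assumed to occur in at least one document:

import numpy as np

def umass_coherence(top_words, docs, eps=1.0):
    doc_sets = [set(doc) for doc in docs]
    def doc_freq(w):
        return sum(w in s for s in doc_sets)               # D(w)
    def co_doc_freq(w1, w2):
        return sum(w1 in s and w2 in s for s in doc_sets)  # D(w1, w2)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += np.log((co_doc_freq(top_words[m], top_words[l]) + eps)
                            / doc_freq(top_words[l]))
    return score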

As with most topic models, the GPM is an unsupervised technique. Model evaluation is generally not a trivial task in the context of unsupervised learning, as datasets lack labels upon which evaluations can be based. The UMass coherence score is a well-known measure of the degree of interpretability of a topic, and it has been shown to align well with human evaluations of coherence [38]. Naturally, topics that are coherent are most desirable; therefore, a higher average coherence score is preferable. Similar to GSDMM, our model has the special characteristic of being able to automatically select the number of topics; thus, the coherence score is only calculated on the topics found by the model.

7. Results and Discussion

7.1. Influence of the Starting Number of Topics

Topic modelling is typically an unsupervised technique. Similar to $k$-means clustering, the number of topics (clusters), $K$, is a challenge to select as its value is not usually known in advance. The GPM is able to infer the number of topics automatically, provided that the starting value of $K$ is large enough. This is due to the dependence of the topic assignment probability, equation (5), on $m_k^{\neg d}$, which is the number of documents in topic $k$. This implies that a document is more likely to be assigned to a topic which already has documents assigned to it than to a topic that does not.

As will be shown in the next section, the collapsed Gibbs sampler is quick to converge; thus, the Gibbs sampler was run for 15 iterations. As our model also provides stable and relatively consistent results (as will be shown in the next section), experiments were repeated 3 times for each starting value of $K$. We set $\alpha_{kv} = 0.001$ and $\beta_{kv} = 0.001$ for all $k$ and $v$, and $\gamma_k = 0.1$ for all $k$. Table 3 shows the average number of topics found by the model for some of the different starting values of $K$, whereas Table 4 shows the corresponding average coherence scores.

Figures 4–6 provide a visual summary of these results. According to Figures 4(a), 5(a), and 6(a), in all cases, the model approaches the true number of topics as the starting number of topics increases. In most cases, the most accurate number of topics was found by setting $K$ to 400. For the Tweet dataset, the model converges to between 70 and 80 topics, which is close to the true value of 89. For the other datasets, the model slightly overestimates the number of topics. On the Pascal Flickr dataset, at $K = 400$, the final number of clusters is overestimated by about 10 topics (true $K = 20$), whereas on the Search Snippets dataset, the final number of clusters is overestimated by about 20 topics (true $K = 8$). One possible reason for this difference could be that the human labelling may have been too rigid, and documents were classified into too few topics, yet there may have been subtopics present. Consequently, it is possible that such a discrepancy could also arise if different human reviewers were tasked with labelling each document independently. In the context of topic modelling, this difference is not usually a problem, especially if the topics are interpretable, as the model may have simply identified subtopics present in the corpus. Since the model does not differentiate between "main" topics and subtopics, they would all be included together in the final topic count. Nonetheless, it is still striking that, in both cases, the model was able to automatically discard the extra 80–90% of topics that were unnecessary. This greatly alleviates the challenge of selecting the appropriate value of $K$.

In topic modelling, one of the most important aspects is the interpretability of the uncovered topics. Even if the final number of clusters found is not necessarily the same as what human annotators would find, it is important that the words in the topics are coherent. Figures 4(b), 5(b), and 6(b) show that the coherence improves as the initial $K$ increases. In fact, a point is reached where there is almost no further improvement in average coherence when the initial number of topics is increased. In most cases, there appears to be an insignificant improvement in the coherence score when $K$ is selected to be greater than 200.

7.2. Influence of the Number of Iterations

One of the challenges faced when using sampling methods to estimate parameters is determining the appropriate number of sampling iterations to perform. In order to investigate the performance of the models with respect to the number of iterations, we recorded the average coherence and the number of clusters at each of 30 iterations. This was repeated three times for each dataset. From the previous results, we found that the number of clusters was close to the human-annotated number and the coherence scores reached their maximum when the model started with 400 topics; thus, we use this value in all the experiments. In addition, we also set $\alpha_{kv} = 0.001$ and $\beta_{kv} = 0.001$ for all $k$ and $v$, and $\gamma_k = 0.1$ for all $k$. The results are shown in Figures 7–9. The graphs in (a) all show the number of clusters that the model found at each iteration, whereas the graphs in (b) show the topic coherence at each iteration. In general, similar patterns are observed. It is evident that convergence is reached quickly. In all cases, convergence is reached by the 15th iteration, and the variation in the results is typically relatively small.

7.3. Influence of Alpha and Beta

The hyperparameters $\alpha$ and $\beta$ represent the shape and scale parameters of the gamma distribution, respectively, and $\gamma$ represents the hyperparameter of the Dirichlet prior. We assume that $\alpha_{kv} = \alpha$, $\beta_{kv} = \beta$, and $\gamma_k = \gamma$ for all $k$ and $v$. The parameter $\gamma$ is analogous to the $\alpha$ parameter of the GSDMM. The authors of GSDMM conducted experiments to investigate the impact of different selections of $\alpha$ on the number of clusters found, and they observed that it had a very small impact. As expected, we also observed similar results with GPM; thus, we only focus on the impact of $\alpha$ and $\beta$, assuming $\gamma = 0.1$. The GPM was run on the Pascal Flickr dataset for a grid of $\alpha$ and $\beta$ values. Then, the final number of clusters found was recorded. The results on the Pascal Flickr dataset are shown in Figure 10.

Owing to the computationally heavy nature of performing a grid search, each experiment was run only once per pair of $\alpha$ and $\beta$ values, with the starting number of topics set to at least 20 more than the true value. Figure 10 shows a clear downward trend for all values of $\beta$, the scale parameter. However, the final number of topics found is clearly influenced by the shape parameter, $\alpha$. On the Pascal Flickr dataset, the model was only able to get close to the true number of topics (20) when $\alpha$ was chosen to be near 0.5. Similar downward trends were also observed on the other two datasets, and $\beta$ was again found to have minimal impact on the number of topics found. However, for the Tweet dataset, $\alpha$ was required to be near 0.05 for the model to find close to 89 topics, whereas the Search Snippets dataset required an $\alpha$ value close to 1.5 to find close to 8 topics. Figure 11 shows the probability density functions of the gamma distributions with these different values of $\alpha$ and a fixed value of $\beta$. It is evident that these choices of $\alpha$ tend to produce skewed distributions, which place most of their probability near 0.

Based on the chosen values of $\alpha$ and $\beta$, the expected values of the gamma priors ($\mathbb{E}[\lambda_{kv}] = \alpha\beta$) for the Tweet, Pascal Flickr, and Search Snippets datasets are 0.025, 0.25, and 0.75, respectively. Considering the short length of the documents and the massive sizes of the vocabularies, it is not surprising that most words will have very low observed frequencies. In fact, since many zeros are observed for each word, the estimates of the Poisson parameters are also very small, which results in most of the probability being loaded on zero. For example, $P(n_{dv} = 0) = e^{-\lambda} \approx 0.98$ for a Poisson variable with rate $\lambda = 0.025$.
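As a quick numerical check of how much probability such small rates place on zero (our own arithmetic, using the prior means quoted above):

import numpy as np

# P(count = 0) = exp(-lambda) under a Poisson whose rate equals each dataset's prior mean
for name, prior_mean in [("Tweet", 0.025), ("Pascal Flickr", 0.25), ("Search Snippets", 0.75)]:
    print(f"{name}: P(count = 0 | lambda = {prior_mean}) = {np.exp(-prior_mean):.3f}")
# prints 0.975, 0.779, and 0.472, respectively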

A similar comparison to that of Figure 10 was also done to investigate the impact of $\alpha$ and $\beta$ on the coherence scores, and the results from the Pascal Flickr dataset are shown in Figure 12. Figure 12 shows the average coherence of the topics found for the same values of $\alpha$, $\beta$, and $\gamma$ used in Figure 10. The labels at each point indicate the number of topics found by the model.

In Figure 12, a general pattern is observed. The coherence scores appear to increase and then drop as $\alpha$ increases, and once again, $\beta$ does not appear to have a significant impact. The Tweet and Search Snippets datasets also displayed general patterns, but the pattern was not necessarily the same across the datasets. This simply serves as an indication that the selection of $\alpha$ is not a trivial task.

It is evident from Figure 12 that increasing $\alpha$ decreases the number of topics the model tends to find. Conversely, as $\alpha$ decreases, the number of topics found increases. Interestingly, this behaviour of $\alpha$ in GPM is similar to the behaviour that was observed with the $\beta$ parameter of GSDMM. According to equation (5), for small values of $\alpha$, the probability of a document belonging to a topic is more sensitive to $n_{kv}^{\neg d}$, the number of times word $v$ is observed in topic $k$. This means that when a topic has more words in common with a document, it is more likely to be assigned to that topic. On the other hand, when $\alpha$ is large, the probability of being assigned to a topic is less sensitive to $n_{kv}^{\neg d}$. Instead, the probability is influenced more by the first term in equation (5), which depends on $m_k^{\neg d}$, the number of documents in topic $k$. As a result, a topic with more documents is likely to grow larger, since equation (5) will assign more probability to topics that contain more documents. This explains the tendency of the model to assign all the documents to one topic when $\alpha$ is large.

In practice, the number of clusters is not usually known in advance, so it is not possible to use the true $K$ to choose a suitable value for $\alpha$. Furthermore, the coherence is also not always highest at the true number of clusters. In order to overcome this challenge, we then considered setting the hyperparameters of the gamma prior to $\alpha = \beta = 0.001$. (In Bayesian literature, the gamma distribution with shape and rate parameters both equal to 0.001 is a commonly used noninformative prior [39]. In this paper, the gamma distribution is parameterised by shape and scale parameters. Despite using the scale-parameter instead of the rate-parameter formulation, we show empirically that choosing 0.001 yields better performance than other choices. This is likely because the data contain many zeros, and this gamma places most of its probability around 0.) The top horizontal line in Figure 12 shows the coherence score found by the GPM under this gamma (0.001, 0.001) prior. (This result is for a fixed $\alpha$ and $\beta$, but it is shown in the graph as a horizontal line across all $\alpha$ values to emphasise that this choice of parameters outperforms the GPM with other choices of $\alpha$ and $\beta$. For ease of comparison, the GSDMM is also indicated by a horizontal line, although its hyperparameters are also fixed.) The coherence is not only higher than that of the other selections of $\alpha$ and $\beta$, but the GPM also outperforms the GSDMM model (indicated by the lower horizontal line). In addition, the average number of clusters found by the GPM was also close to the true value. It is for this reason that we recommend the use of $\alpha = \beta = 0.001$ and use these values in all our experiments.

In conclusion, it is clear that this selection of $\alpha$ and $\beta$ greatly simplifies the topic modelling process for the GPM. In addition, we have also seen that the model possesses the flexibility of allowing the user to easily adjust the number of topics found by simply changing the value of $\alpha$.

7.4. Comparison with Dirichlet-Multinomial Mixture Model

The GSDMM model was originally presented as a clustering algorithm, as opposed to a topic model, and it was consequently assessed on its ability to cluster documents [6]. As the GPM is designed for topic modelling, we assess its ability to extract meaningful topics by investigating the topic coherence. Despite there being other topic models for short text, the GPM is related to the GSDMM in that it also makes the one-topic-per-document assumption and is able to automatically select the number of topics. In order to compare the performance of the GPM topic model against that of the GSDMM, both models were fitted to the datasets, and the results are summarised in the figures and tables that follow. All models were run for 15 iterations, starting with 400 initial topics. This was repeated 10 times for each model; for the GPM, we set $\alpha_{kv} = 0.001$ and $\beta_{kv} = 0.001$ for all $k$ and $v$, and $\gamma_k = 0.1$ for all $k$.

Figure 13 shows boxplots of the topic coherence scores. It is evident that the GPM generally outperforms the GSDMM in all three datasets, as the topic coherence of the topics obtained by the GPM is mostly larger than that of the GSDMM.

For completeness, we also consider the number of clusters found by each model in Table 5. For the Tweet corpus, the true number of clusters, as determined by human annotators, is 89. On average, the GSDMM was more inclined to find more clusters than the GPM. It is also worthwhile to note that the results obtained for the GSDMM on the Tweet dataset are close to those obtained in the original paper [6]. On the Pascal Flickr and Search Snippets datasets, both models tended to find more clusters than those determined by the human annotators. However, the GPM was able to get closer to the true value than the GSDMM. Interestingly, on the Search Snippets corpus, the GSDMM found significantly more topics than were found by the GPM. It is likely the case that the GSDMM found finer-grained topics, thus increasing the number of topics found, whereas the GPM model discovered fewer, but more general, topics.

We now consider the actual topics found by the models on one of the datasets. We specifically focus on the Search Snippets results in order to observe what other topics were found by the GSDMM model that were not found by the GPM. Table 6 lists some of the top words for each of the topics found by the GPM (column 2), as well as possible labels for each topic (column 1). The labels have been assigned based on the original 8 topics of the dataset, and then a possible subtopic label was added in parentheses. This labelling and selection of subtopics was performed subjectively, so another annotator’s assessment may produce different results.

In assigning the topics to the predefined labels, one challenge faced was that some topics had potential overlaps. For instance, a topic in the Engineering category could also have fallen in the Education-Science category. By analysing the first column, we also observe that 7 out of the 8 original predefined topics appear to be represented in these results. According to our labelling, the missing topic is the Engineering topic. This is most likely due to the fact that only 369 of the 12 295 documents belonged to this topic, which is merely 3% of the entire corpus. The proportions of each topic in the Search Snippets corpus are shown in Figure 14.

As was observed in Table 5, the GSDMM found more than 250 extra topics. Table 7 shows two additional subtopics for each of the 8 predefined categories that were found by the GSDMM, but not the GPM.

Since GSDMM found significantly more topics, it was able to uncover finer-grained topics. Thus, in such cases where a brief overview is desired, the model producing the smaller number of topics might be preferable. Where more detail is desired, one can opt for a model that produces more topics.

8. Conclusions and Future Work

Despite the lack of attention on the Poisson distribution in topic modelling, we have shown its utility in modelling short text. We proposed a new topic model for short text, the Gamma-Poisson mixture (GPM) topic model, and performed extensive experimentation in order to investigate its properties empirically. In addition, we also derived a collapsed Gibbs sampler for the model.

As is well known in the field of topic modelling, the selection of the appropriate number of topics is a challenge. Our new topic model addresses this through its ability to automatically select the number of topics. This is achieved via the use of the collapsed Gibbs sampler. We also showed that our model was able to find estimates that were close to the true number of topics on labelled corpora. A further benefit of the collapsed Gibbs sampler is that it converges very quickly, thus evading the need for the long burn-in periods that are typical in the application of traditional Gibbs samplers.

In addition, GPM possesses the flexibility of allowing the user to adjust the number of topics found as required. It also tends to produce consistent results with little variation. Furthermore, the GPM outperformed the GSDMM in two respects. Firstly, the number of topics found by GPM was closer to the true value. Secondly, GPM was able to find topics with higher average coherence scores, thus making it a good option for topic modelling on short text.

There are many avenues for future work related to the GPM. We plan to assess the GPM on other performance measures, such as classification accuracy in end-to-end classification. We will also perform further experimentation to compare it against other short text topic models.

Appendix

The Derivation of the Collapsed Gibbs Sampler for the Gamma-Poisson Mixture Model

A summary of the notation used is given in Table 1.

Since the topic estimates are only dependent on the topic assignments, it is only necessary to sample the topic assignment for each document. This is achieved by sampling from the conditional probability of a document belonging to a topic:

$$p(z_d = k \mid \mathbf{z}^{\neg d}, \mathbf{C}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) = \frac{p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})}{p(\mathbf{C}, \mathbf{z}^{\neg d} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})} \propto \frac{p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})}{p(\mathbf{C}^{\neg d}, \mathbf{z}^{\neg d} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})}, \tag{A.1}$$

where the superscript $\neg d$ is used to indicate that document $d$ is excluded from $\mathbf{C}$ and $\mathbf{z}$. $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the hyperparameters of the gamma prior, whereas $\boldsymbol{\gamma}$ denotes the hyperparameters of the Dirichlet prior. In order to define equation (A.1), we need to find $p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma})$.

Owing to conditional independence between $\mathbf{C}$ and $\boldsymbol{\gamma}$ (given $\mathbf{z}$), and between $\mathbf{z}$ and $(\boldsymbol{\alpha}, \boldsymbol{\beta})$, it follows that

$$p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) = p(\mathbf{C} \mid \mathbf{z}, \boldsymbol{\alpha}, \boldsymbol{\beta})\, p(\mathbf{z} \mid \boldsymbol{\gamma}). \tag{A.2}$$

It was shown in [6] that the second term on the right-hand side of equation (A.2) simplifies to

$$p(\mathbf{z} \mid \boldsymbol{\gamma}) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \gamma_k\right)}{\Gamma\!\left(D + \sum_{k=1}^{K} \gamma_k\right)} \prod_{k=1}^{K} \frac{\Gamma(m_k + \gamma_k)}{\Gamma(\gamma_k)}, \tag{A.3}$$

where $D$ is the number of documents in the corpus and $m_k$ denotes the number of documents assigned to the $k$th topic. Using the same notation as in [6], it follows that $m_k = \sum_{d=1}^{D} \mathbb{I}(z_d = k)$ and $\sum_{k=1}^{K} m_k = D$.

Now, the first term on the right-hand side of equation (A.2) can be expressed as

$$p(\mathbf{C} \mid \mathbf{z}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \int p(\mathbf{C} \mid \mathbf{z}, \boldsymbol{\lambda})\, p(\boldsymbol{\lambda} \mid \boldsymbol{\alpha}, \boldsymbol{\beta})\, d\boldsymbol{\lambda}. \tag{A.4}$$

Under GPM, documents and words are assumed to be independent. In addition, the word counts are assumed to follow a Poisson distribution. Thus, given the topics, the corpus can be modelled as

$$p(\mathbf{C} \mid \mathbf{z}, \boldsymbol{\lambda}) = \prod_{d=1}^{D} \prod_{v=1}^{V} \frac{\lambda_{z_d v}^{\,n_{dv}} e^{-\lambda_{z_d v}}}{n_{dv}!}. \tag{A.5}$$

In order to simplify further derivation of the collapsed Gibbs sampler, equation (A.5) can be re-expressed by the introduction of $m_k$, the number of documents assigned to the $k$th topic, and $n_{kv}$, the number of times word $v$ is observed in topic $k$, as follows:

$$p(\mathbf{C} \mid \mathbf{z}, \boldsymbol{\lambda}) = \prod_{k=1}^{K} \prod_{v=1}^{V} \frac{\lambda_{kv}^{\,n_{kv}} e^{-m_k \lambda_{kv}}}{\prod_{d : z_d = k} n_{dv}!}, \tag{A.6}$$

where $n_{kv} = \sum_{d : z_d = k} n_{dv}$.

Now, by assuming a gamma distribution for each $\lambda_{kv}$ and substituting equation (A.6) into equation (A.4), we obtain

$$p(\mathbf{C} \mid \mathbf{z}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{k=1}^{K} \prod_{v=1}^{V} \frac{1}{\Gamma(\alpha_{kv})\, \beta_{kv}^{\,\alpha_{kv}} \prod_{d : z_d = k} n_{dv}!} \int_0^{\infty} \lambda_{kv}^{\,n_{kv} + \alpha_{kv} - 1} e^{-\lambda_{kv}\left(m_k + \frac{1}{\beta_{kv}}\right)}\, d\lambda_{kv}. \tag{A.7}$$

The integral is solved by multiplying the integrand by a constant equal to 1 (the normalising constant of a gamma density divided by itself), so that the integrand can be recognised as a gamma density, which integrates to 1. The gamma density in question has shape parameter $n_{kv} + \alpha_{kv}$ and scale parameter $\left(m_k + \frac{1}{\beta_{kv}}\right)^{-1}$. By substituting equations (A.3) and (A.7), equation (A.2) can now be written as

$$p(\mathbf{C}, \mathbf{z} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \gamma_k\right)}{\Gamma\!\left(D + \sum_{k=1}^{K} \gamma_k\right)} \prod_{k=1}^{K} \frac{\Gamma(m_k + \gamma_k)}{\Gamma(\gamma_k)} \prod_{k=1}^{K} \prod_{v=1}^{V} \frac{\Gamma(n_{kv} + \alpha_{kv})}{\Gamma(\alpha_{kv})\, \beta_{kv}^{\,\alpha_{kv}} \left(m_k + \frac{1}{\beta_{kv}}\right)^{n_{kv} + \alpha_{kv}} \prod_{d : z_d = k} n_{dv}!}. \tag{A.8}$$
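For readers who want the intermediate step spelled out, the integral over $\lambda_{kv}$ in equation (A.7) evaluates as follows (in the notation reconstructed above):

$$\int_0^{\infty} \lambda_{kv}^{\,n_{kv} + \alpha_{kv} - 1} e^{-\lambda_{kv}\left(m_k + \frac{1}{\beta_{kv}}\right)}\, d\lambda_{kv} = \frac{\Gamma(n_{kv} + \alpha_{kv})}{\left(m_k + \frac{1}{\beta_{kv}}\right)^{n_{kv} + \alpha_{kv}}} \int_0^{\infty} \frac{\left(m_k + \frac{1}{\beta_{kv}}\right)^{n_{kv} + \alpha_{kv}}}{\Gamma(n_{kv} + \alpha_{kv})}\, \lambda_{kv}^{\,n_{kv} + \alpha_{kv} - 1} e^{-\lambda_{kv}\left(m_k + \frac{1}{\beta_{kv}}\right)}\, d\lambda_{kv} = \frac{\Gamma(n_{kv} + \alpha_{kv})}{\left(m_k + \frac{1}{\beta_{kv}}\right)^{n_{kv} + \alpha_{kv}}},$$

since the second integrand is exactly the density of a gamma distribution with shape $n_{kv} + \alpha_{kv}$ and scale $\left(m_k + \frac{1}{\beta_{kv}}\right)^{-1}$ and therefore integrates to 1. Dividing the result by the remaining prior constant $\Gamma(\alpha_{kv})\, \beta_{kv}^{\,\alpha_{kv}}$ and by $\prod_{d : z_d = k} n_{dv}!$ gives the corresponding term of equation (A.8).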

The derivation of the conditional distribution in equation (A.1) can now be concluded by substituting equation (A.8) as follows:

$$p(z_d = k \mid \mathbf{z}^{\neg d}, \mathbf{C}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) \propto \frac{m_k^{\neg d} + \gamma_k}{D - 1 + \sum_{k'=1}^{K} \gamma_{k'}} \prod_{v=1}^{V} \left[ \frac{\Gamma\!\left(n_{kv}^{\neg d} + n_{dv} + \alpha_{kv}\right)}{\Gamma\!\left(n_{kv}^{\neg d} + \alpha_{kv}\right) n_{dv}!} \times \frac{\left(m_k^{\neg d} + \frac{1}{\beta_{kv}}\right)^{n_{kv}^{\neg d} + \alpha_{kv}}}{\left(m_k^{\neg d} + 1 + \frac{1}{\beta_{kv}}\right)^{n_{kv}^{\neg d} + n_{dv} + \alpha_{kv}}} \right], \tag{A.9}$$

where $m_k^{\neg d}$ and $n_{kv}^{\neg d}$ denote the counts $m_k$ and $n_{kv}$ with document $d$ excluded. We also make use of the fact that the gamma function has the property that $\Gamma(x + 1) = x\,\Gamma(x)$. If it is assumed that $\alpha_{kv} = \alpha$, $\beta_{kv} = \beta$, and $\gamma_k = \gamma$ for all $k$ and $v$, then equation (A.9) simplifies to

$$p(z_d = k \mid \mathbf{z}^{\neg d}, \mathbf{C}, \alpha, \beta, \gamma) \propto \frac{m_k^{\neg d} + \gamma}{D - 1 + K\gamma} \prod_{v=1}^{V} \frac{\Gamma\!\left(n_{kv}^{\neg d} + n_{dv} + \alpha\right)}{\Gamma\!\left(n_{kv}^{\neg d} + \alpha\right) n_{dv}!} \times \frac{\left(m_k^{\neg d} + \frac{1}{\beta}\right)^{n_k^{\neg d} + V\alpha}}{\left(m_k^{\neg d} + 1 + \frac{1}{\beta}\right)^{n_k^{\neg d} + L_d + V\alpha}}, \tag{A.10}$$

where $L_d = \sum_{v=1}^{V} n_{dv}$ is the length of the $d$th document, $n_k^{\neg d} = \sum_{v=1}^{V} n_{kv}^{\neg d}$, and $m_k^{\neg d}$ denotes the total number of documents in topic $k$ excluding document $d$.

Lastly, after sampling from equation (A.9) or (A.10) until convergence, the lambda parameters, which give the topic distributions, are estimated by the posterior means. The posterior is given by

$$p(\lambda_{kv} \mid \mathbf{C}, \mathbf{z}, \alpha, \beta) \propto \lambda_{kv}^{\,n_{kv} + \alpha - 1} e^{-\lambda_{kv}\left(m_k + \frac{1}{\beta}\right)}. \tag{A.11}$$

It follows that $\lambda_{kv} \mid \mathbf{C}, \mathbf{z} \sim \text{Gamma}\!\left(n_{kv} + \alpha,\ \left(m_k + \frac{1}{\beta}\right)^{-1}\right)$, and the topic distribution estimates are given by

$$\hat{\lambda}_{kv} = \mathbb{E}\!\left[\lambda_{kv} \mid \mathbf{C}, \mathbf{z}\right] = \frac{n_{kv} + \alpha}{m_k + \frac{1}{\beta}}. \tag{A.12}$$

The top words that describe topic $k$ are the words with the highest expected frequencies, $\hat{\lambda}_{kv}$.

Data Availability

All datasets can be obtained from https://github.com/qiang2100/STTM.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was performed as part of the employment of the authors by the University of Pretoria and was supported by the Centre for Artificial Intelligence Research (CAIR).