Abstract

Topic modeling is a probabilistic generative approach for discovering the representative topics of a document and has been successfully applied to various document-related tasks in recent years. In particular, supervised topic models and time topic models have achieved notable success. A supervised topic model learns topics from documents annotated with multiple labels, and a time topic model learns topics that evolve over time in a sequentially organized corpus. In practice, however, many documents carry both multiple labels and time stamps, which calls for a supervised time topic model to handle document-related tasks. Little research has addressed such a model. To fill this gap, we propose a method for constructing a supervised time topic model. By analysing the generative processes of the supervised topic model and the time topic model, respectively, we describe in detail how to build a supervised time topic model based on a variational autoencoder and conduct preliminary experiments. Experimental results demonstrate that the supervised time topic model outperforms several state-of-the-art topic models.

1. Introduction

Nowadays, many kinds of text information, such as news, blogs, books, and social network posts, accompany people's daily lives. Traditional methods struggle to extract useful information from such an ever-growing amount of data. The probabilistic topic model is a technology that can help people organize, index, search, and browse these large collections automatically.

Latent Dirichlet allocation (LDA) [1] is a classical topic model for finding the representative topics of a document. During the past decade, LDA has been successfully applied to various document-related tasks, such as text classification [2], clustering [3], and summarization [4]. However, as a static topic model, LDA has two limitations.

Firstly, the number of topics is hard to determine in a static topic model. To select the best number of topics, most methods compare models with different numbers of topics and choose the one with the lowest perplexity or the largest likelihood estimate [1, 5, 6]. In practice, these methods often make the latent topics difficult to interpret. Secondly, the static topic model assumes that the text information is exchangeable [1]; that is, the pieces of text are unordered with respect to each other [7]. This simplified assumption is inappropriate and unrealistic [8].

To overcome these limitations, the supervised topic model and the time topic model have been proposed, respectively. In recent years, both families of models have found a variety of applications.

The objective of the supervised topic model is to learn topics from documents annotated with multiple labels. To the best of our knowledge, labeled LDA [9] is the classical supervised topic model, which matches multiple topics to the labels of a document. The number of topics is determined by the metadata of the document (such as its labels), and the topic terms offer a better way to interpret topics [10]. In addition, the supervised topic model supports a variety of applications, such as social event analysis [11], abnormal event detection [12], document classification [13], and tag recommendation [14].

The objective of the time topic model is to construct a topic model that evolves over time in a sequentially organized corpus. To the best of our knowledge, the dynamic topic model is the first time topic model that captures the evolution of topics in a sequentially organized corpus of documents [7]. Based on the time topic model, a series of applications have been studied, for example, dynamic feature extraction [15], automated behaviour analysis [16], travel recommendation [17], and tracking urban geotopics [18].

However, many real-world documents carry both multiple labels and time stamps; for example, a scientific paper has keywords and a publication date. Park et al. proposed a related model whose main idea is to generate numerical time-series variables as supervised metadata, with a single topic for each time slice [19]. That model cannot be regarded as a supervised time topic model that learns topics from documents with multiple labels and time stamps and applies the topics to various document-related tasks. As a result, it is necessary to propose a method for constructing a supervised time topic model. To the best of our knowledge, this is the first work on constructing such a model. The contributions of this paper are as follows.

We reveal and discuss the limitations of the main current works on topic models, which show that constructing a supervised time topic model is necessary. We propose a method for constructing a supervised time topic model based on variational autoencoders, denoted ST-TM, which is designed to handle documents that have multiple labels and time stamps. The reasoning and construction process of ST-TM is presented in detail. For a preliminary evaluation, we compare ST-TM with state-of-the-art methods; the results show that our proposed method is more effective.

The rest of the paper is organized as follows: we first review related research. Second, the method for constructing a supervised time topic model based on variational autoencoders is described in detail. Then, experiments on the proposed method are conducted. Finally, the paper is concluded.

2. Related Works

In this section, we review some representative works on supervised topic models and time topic models, respectively. Based on an analysis of the limitations of these works, we present a supervised time topic model to address them.

2.1. Supervised Topic Model

To the best of our knowledge, supervised latent Dirichlet allocation is the first supervised topic model; it adds a response variable associated with each document and uses variational methods to handle intractable posterior expectations [20]. Since then, several supervised topic models have been proposed, for example, discriminative LDA [21] and maximum entropy discrimination LDA [22]. The above methods consider only a single topic (label) for each document.

It is well known that assigning only a single topic to a document is often inappropriate. For example, a document on social education includes both social and educational topics. Based on this fact, supervised topic models with multiple topics have been proposed. Labeled LDA matches multiple topics to the labels of a document [9]. The number of topics is determined by the metadata of the document (such as its labels), and the topic terms offer a better way to interpret topics [10]. Partially labeled LDA learns latent topic structure within the scope of observed, human-interpretable labels [23]. Nonparametric labeled LDA uses the Dirichlet process with mixed random measures as a base distribution of the hierarchical Dirichlet process framework [24]. Dependency-LDA further considers the label frequency and label dependency observed in the training data when constructing the supervised topic model [2].

We review labeled LDA (L-LDA), which is a representative supervised topic model. To incorporate supervision, L-LDA imposes a 1 : 1 correspondence between topics and labels. Besides explicit labels, the keywords of scientific papers and the categories of news articles can also be treated as topics [10]. L-LDA is a probabilistic graphical model that describes a process for generating a labeled document collection. The graphical model of L-LDA is shown in Figure 1.

In Figure 1, the two grey-filled nodes are the observable (explicit) variables, indicating the labels and the terms of a document, respectively. Nodes without fill are latent, unobservable variables. Unlike LDA, both the label set and the topic prior influence the topic mixture of a document. L-LDA restricts each document to a multinomial distribution over the labels that appear in the corpus. Each label is represented as a topic, that is, a multinomial distribution Φc over the terms. The generative process of L-LDA is shown in Table 1. Different from LDA, the whole generative process of L-LDA is constrained by the topics (labels). More detailed descriptions of L-LDA are presented in the literature [9].
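To make this constrained generative process concrete, the following is a minimal Python sketch of the sampling procedure summarized in Table 1, assuming the usual L-LDA notation (Dirichlet hyperparameters alpha and eta, per-label term distributions phi); the dimensions and hyperparameter values are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 4, 1000          # number of labels/topics and vocabulary size (assumed)
alpha, eta = 0.1, 0.01  # Dirichlet hyperparameters (assumed)

# One multinomial distribution over terms per label: phi_k ~ Dirichlet(eta)
phi = rng.dirichlet(eta * np.ones(V), size=K)

def generate_document(label_set, n_words):
    """Generate one labeled document restricted to its own label set."""
    labels = np.array(sorted(label_set))
    # The topic mixture is drawn only over the document's labels
    theta = rng.dirichlet(alpha * np.ones(len(labels)))
    words = []
    for _ in range(n_words):
        z = labels[rng.choice(len(labels), p=theta)]   # pick one of the document's labels
        w = rng.choice(V, p=phi[z])                     # pick a term from that label's topic
        words.append(w)
    return words

doc = generate_document(label_set={0, 2}, n_words=50)
```

The key difference from LDA is that the topic mixture is drawn only over a document's own label set, which enforces the 1 : 1 topic-label correspondence.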

2.2. Time Topic Model

The time topic model is developed to analyse the temporal evolution of topics in large document collections. To the best of our knowledge, the dynamic topic model is the first time topic model that captures the evolution of topics in a sequentially organized corpus of documents [7]. To simplify the inference procedure of the time topic model, a variational autoencoder has been used to construct it [8]. To build cross-lingual tools, a multilingual dynamic topic model has been proposed that captures cross-lingual topics evolving over time [25]. We review the dynamic topic model (DTM), which is a representative time topic model. DTM assumes that the data are divided into time slices, for example, by decades. The graphical model of DTM is shown in Figure 2.

In Figure 2, β_t denotes the vector of natural parameters of a topic evolving over time, and α_t denotes the vector of mean parameters of the logistic normal distribution for the topic proportions. However, the Dirichlet distribution is not amenable to sequential modelling. The literature [7] therefore chains the parameter vectors β_t and α_t in state-space models that evolve with Gaussian noise, respectively. By chaining together the topics and the topic proportion distributions, DTM ties a collection of topic models sequentially. The generative process for slice t of a sequential corpus is shown in Table 2. The whole process of DTM is constrained by the time slice t; more detailed descriptions of DTM are presented in the literature [7].
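For illustration, the following is a minimal sketch of the per-slice generative process summarized in Table 2, following the standard DTM formulation [7]; the dimensions, noise variances, and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K, V = 5, 1000                          # number of topics and vocabulary size (assumed)
sigma2, delta2, a2 = 0.01, 0.01, 0.5    # Gaussian noise variances (assumed)

def next_slice(beta_prev, alpha_prev):
    """Evolve the state variables of slice t-1 to slice t with Gaussian noise."""
    beta_t = beta_prev + rng.normal(0.0, np.sqrt(sigma2), size=beta_prev.shape)
    alpha_t = alpha_prev + rng.normal(0.0, np.sqrt(delta2), size=alpha_prev.shape)
    return beta_t, alpha_t

def generate_document(beta_t, alpha_t, n_words):
    """Generate one document in slice t via the logistic normal construction."""
    eta = alpha_t + rng.normal(0.0, np.sqrt(a2), size=K)   # per-document proportions
    theta = softmax(eta)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                 # draw a topic
        w = rng.choice(V, p=softmax(beta_t[z]))    # draw a term from that topic
        words.append(w)
    return words

beta0 = rng.normal(0.0, 1.0, size=(K, V))
alpha0 = np.zeros(K)
beta1, alpha1 = next_slice(beta0, alpha0)
doc = generate_document(beta1, alpha1, n_words=50)
```

Chaining next_slice over successive slices is what sequentially ties the per-slice topic models together.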

With the above methods, supervised topic models and time topic models have been implemented for mining topics. However, many documents have multiple labels and time stamps in reality; for example, a scientific paper has keywords and a publication date. There are few research papers on supervised time topic models that learn topics for such documents. As a result, it is necessary to propose a method for constructing a supervised time topic model for documents with multiple labels and time stamps.

3. Methods

In order to make the proposed method for constructing a supervised time topic model easier to describe and understand, we summarize some major notations needed in our formulation in Table 3.

The method for constructing a supervised time topic model is based on the generative processes of the supervised topic model and the time topic model. The difficulty lies in deriving the loss function. In this section, we describe in detail the method for constructing a supervised time topic model based on variational autoencoders, denoted ST-TM; the graphical model of ST-TM is shown in Figure 3.

The generative process of ST-TM differs from those of L-LDA and DTM: the corpus is divided into time slices and each time slice has its own supervision. Moreover, unlike DTM, ST-TM removes the transitive dependencies between the state variables of adjacent time slices so that the reasoning process of the model can be further simplified. The generative process of ST-TM is shown in Table 4.

The implementation of ST-TM depends on a variational autoencoder, a deep learning technique for learning latent representations. The variational autoencoder is composed of an encoder and a decoder. In each time slice, the encoder generates a variational approximate posterior distribution of the document-topic vector, and the decoder estimates the optimal generation probability given that distribution. The network structure of the variational autoencoder for the topic model is shown in Figure 4.

In Figure 4, the model assumes that the approximate posterior is a Gaussian distribution with a diagonal covariance matrix. The encoder learns the latent mean and variance for the term-topic vector. To generate documents that are as close as possible to the input documents, the decoder estimates a better probabilistic model according to the known latent representation. The generation process requires backpropagation to calculate the gradient of the error function. In general, we want to minimize an expected cost by gradient descent, which requires computing gradients. The sampling step therefore adopts the reparametrization trick, which makes these gradients computable. The graphical model of the reparametrization trick is shown in Figure 5.

In Figure 5, the random variable (at left) can be reparameterized as a deterministic function of the Gaussian parameters and an auxiliary random variable. More specifically, the reparametrization trick used in this paper draws a standard Gaussian variable ε and rewrites the latent variable as z = μ + σ ⊙ ε, where ⊙ denotes the elementwise product. Therefore, the mean and variance enter only through linear operations, which can be optimized easily by a stochastic gradient descent algorithm.
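The following minimal NumPy sketch illustrates this reparameterized sampling step; the latent dimension and variable names are assumptions for the example, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterized_sample(mu, log_var):
    """Draw z ~ N(mu, diag(exp(log_var))) as a deterministic function of (mu, log_var)."""
    sigma = np.exp(0.5 * log_var)          # standard deviation from the log-variance
    eps = rng.standard_normal(mu.shape)    # auxiliary standard Gaussian noise
    return mu + sigma * eps                # z = mu + sigma * eps, elementwise

# Example: a 10-dimensional latent document-topic vector (illustrative)
mu = np.zeros(10)
log_var = np.full(10, -1.0)
z = reparameterized_sample(mu, log_var)
```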

In this paper, the encoder is a two-layer neural network that generates the variational approximate posterior distribution of the document-topic vector, and the decoder estimates a better probabilistic model according to the known latent representation. According to the generative process of ST-TM, the marginal likelihood function of the generated documents, which serves as the variational target, is given in equations (1) and (2).

Unlike in Table 3, here the number of documents refers to those in a single time slice. A variational distribution is introduced to approximate the intractable posterior. According to Jensen's inequality, the marginal logarithmic likelihood of a generated document is bounded from below by a variational lower bound, with equality if and only if the variational distribution equals the true posterior. To find the variational distribution that best approximates the true posterior distribution, the most common approach minimizes the Kullback–Leibler divergence (KL-divergence) between the variational distribution and the true posterior. The resulting objective, given in equation (6), consists of two parts: the first part is the expectation of the reconstruction error and the second part is the KL-divergence of the approximate posterior distribution from the true posterior distribution. Within a time slice, the documents are independent samples, each generated from its own latent point, and the reconstructed document obeys a multivariate Bernoulli distribution. The expectation of the reconstruction error can therefore be measured by the cross-entropy cost function, which is commonly used to measure the difference between predicted and actual values; in addition, it avoids the slowdown of learning during gradient descent [26]. The logarithmic likelihood of the reconstruction is thus the negative cross entropy.
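As an illustration of the reconstruction term, the sketch below computes the negative cross entropy between a binary bag-of-words input and the decoder's reconstruction probabilities under a multivariate Bernoulli model; the vocabulary size and values are illustrative only.

```python
import numpy as np

def bernoulli_log_likelihood(x, x_hat, eps=1e-10):
    """Negative cross entropy between a binary input vector x and the decoder's
    reconstruction probabilities x_hat (multivariate Bernoulli model)."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

# Example: a 6-term vocabulary, binary term occurrences and reconstruction probabilities
x = np.array([1, 0, 1, 0, 0, 1], dtype=float)
x_hat = np.array([0.9, 0.1, 0.8, 0.2, 0.1, 0.7])
reconstruction_term = bernoulli_log_likelihood(x, x_hat)   # to be maximized
```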

To simplify the complex reasoning process, the transitive dependencies between the state variables of adjacent time slices are removed; the state variables are instead initialized by Gaussian sampling. In ST-TM, the encoder generates a variational approximate posterior distribution of the document-topic vector. We assume that this approximate posterior is a Gaussian with a diagonal covariance matrix, whose mean and covariance are generated by the neural network. Following the reparametrization trick described above, we rewrite the latent variable with an auxiliary standard Gaussian variable, which yields a differentiable estimator of the variational lower bound [27].

For the true posterior distribution, the latent variable is assumed to obey a logarithmic Gaussian distribution whose covariance is a hyperparameter and whose mean is initialized by Gaussian random sampling in place of the removed transitive dependency. According to the probability density function of the multivariate Gaussian distribution, the second part of equation (6), the KL-divergence term, can be derived in closed form; its dimension is the number of columns of the document-topic constraint matrix, that is, the number of topics in time slice t. In conclusion, the final variational objective can be represented by a loss function that combines the expected reconstruction error and this KL-divergence term.
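To make the final objective concrete, the sketch below combines the reconstruction term with the closed-form KL-divergence between the encoder's diagonal Gaussian and an isotropic Gaussian prior. Treating the prior as N(mu0, var0·I) and the variable names are assumptions for the example; they are not the exact quantities of equation (6).

```python
import numpy as np

def kl_diag_gaussians(mu, log_var, mu0, var0):
    """KL( N(mu, diag(exp(log_var))) || N(mu0, var0 * I) ) in closed form."""
    var = np.exp(log_var)
    return 0.5 * np.sum(var / var0 + (mu - mu0) ** 2 / var0 - 1.0
                        + np.log(var0) - log_var)

def st_tm_loss(x, x_hat, mu, log_var, mu0, var0, eps=1e-10):
    """Negative variational lower bound: reconstruction cross entropy plus KL term."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    return recon + kl_diag_gaussians(mu, log_var, mu0, var0)
```

Minimizing this loss by stochastic gradient descent is equivalent to maximizing the variational lower bound on the marginal log-likelihood.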

4. Experimental Results and Analysis

In this section, we conduct experiments on a real dataset to verify the effectiveness of the proposed method. Our approach is compared with state-of-the-art topic models.

4.1. Experiment Environment and Dataset

The experiments were executed on a personal computer with an AMD FX-8350 CPU @4.0 GHz (eight physical cores) and 24 GB DDR3 RAM @1600 MHz. The machine ran the Windows 7 (64-bit) operating system, TensorFlow 1.4.0 with CPU support only, and Python 3.6.

To demonstrate the effectiveness of our proposed approach, we use the paper corpus of SIGIR (International ACM SIGIR Conference on Research and Development in Information Retrieval) as the dataset. The main purpose of SIGIR is to present new technologies and achievements related to information retrieval. The dataset consists of the abstracts of papers from the 2018, 2013, and 2009 annual conferences. The experimental data, consisting of 564 papers with 76 topics and about 90,000 words, can be downloaded from the official SIGIR website. We divided the dataset into 3 time slices by year.

Our neural network structure is composed of an encoder and a decoder. The encoder has one input layer and two hidden layers, and the decoder has two hidden layers and one output layer. The input and output layers have the same number of neurons as the number of papers, which changes across time slices. Each hidden layer has 100 hidden neurons, following an existing experimental conclusion [8]. Moreover, the learning rate is 0.002 and the number of iterations is 300, which achieves convergence on our dataset.
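The following sketch, written against the TensorFlow 1.x API used in the experiments, shows one way to realize the encoder/decoder structure described above (two hidden layers of 100 neurons, learning rate 0.002). The ReLU activations, the Adam optimizer, and the standard Gaussian prior in the KL term are assumptions; the sketch is not the authors' ST-TM implementation, which uses a per-slice prior instead of the zero-mean prior below.

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API, as used in the paper

def build_vae(input_dim, n_topics, hidden=100, learning_rate=0.002):
    """Encoder: input -> two hidden layers -> (mu, log_var); decoder: two hidden layers -> reconstruction."""
    x = tf.placeholder(tf.float32, [None, input_dim], name="x")

    # Encoder: two hidden layers of 100 neurons each
    h = tf.layers.dense(x, hidden, activation=tf.nn.relu)
    h = tf.layers.dense(h, hidden, activation=tf.nn.relu)
    mu = tf.layers.dense(h, n_topics)
    log_var = tf.layers.dense(h, n_topics)

    # Reparametrization trick: z = mu + sigma * eps
    eps = tf.random_normal(tf.shape(mu))
    z = mu + tf.exp(0.5 * log_var) * eps

    # Decoder: two hidden layers, then reconstruction probabilities
    g = tf.layers.dense(z, hidden, activation=tf.nn.relu)
    g = tf.layers.dense(g, hidden, activation=tf.nn.relu)
    x_hat = tf.layers.dense(g, input_dim, activation=tf.nn.sigmoid)

    # Loss: reconstruction cross entropy plus KL to a standard Gaussian prior (assumed)
    recon = -tf.reduce_sum(x * tf.log(x_hat + 1e-10)
                           + (1.0 - x) * tf.log(1.0 - x_hat + 1e-10), axis=1)
    kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=1)
    loss = tf.reduce_mean(recon + kl)

    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    return x, loss, train_op
```

In ST-TM, the per-slice prior mean obtained by Gaussian sampling would replace the zero mean assumed in the KL term above.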

4.2. Evaluation Metrics

It is difficult to choose an appropriate evaluation method for topic models. Although the traditional evaluation method uses perplexity to assess the quality of topics, perplexity is generally not well suited to capturing the semantic property of topic terms [28]. Lau et al. [29] use Normalized Pointwise Mutual Information (NPMI) to evaluate the quality of topics, and NPMI is close to the ordinary understanding and judgment of people.

The experiment is evaluated by the NPMI score, which expresses the coherence of a topic. For a topic represented by its top N topic terms, the NPMI of a pair of terms w_i and w_j is log[P(w_i, w_j) / (P(w_i)P(w_j))] / (−log P(w_i, w_j)), where P(w_i), P(w_j), and P(w_i, w_j) are the probability of the ith topic term, the probability of the jth topic term, and the cooccurrence probability of the ith and jth topic terms, respectively. The coherence of a topic is the average NPMI over all pairs of its top terms, and the experimental results report the average topic coherence over the K topics. The higher the average NPMI value, the better the topic model.
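For reference, a minimal sketch of computing the average NPMI coherence of one topic from document-level cooccurrence probabilities is given below; the smoothing constant, the handling of zero cooccurrence, and the averaging over unordered term pairs are assumptions about the exact convention.

```python
import numpy as np
from itertools import combinations

def topic_npmi(top_terms, documents, eps=1e-12):
    """Average NPMI over all pairs of a topic's top terms,
    using document-level (co)occurrence probabilities."""
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]

    def p(*terms):
        return sum(all(t in d for t in terms) for d in doc_sets) / n_docs

    scores = []
    for wi, wj in combinations(top_terms, 2):
        p_ij = p(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)   # convention: non-cooccurring pairs get the minimum NPMI
            continue
        pmi = np.log(p_ij / (p(wi) * p(wj) + eps))
        scores.append(pmi / max(-np.log(p_ij), eps))
    return float(np.mean(scores))

# Example usage with toy tokenized documents and a topic's top terms
docs = [["topic", "model", "document"], ["topic", "document", "label"], ["time", "slice", "topic"]]
print(topic_npmi(["topic", "document", "model"], docs))
```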

4.3. Baseline Methods

To validate the effectiveness of our method, three state-of-the-art topic models are compared with ST-TM. The first baseline is collapsed Gibbs sampling on the LDA model, denoted SDTM Gibbs [30]. The second is the static topic model based on autoencoding variational inference, denoted LDA VAE [31]. The last uses a variational autoencoder to construct a time topic model, denoted DTM VAE [8]. The details of all methods are provided in Table 5. All baseline methods are implemented as described in their original papers.

4.4. Comparison with Baseline Methods

We conduct a preliminary experiment. The convergence process of each method is shown in Figure 6. The convergence of Gibbs sampling is judged by the perplexity value, which measures the stability of the topic model, while the loss function is used to determine the convergence of the other methods, which are based on a variational autoencoder.

The NPMI values of all methods are shown in Figure 7. Our method achieves the highest NPMI value of 0.316, while the other methods reach 0.083, 0.169, and 0.299, respectively. Compared with the traditional Gibbs sampling method of constructing topic models, the variational autoencoder approach is more effective. The supervised time topic model outperforms the static topic model and the unsupervised time topic model in the semantic property of topic terms. The results of the preliminary experiment show that ST-TM performs very well on the interpretability of topics.

4.5. Discussion

The proposed method for constructing a supervised time topic model is verified on a public dataset, and the experimental results show that the average NPMI is improved by introducing time slices and supervision. The proposed method is easier to operate than the traditional Gibbs sampling approach to constructing topic models [30] and achieves good results. In addition, the proposed method builds on the basic structure of the variational autoencoder method [31] and introduces time slices and supervision, which helps in understanding the supervised time topic model. The largest advantage of this work is that it solves the problem of constructing a supervised time topic model for documents with multiple labels and time stamps in order to support document-related tasks.

5. Conclusions

In this paper, we propose a method for constructing a supervised time topic model based on variational autoencoders. We reveal and discuss the limitations of the main current works on supervised topic models and time topic models, respectively. We further present the reasoning and construction process of the supervised time topic model; specifically, we propose its graphical model. The implementation of the method depends on a variational autoencoder composed of an encoder and a decoder: the encoder generates a variational approximate posterior distribution of the document-topic vector, and the decoder estimates the optimal generation probability. The reasoning process that takes the marginal likelihood of the generated documents as the variational target is presented. In addition, we compare our method with baselines in a preliminary evaluation. The results show that our method is more effective than the baseline methods. The contribution of this work is to solve the problem of constructing a supervised time topic model for documents with multiple labels and time stamps in order to support document-related tasks. The relevant theories of the topic model can help researchers understand, reason about, and use the supervised time topic model.

In future work, we will further complete, refine, and adapt the method of constructing a supervised time topic model. Furthermore, to achieve better performance, we will also apply the constructed supervised time topic model to information retrieval, recommender systems, text classification, and other fields.

Data Availability

According to the funding policy of this work, data cannot be shared or made publicly available during the funding contract.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the Scientific Research Project of Hebei Education Department of China (Grant no. QN2020198) and the Natural Science Foundation of Hebei Province of China (Grant no. F2020207001).