Abstract

Unstructured textual news data are produced every day; analyzing these data with an abstractive summarization algorithm provides advanced analytics to decision-makers. Deep learning networks with a copy mechanism are finding increasing use in abstractive summarization, because the copy mechanism allows sequence-to-sequence models to choose words from the input and place them directly into the output. However, since there is no explicit delimiter in Chinese sentences, most existing models for Chinese abstractive summarization can only perform character copy, which is inefficient. To solve this problem, we propose a lexicon-constrained copying network that models multigranularity in both the encoder and the decoder. On the source side, words and characters are aggregated into the same input memory using a Transformer-based encoder. On the target side, the decoder can copy either a character or a multicharacter word at each time step, and the decoding process is guided by a word-enhanced search algorithm that facilitates parallel computation and encourages the model to copy more words. Moreover, we adopt a word selector to integrate keyword information. Experimental results on a Chinese social media dataset show that our model can work standalone or with the word selector; both configurations outperform previous character-based models and achieve competitive performance.

1. Introduction

In recent years, abstractive summarization [1] has made impressive progress with the development of the sequence-to-sequence (seq2seq) framework [2, 3]. This framework is composed of an encoder and a decoder. The encoder processes the source text and extracts the necessary information for the decoder, which then predicts each word in the summary. Thanks to their generative nature, abstractive summaries can include novel expressions never seen in the source text, but at the same time, they are more difficult to produce than extractive summaries [4, 5], which are formed by directly selecting a subset of the source text.

It has also been found that seq2seq-based abstractive methods usually struggle to generate out-of-vocabulary (OOV) or rare words, even when those words appear in the source text. The copy mechanism [6] alleviates this problem while maintaining the expressive power of the seq2seq framework. The idea is to allow the decoder not only to generate a summary from scratch but also to copy words from the source text.

Though effective in English text summarization, the copy mechanism remains relatively undeveloped in the summarization of some East Asian languages, e.g., Chinese. Generally speaking, abstractive methods for Chinese text summarization come in two varieties: word-based and character-based. Since there is no explicit delimiter in a Chinese sentence to indicate word boundaries, the first step of word-based methods [7] is to perform word segmentation [8, 9]. In practice, to avoid segmentation errors and to reduce the vocabulary size, most existing methods are character-based [10–12]. When character-based methods are combined with the copy mechanism, the original “word copy” degrades to “character copy,” which does not guarantee that a multicharacter word is copied verbatim from the source text [13]. Unfortunately, copying multicharacter words is quite common in Chinese summarization tasks. Take the Large-Scale Chinese Social Media Text Summarization Dataset (LCSTS) [7] as an example; according to Table 1, about 37% of the words in the summaries are copied from the source texts and consist of multiple characters.

Selective read [13] was proposed to handle this problem. It computes a weighted sum of the encoder states corresponding to the last generated character and adds the result to the input of the next decoding step. Selective read provides location information about the source text for the decoder and helps it perform consecutive copying. A disadvantage of this approach, however, is that it increases the reliance of the current computation on partial results from earlier steps, which makes the model more vulnerable to error accumulation and leads to exposure bias [14] during inference. Another way to make copied content consecutive is to copy text spans directly. Zhou et al. [15] implement a span copy operation by equipping the decoder with a module that predicts the start and end positions of the span. Because a longer span can be decomposed into shorter ones, there are many different paths that generate the same summary during inference, but their model is optimized with only the longest common span at each time step during training, which exacerbates the discrepancy between the two phases. In this work, we propose a novel lexicon-constrained copying network (LCN). The decoder of the LCN can copy either a single character or a text span at a time, and we constrain each text span to match a potential multicharacter word. Specifically, given a text and several off-the-shelf word segmenters, if a text span is included in any segmentation result of the text, we consider it a potential word. By doing so, the number of available spans is significantly reduced, making it viable to marginalize over all possible paths during training. Furthermore, during inference, we aggregate on the fly all partial paths that produce the same output, using a word-enhanced beam search algorithm that encourages the model to copy multicharacter words and facilitates parallel computation.

To be in line with the aforementioned decoder, the encoder should be revised to learn representations of not only characters but also multicharacter words. In the context of neural machine translation, Su et al. [16] first organized characters and multicharacter words into a directed graph named the word lattice. Following Xiao et al. [17], we adopt a Transformer-based [18] encoder that takes the word lattice as input and allows each character and word to have its own hidden representation. By taking relative positional information into account when calculating self-attention, our encoder can capture both global and local dependencies among tokens, providing an informative representation of the source text for the decoder to make copy decisions.

Although our model is character-based (only characters are included in the input vocabulary), it can directly utilize word-level prior knowledge, such as keywords. In our setting, keywords are words in the source text that have a high probability of appearing in the summary. Inspired by Gehrmann et al. [19], we adopt a separate word selector based on a large pretrained language model, BERT [20], to extract keywords. When the decoder intends to copy words from the source text, the selected keywords are treated as candidates, and other words are masked out. Experimental results show that our model achieves better performance when combined with the word selector.

2. Related Work

Most existing neural methods for abstractive summarization fall into the sequence-to-sequence framework. Among them, models based on recurrent neural networks (RNNs) [21–23] are more common than those built on convolutional neural networks (CNNs) [1, 24], because the former handle long sequences more effectively. Attention [3] is easily integrated with RNNs and CNNs, as it allows the model to focus on the salient parts of the source text [25, 26]. It can also serve as a pointer that selects words in the source text for copying [6, 13]. In particular, architectures constructed entirely of attention, e.g., the Transformer [18], can be adopted to capture global dependencies between the source text and the summary [19].

Prior knowledge has proven helpful for generating informative and readable summaries. Templates retrieved from the training data can guide the summarization process at the sentence level when encoded in conjunction with the source text [27, 28]. Song et al. [29] show that syntactic structure can help locate content worth keeping in the summary, such as the main verb. Keywords are commonly used in Chinese text summarization. When the decoder queries the source representation, Wang and Ren [30] use keywords extracted by an unsupervised method to exclude noisy and redundant information. Deng et al. [31] propose a word-based model that not only utilizes keywords in the decoding process but also adds keywords produced by a generative method to the vocabulary in the hope of alleviating the out-of-vocabulary (OOV) problem. Our model differs drastically from these two models in how keywords are extracted and encoded.

The most closely related works are in the field of neural machine translation, where many researchers resort to multigranularity information. On the source side, Su et al. [16] use an RNN-based network to encode the word lattice, an input graph that contains both words and characters. Xiao et al. [17] apply the lattice-structured input to the Transformer [18] and generalize the lattice construction to the subword level. To take full advantage of multihead attention in the Transformer, Nguyen et al. [32] first partition the input sequence into phrase fragments based on n-gram type and then allow each head to attend either to one certain n-gram type or to all n-gram types at the same time. In addition to n-gram phrases, the multigranularity self-attention proposed by Hao et al. [33] also attends to syntactic phrases obtained from syntactic trees to enhance structure modeling. On the target side, when the decoder produces an UNK symbol denoting a rare or unknown word, Luong et al. restore it to a natural word using a character-level component. Srinivasan et al. [34] adopt multiple decoders that map the same input into translations at different subword levels and combine all the translations into the final result, aiming to improve the flexibility of the model without losing semantic information. While our model and the above models all utilize multigranularity information, ours differs in that we impose a lexical constraint on both encoding and decoding.

3. Model

This section describes our proposed model in detail.

3.1. Notations

Let the character sequence $x = (x_1, x_2, \dots, x_n)$ be a source text. We define a text span $x_{i:j}$ that starts with $x_i$ and ends with $x_j$ as a potential word if it is contained in any word segmentation result of $x$. Because both characters and potential words can be regarded as tokens, we collect all characters and potential words of the source text into a token sequence $s = (s_1, s_2, \dots, s_m)$.

3.2. Input Representation

Given a token $s_k = (c_1, c_2, \dots, c_l)$, where $l$ is the token length ($l = 1$ when $s_k$ is a character), we first convert it into a sequence of vectors $(e_1, e_2, \dots, e_l)$ using the character embedding matrix. Then, a bidirectional Long Short-Term Memory network (bi-LSTM) is applied to model the token composition:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{h}_{t+1}), \qquad u_k = [\,\overleftarrow{h}_1; \overrightarrow{h}_l\,],$$

where $u_k$ denotes the input token representation, formed by concatenating the backward state of the beginning character and the forward state of the ending character.

Since the Transformer has no sequential structure, Vaswani et al. [18] proposed positional encoding to explicitly model the order of the sequence. In this work, we assign each token an absolute position determined by its first character. For example, the absolute position of the word “留学 (studying abroad)” in Figure 1 is the same as that of the character “留 (stay).” By adding the encoding of the absolute position to the token representation $u_k$, we obtain the final input representation $v_k$.
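As a minimal sketch of this input representation, the following PyTorch-style module embeds one token's characters, runs a bi-LSTM over them, concatenates the backward state of the first character with the forward state of the last character, and adds a precomputed positional encoding. Class and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Illustrative sketch: build a token representation from its characters
    via a bi-LSTM, then add the encoding of the token's absolute position."""
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d_model)
        # hidden size d_model // 2 per direction so the concatenation is d_model
        self.bilstm = nn.LSTM(d_model, d_model // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, char_ids, pos_encoding):
        # char_ids: (1, token_len) character indices of one token
        # pos_encoding: (d_model,) encoding of the token's absolute position
        emb = self.char_emb(char_ids)                 # (1, token_len, d_model)
        states, _ = self.bilstm(emb)                  # (1, token_len, d_model)
        half = states.size(-1) // 2
        forward_last = states[0, -1, :half]           # forward state of the ending character
        backward_first = states[0, 0, half:]          # backward state of the beginning character
        token_repr = torch.cat([backward_first, forward_last], dim=-1)
        return token_repr + pos_encoding              # final input representation
```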

3.3. Encoder

However, absolute position alone cannot precisely reflect the relationship among tokens. Consider again the example in Figure 1; the distance between the word “留学 (studying abroad)” and the character “生 (life)” is 2 according to their absolute positions, but they are actually neighboring tokens in a certain segmentation. To alleviate this problem, Xiao et al. [17] extend the Transformer [18] by taking relation types into account when calculating self-attention. In this work, we adopt relative position as an alternative to relation type. The main idea is that relative position is complementary to absolute position and can guide each token to interact with other tokens in a coherent manner. Given two tokens $s_k$ and $s_j$, each corresponding to a span of the source text, the position of $s_j$ relative to $s_k$ is determined by both their beginning and ending characters, as shown in Table 2. Following Xiao et al. [17], we revise self-attention to integrate relative positional information. Concretely, a self-attention layer consists of $n_h$ heads, which operate in parallel on a sequence of context vectors $(v_1, v_2, \dots, v_m)$ with dimension $d_{\mathrm{model}}$. After modification, the resulting output $\mathrm{attn}_k$ for each attention head is defined as

$$\mathrm{attn}_k = \sum_{j=1}^{m} \alpha_{kj}\,\bigl(v_j W^V + a^V_{kj}\bigr), \qquad \alpha_{kj} = \operatorname{softmax}_j\!\left(\frac{(v_k W^Q)\,(v_j W^K + a^K_{kj})^{\top}}{\sqrt{d_h}}\right),$$

where $W^Q$, $W^K$, and $W^V$ are model parameters, $d_h$ is the hidden dimension of each head, and $a^K_{kj}$ and $a^V_{kj}$ are learned embeddings that encode the position of token $s_j$ relative to token $s_k$. We concatenate the outputs of all heads to restore their dimension to $d_{\mathrm{model}}$ and then apply the other sublayers (such as the feed-forward layer) used in the original Transformer [18] to obtain the final output of the layer. Several identical self-attention layers are stacked to build our encoder. For the first layer, the input is the input representation $v_k$; for subsequent layers, the input is the output of the previous layer.
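The sketch below shows one such attention head with relative-position embeddings added to keys and values, in the form described above. It is an illustrative PyTorch implementation under assumptions of our own (e.g., the number of relative-position types and all names are hypothetical).

```python
import torch
import torch.nn as nn

class RelativeAttentionHead(nn.Module):
    """Sketch of a single self-attention head that adds learned
    relative-position embeddings to the keys and values."""
    def __init__(self, d_model=512, d_head=64, num_rel_positions=17):
        super().__init__()
        self.wq = nn.Linear(d_model, d_head, bias=False)
        self.wk = nn.Linear(d_model, d_head, bias=False)
        self.wv = nn.Linear(d_model, d_head, bias=False)
        # one embedding per relative-position type, for keys and for values
        self.rel_k = nn.Embedding(num_rel_positions, d_head)
        self.rel_v = nn.Embedding(num_rel_positions, d_head)

    def forward(self, x, rel_ids):
        # x: (m, d_model) token vectors; rel_ids: (m, m) relative-position indices
        q, k, v = self.wq(x), self.wk(x), self.wv(x)        # each (m, d_head)
        ak, av = self.rel_k(rel_ids), self.rel_v(rel_ids)   # each (m, m, d_head)
        # scores[k, j] = q_k · (k_j + a^K_kj) / sqrt(d_head)
        scores = (q.unsqueeze(1) * (k.unsqueeze(0) + ak)).sum(-1) / (q.size(-1) ** 0.5)
        alpha = torch.softmax(scores, dim=-1)               # (m, m)
        # attn_k = sum_j alpha[k, j] * (v_j + a^V_kj)
        return (alpha.unsqueeze(-1) * (v.unsqueeze(0) + av)).sum(dim=1)  # (m, d_head)
```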

3.4. Decoder

The encoder proposed by Xiao et al. [17] takes both characters and words as input and thus has the ability to learn multigranularity representations. However, as their decoder is character-based, consuming and outputting only characters, the word representations induced by the encoder receive no direct supervision signal from the decoder and remain a subsidiary part of the input memory. To alleviate this problem, we extend the standard Transformer decoder with a lexicon-constrained copying module, which not only allows the decoder to perform multicharacter word copy but also provides auxiliary supervision for the word representations. Specifically, at each time step $t$, we leverage a single-head attention over the input memory $(h_1, h_2, \dots, h_m)$ and the decoder hidden state $z_t$ to produce the copy distribution $\beta_t$ and the context vector $c_t$:

$$\beta_{tk} = \operatorname{softmax}_k\!\left(\frac{(z_t W^Q)\,(h_k W^K)^{\top}}{\sqrt{d_{\mathrm{model}}}}\right), \qquad c_t = \sum_{k=1}^{m} \beta_{tk}\, h_k W^V.$$

In addition to the predefined character vocabulary $V$ and a special token UNK denoting any out-of-vocabulary token, the lexicon-constrained copying module expands the output space with two sets $X_c$ and $X_w$, consisting of the characters and multicharacter words that appear in the source text, respectively, so that the probability of emitting a token $w$ at time step $t$ is

$$P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w)\, \mathbb{1}\!\left[w \in V \cup \{\mathrm{UNK}\}\right] + (1 - p_{\mathrm{gen}}) \sum_{k\,:\, s_k = w} \beta_{tk},$$

where $p_{\mathrm{gen}}$ and $1 - p_{\mathrm{gen}}$ control the decoder switching between generation mode and copy mode, and $P_{\mathrm{vocab}}$ provides a probability distribution over the character vocabulary $V$:

$$p_{\mathrm{gen}} = \sigma\!\left(W_g [z_t; c_t] + b_g\right), \qquad P_{\mathrm{vocab}} = \operatorname{softmax}\!\left(W_v [z_t; c_t] + b_v\right),$$

where $\sigma$ is the sigmoid function.
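A minimal sketch of how such a mixture of generation and copy probabilities can be assembled is given below. The mapping `src_token_to_output_id` is a hypothetical precomputed lookup from each source token (character or potential word) to its index in the expanded output space; it is not a name from the paper.

```python
import torch

def emission_distribution(p_gen, p_vocab, copy_attn, src_token_to_output_id, output_size):
    """Sketch: mix the generation and copy distributions into one distribution.
    p_gen:      scalar tensor, probability of generation mode
    p_vocab:    (|V|+1,) distribution over the character vocabulary plus UNK
    copy_attn:  (m,) copy distribution over source tokens (characters and words)
    src_token_to_output_id: (m,) LongTensor, expanded-output index of each source token
    """
    probs = torch.zeros(output_size)
    probs[: p_vocab.size(0)] = p_gen * p_vocab                    # generation mode
    # copy mode: tokens that occur several times in the source accumulate probability
    probs.index_add_(0, src_token_to_output_id, (1 - p_gen) * copy_attn)
    return probs
```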

With the introduction of the lexicon-constrained copying module, our decoder can predict tokens of variable lengths at each time step and can therefore generate any segmentation of a sentence. Naturally, we want to evaluate the probability of a summary by marginalizing over all its segmentations. For example, the probability of a summary consisting of only the word “北京 (BeiJing)” can be factorized as

$$P(\text{北京}) = P(\text{北京} \mid \langle e \rangle)\, P(\langle e \rangle \mid \langle e \rangle, \text{北京}) + P(\text{北} \mid \langle e \rangle)\, P(\text{京} \mid \langle e \rangle, \text{北})\, P(\langle e \rangle \mid \langle e \rangle, \text{北}, \text{京}),$$

where each term corresponds to a segmentation and is the product of conditional probabilities; we use $\langle e \rangle$ to denote either the beginning or the end of a sentence. Note that the conditional probability here depends on the current segmentation, which means that the decoder directly takes the tokens of a segmentation as input. However, if we feed the decoder with character-level input and reformulate the conditional probability accordingly, the above probability can be rewritten as

$$P(\text{北京}) = \bigl[P(\text{北京} \mid \langle e \rangle) + P(\text{北} \mid \langle e \rangle)\, P(\text{京} \mid \langle e \rangle, \text{北})\bigr]\, P(\langle e \rangle \mid \langle e \rangle, \text{北}, \text{京}),$$

where we factor out $P(\langle e \rangle \mid \langle e \rangle, \text{北}, \text{京})$, because it is shared by the two segmentations. As can be seen from this example, the assumption that the conditional probability of a token depends only on its preceding character sequence allows computation to be reused and thus makes it feasible to apply dynamic programming. Formally, let the character sequence $y = (y_1, y_2, \dots, y_T)$ be a summary; its probability can be computed by the recursion

$$\gamma_0 = 1, \qquad \gamma_i = \sum_{j < i\,:\; y_{j+1:i}\ \text{is a character or a potential word}} \gamma_j\, P(y_{j+1:i} \mid y_{1:j}), \qquad P(y) = \gamma_T\, P(\langle e \rangle \mid y_{1:T}).$$

Note that all of the above probabilities are also conditioned on the source text $x$; we omit it for simplicity. We train the model by maximizing $\log P(y \mid x)$ for every training pair $(x, y)$ in the dataset.
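To make the marginalization concrete, the following sketch implements the recursion above in plain Python, using probabilities rather than log probabilities for readability (a real implementation would work in log space with log-sum-exp). Here `token_prob`, `MAX_TOKEN_LEN`, and `EOS` are hypothetical stand-ins for the decoder call and constants, not names from the paper.

```python
MAX_TOKEN_LEN = 8   # assumed upper bound on potential-word length (hypothetical)
EOS = "<e>"         # end-of-sentence marker, matching the notation in the text

def summary_probability(summary_chars, potential_words, token_prob):
    """Sketch of the recursion: gamma[i] sums, over every token that could end at
    position i, the probability of the preceding prefix times the model's
    conditional probability of that token given the preceding characters.
    token_prob(prefix_chars, token) is a hypothetical callback onto the decoder
    returning P(token | prefix characters, source)."""
    T = len(summary_chars)
    gamma = [0.0] * (T + 1)
    gamma[0] = 1.0
    for i in range(1, T + 1):
        for j in range(max(0, i - MAX_TOKEN_LEN), i):
            token = "".join(summary_chars[j:i])
            # a valid token is a single character or a potential multicharacter word
            if i - j == 1 or token in potential_words:
                gamma[i] += gamma[j] * token_prob(summary_chars[:j], token)
    # the end-of-sentence term closes the recursion
    return gamma[T] * token_prob(summary_chars, EOS)
```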

During inference, since there is no access to the ground truth, we need a search algorithm that guides the generation of the summary in a left-to-right fashion. Beam search [35] is the most common search algorithm in seq2seq frameworks, but it cannot be used directly in our scenario. To illustrate this, we first define a hypothesis as a partial output consisting of tokens. Hypotheses can be further divided into character hypotheses and word hypotheses based on whether their last token is a character or a multicharacter word. For hypotheses within a beam, the standard beam search algorithm updates their states by feeding their last tokens to the decoder and then generates new hypotheses by suffixing them with a token sampled from the model's distribution. Because our decoder is designed to take only characters as input, multiple decoder steps are required to update the state of a word hypothesis. As a result, it is difficult to conduct a batched update for a beam containing both word hypotheses and character hypotheses. To this end, we propose a novel word-enhanced beam search algorithm, in which the beam is split into two parts: the character beam and the word beam. The word beam is used to update the states of word hypotheses; once their states are fully updated, word hypotheses are placed into the character beam (see lines 5-8 of Algorithm 1). Note that we do not perform a generation step for word hypotheses in the word beam; that is to say, for the same length, the more multicharacter words a hypothesis includes, the more generation steps it can skip, which may give it a higher probability.

Input: model, source text, maximum summary length T, beam size B
 1: charBeam ← {⟨e⟩}
 2: wordBeam ← ∅ and finished ← ∅
 3: for t = 1; t ≤ T; t ← t + 1 do
 4:  Update decoder states for hypos in charBeam
 5:  Update decoder states for hypos in wordBeam
    //Batched update for hypos in both charBeam and wordBeam
 6:  for hyp ∈ wordBeam do
 7:   if hyp is fully updated then
 8:    move hyp into charBeam
 9:   end if
 10:  end for
 11:  Merge hypos of the same character sequence in charBeam
 12:  newHypos ← Generate(charBeam)
    //Generating new hypotheses and their respective
    log probabilities from charBeam
 13:  for hyp ∈ newHypos do
 14:   if hyp ends with a multicharacter word then
 15:    Move hyp from newHypos into wordBeam
 16:   end if
 17:  end for
 18:  charBeam ← the B best hypos in newHypos
 19:  wordBeam ← the B best hypos in wordBeam
 20:  for hyp ∈ charBeam do
 21:   if hyp ends with ⟨e⟩ or t = T then
 22:    Move hyp from charBeam into finished
 23:   end if
 24:  end for
 25: end for
 26: finalHyp ← the hyp in finished with the highest hyp.avgLogProb
 27: return finalHyp

3.5. Word Selector

We treat keyword selection as a binary classification task on each potential word. To obtain word representations, we first leverage BERT [20], a pretrained language model, to produce context-aware representations for all characters in the source text and then feed them to a bi-LSTM network. Unlike Section 3.2, where the bi-LSTM is applied to the character sequence of each word, here the bi-LSTM takes the whole sequence of source character representations as input, in an attempt to build word representations that reflect how much contextual information each word carries. Given a potential word $x_{i:j}$, where $i$ and $j$ are indexes of characters in the source text, we calculate its final representation as

$$r_{i:j} = [\,\overleftarrow{g}_i; \overrightarrow{g}_j\,],$$

where $\overrightarrow{g}_j$ and $\overleftarrow{g}_i$ are the forward and backward bi-LSTM states at positions $j$ and $i$, respectively. Then, a linear transformation layer and a sigmoid function are applied sequentially to this representation to compute the probability of the word being selected.
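A rough sketch of such a selector is shown below, using the Hugging Face `transformers` BERT encoder; the checkpoint name, layer sizes, and span format are assumptions for illustration, not specifications from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class WordSelector(nn.Module):
    """Sketch: score each potential word for keyword selection. BERT produces
    context-aware character representations; a bi-LSTM runs over the whole source;
    a word spanning characters i..j is represented by the forward state at j and
    the backward state at i, then scored with a sigmoid."""
    def __init__(self, d_model=512, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():        # BERT parameters are kept frozen
            p.requires_grad = False
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, d_model,
                              bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * d_model, 1)

    def forward(self, input_ids, attention_mask, spans):
        # spans: list of (i, j) character index pairs for potential words
        with torch.no_grad():
            char_repr = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        states, _ = self.bilstm(char_repr)                       # (1, n, 2*d_model)
        half = states.size(-1) // 2
        reprs = [torch.cat([states[0, j, :half], states[0, i, half:]]) for i, j in spans]
        return torch.sigmoid(self.score(torch.stack(reprs))).squeeze(-1)  # keyword probabilities
```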

During training, words that appear in both the summary and the source text are considered positives, and the rest are negatives. To make sure that the decoder can access the entire source character sequence at inference time, we treat all characters in the source text as keywords, in addition to the multicharacter words with the highest selection probabilities. Inspired by [19], we utilize keyword information by masking out the other words when calculating the copy distribution. In particular, we leave the copy weights $\beta_{tk}$ unchanged for keywords and set them to zero for the rest of the words.
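For instance, the masking step could look roughly like this hedged sketch, which follows the description above literally (non-keyword multicharacter words get zero copy probability; whether and how to renormalize afterward is a design choice not stated here):

```python
def apply_keyword_mask(copy_attn, keyword_mask):
    """Sketch: copy_attn is the (m,) copy distribution over source tokens;
    keyword_mask is 1.0 for all source characters and the selected keywords,
    and 0.0 for the remaining multicharacter words, whose weight is zeroed."""
    return copy_attn * keyword_mask
```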

4. Experiments

4.1. Datasets and Evaluation Metric

We conduct experiments on the Large-Scale Chinese Social Media Text Summarization Dataset (LCSTS) [7], which consists of source texts of no more than 140 characters, along with human-generated summaries. The dataset is divided into three parts; the (source, summary) pairs in PART II and PART III are scored manually from 1 to 5, with higher scores indicating greater relevance between the source text and its summary. Following Hu et al. [7], after removing pairs with scores less than 3, we use PART I, PART II, and PART III as the training, validation, and test sets, respectively, with 2.4M pairs in PART I, 8K pairs in PART II, and 0.7K pairs in PART III.

We choose the ROUGE score [36] as our evaluation metric; it is widely used for evaluating automatically produced summaries. The metric measures the similarity between a generated summary and its reference based on their cooccurrence statistics. In particular, ROUGE-1 and ROUGE-2 depend on unigram and bigram overlap, respectively, while ROUGE-L relies on the longest common subsequence.
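As a toy illustration of the unigram variant only (actual evaluations should rely on a standard ROUGE implementation), ROUGE-1 F1 can be computed from clipped unigram counts:

```python
from collections import Counter

def rouge_1_f(candidate_tokens, reference_tokens):
    """Toy ROUGE-1 F1: clipped unigram overlap between a candidate and a reference."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```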

4.2. Experimental Setup

The character vocabulary consists of the 4000 most frequent characters in the training set. To obtain all potential words, we use PKUSEG [37], a toolkit for multidomain Chinese word segmentation. Specifically, there are separate segmenters for four domains: web, news, medicine, and tourism. We apply these segmenters to the source text, and if a text span is included in any of the word segmentation results, we regard it as a potential word.
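The potential-word collection step might look roughly like the sketch below, assuming the usual `pkuseg` interface (a `pkuseg.pkuseg(model_name=...)` constructor and a `cut` method); the domain names follow the paper, everything else is illustrative.

```python
import pkuseg

DOMAINS = ["web", "news", "medicine", "tourism"]

def collect_potential_words(source_text):
    """Sketch: a text span is a potential word if it appears as a token in the
    segmentation produced by any of the domain-specific segmenters."""
    potential = set()
    for domain in DOMAINS:
        seg = pkuseg.pkuseg(model_name=domain)   # loads the domain-specific model
        for token in seg.cut(source_text):
            if len(token) > 1:                   # keep multicharacter words only
                potential.add(token)
    return potential
```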

For the lexicon-constrained copying network, we employ six attention layers with 8 heads each for both the encoder and the decoder. The constant in Table 2 is set to 8. We give the character embedding and all hidden vectors the same dimension of 512 and set the filter size of the feed-forward layers to 1024. For the word selector, we use a single-layer bi-LSTM with a hidden size of 512.

During training, we update the parameters of the lexicon-constrained copying network (LCCN) and the word selector with the Adam optimizer. The learning rate schedule of Vaswani et al. [18] is used for the LCCN, while a fixed learning rate of 0.0003 is used for the word selector. The BERT model used in the word selector is pretrained on a Chinese corpus and provided by Wolf et al. [38]; we freeze its parameters throughout training.
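For reference, the learning rate schedule of Vaswani et al. [18] has the following form; the warmup value below is the default from that paper and is only illustrative here, since the setting used for the LCCN is not specified.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```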

During testing, we use a beam size of 10 and take the first 10 multicharacter words predicted by the word selector, together with all characters in the source text, as keywords.

4.3. Baselines

(i) RNN and RNN-Context are seq2seq baselines provided along with the LCSTS dataset by Hu et al. [7]. Both have a GRU encoder and a GRU decoder, while RNN-Context has an additional attention mechanism.
(ii) COPYNET integrates the copying mechanism into the seq2seq framework, trying to improve both content-based and location-based addressing.
(iii) Supervision with Autoencoder (superAE) uses an autoencoder trained on the summaries to provide auxiliary supervision for the internal representation of seq2seq. Moreover, adversarial learning is adopted to enhance this supervision.
(iv) Global Encoding refines the source representation with consideration of the global context by using a convolutional gated unit.
(v) Keyword and Generated Word Attention (KGWA) exploits relevant keywords and previously generated words to learn an accurate source representation and to alleviate the information loss problem.
(vi) Keyword Extraction and Summary Generation (KESG) first uses a separate seq2seq model to extract keywords and then utilizes the keyword information to improve the quality of the summarization.
(vii) Transformer and CopyTransformer are our implementations of the Transformer framework for the summarization task. The copy mechanism is incorporated into CopyTransformer.

4.4. Results

Table 3 reports the results of our LCCN model and other seq2seq models on the LCSTS dataset. We first compare the two Transformer baselines. CopyTransformer outperforms the vanilla Transformer by 0.8 ROUGE-1, 0.6 ROUGE-2, and 0.3 ROUGE-L, showing the importance of the copy mechanism. The gap between our LCCN and the vanilla Transformer widens further to 1.8 ROUGE-1, 2.1 ROUGE-2, and 2.5 ROUGE-L, which confirms the superiority of lexicon-constrained copying over character-based copying. Compared with other recent models, our LCCN achieves state-of-the-art performance in terms of ROUGE-1 and ROUGE-2 and is second only to KGWA in terms of ROUGE-L. When keyword information is also used, as in KGWA, LCCN+word selector further improves the performance and overtakes KGWA by 0.2 ROUGE-L. We also conduct an ablation study by removing the word-enhanced beam search from the LCCN, denoted by "w/o word-enhanced beam search" in Table 3. It shows that the word-enhanced beam search boosts performance by 1.7 ROUGE-1, 1.0 ROUGE-2, and 0.9 ROUGE-L.

4.5. Discussion

Similar to extractive summarization, we can use the top extracted keywords to form a summary, which can then be used to evaluate the quality of the keywords. The first entry of Table 4 shows the performance when the keywords are extracted by TF-IDF [39], a numerical statistic that relies on word frequency. The second entry shows the performance when we determine keywords based on the source representation learned by the encoder of the LCCN. As can be seen from the last entry, the word selector outperforms the two methods mentioned above by a large margin, indicating the importance of the external knowledge brought by BERT.

Given a source text that describes a criminal case, we show the summaries generated by different models in Table 5. It is clear that the suspect of this case is a high school student and that the victim is his baby daughter. However, the summary generated by the Transformer mistakes the high school student's parents for the victims and claims that the crime took place in the street, which is not mentioned in the source text. The summary of the CopyTransformer also makes a fundamental mistake, resulting in a mismatch between the adjective "17岁 (17-year-old)" and the noun "女婴儿 (female infant)". Compared with them, the summary of our LCCN is more faithful to the source text and contains the correct suspect and victim, i.e., "高中生 (high school student)" and "婴儿 (infant)," which are copied from the source text in only two decoder steps. With the help of the word selector, our summary further includes the keyword "害怕 (fear)," indicating the criminal motive.

Compared with character-based models, our LCCN uses fewer steps to output a summary, so it should reduce the possibility of repetition. To verify this, we record the percentage of n-gram duplicates in the summaries generated by different models. The results, shown in Figure 2, indicate that our model can indeed alleviate the repetition problem. We also notice that the repetition rate of LCCN+word selector is slightly higher than that of the LCCN, which may be due to the smaller output space after adding the word selector.
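The repetition statistic can be computed, for each generated summary, as the fraction of n-grams that duplicate an earlier n-gram, e.g.:

```python
def ngram_repetition_rate(tokens, n):
    """Fraction of n-grams in a summary that are duplicates of another n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```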

5. Conclusion

In this paper, we propose a novel lexicon-constrained copying network for Chinese abstractive summarization. Querying the multigranularity representation learned by our encoder, our decoder can copy either a character or a multicharacter word at each time step. Experiments on the LCSTS dataset show that our model is superior to the Transformer baselines and quite competitive with the latest models. With the help of the keyword information provided by the word selector, it can even achieve state-of-the-art performance. In the future, we plan to apply our model to other tasks, such as comment generation, and to other languages, such as English.

Data Availability

The data used to support the findings of this study are included within the article.

Disclosure

A preprint has previously been published [40].

Conflicts of Interest

The authors declare that they have no conflicts of interest.