| Reference | Authors | Dataset preprocessing | Input (word embedding) |
| --- | --- | --- | --- |
| [18] | Rush et al. | PTB tokenization: "#" replaces all digits, all letters lowercased, "UNK" replaces words occurring fewer than 5 times | Bag-of-words embedding of the input sentence |
| [39] | Chopra et al. | PTB tokenization: "#" replaces all digits, all letters lowercased, "UNK" replaces words occurring fewer than 5 times | Encodes the position information of the input words |
| [55] | Nallapati et al. | Part-of-speech and named-entity tag generation, and tokenization | (i) Encodes the position information of the input words; (ii) input text represented with a 200-dimensional Word2Vec model trained on the Gigaword corpus; (iii) continuous features such as TF-IDF represented using bins, with a one-hot representation per bin; (iv) lookup embeddings for part-of-speech and named-entity tags |
| [52] | Zhou et al. | PTB tokenization: "#" replaces all digits, all letters lowercased, "UNK" replaces words occurring fewer than 5 times | Word embedding of size 300 |
| [53] | Cao et al. | Normalization and tokenization: "#" replaces digits, words lowercased, "UNK" replaces the least frequent words | GloVe word embedding with dimension 200 |
| [54] | Cai et al. | Byte-pair encoding (BPE) used for segmentation | Transformer |
| [50] | Adelson et al. | Articles and their headlines converted to lowercase | GloVe word embedding |
| [29] | Lopyrev | Tokenization; articles and their headlines converted to lowercase; a symbol used to replace rare words | Distributed representation of the input |
| [38] | Jobson et al. | Not specified | Word embedding randomly initialised and updated during training in the first model; GloVe word embedding used to represent the words in the second and third models |
| [56] | See et al. | Not specified | Word embedding of the input learned from scratch instead of using a pretrained word embedding model |
| [57] | Paulus et al. | The same as in [55] | GloVe |
| [58] | Liu et al. | Not specified | CNN maximum pooling used to encode the discriminator input sequence |
| [30] | Song et al. | Words segmented using the CoreNLP tool; coreference resolution and morphology reduction | Convolutional neural network used to represent the phrases |
| [35] | Al-Sabahi et al. | Not specified | Word embedding learned from scratch during training, with a dimension of 128 |
| [59] | Li et al. | The same as in [55] | Learned from scratch during training |
| [60] | Kryściński et al. | The same as in [55] | Embedding layer with a dimension of 400 |
| [61] | Yao et al. | Not specified | Word embedding learned from scratch during training, with a dimension of 128 |
| [62] | Wan et al. | No word segmentation | Embedding layer learned during training |
| [65] | Liu et al. | Not specified | BERT |
| [63] | Wang et al. | WordPiece tokenizer | BERT |
| [64] | Egonmwan et al. | Not specified | GloVe word embedding with dimension 300 |
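Several rows above ([18], [39], [52], and in a close variant [53]) share the same PTB-style preprocessing: digits replaced by "#", letters lowercased, and words occurring fewer than 5 times replaced by "UNK". A minimal sketch of that pipeline, with function and parameter names chosen here for illustration (the cited papers do not specify an implementation):

```python
import re
from collections import Counter

def ptb_style_preprocess(sentences, min_freq=5):
    """PTB-style preprocessing as described for [18], [39], [52]:
    replace digits with "#", lowercase, map rare words to "UNK"."""
    # Whitespace-tokenize, lowercase, and replace every digit with "#".
    tokenized = [
        [re.sub(r"\d", "#", tok.lower()) for tok in s.split()]
        for s in sentences
    ]
    # Count token frequencies over the whole corpus.
    counts = Counter(tok for sent in tokenized for tok in sent)
    # Replace tokens occurring fewer than min_freq times with "UNK".
    return [
        [tok if counts[tok] >= min_freq else "UNK" for tok in sent]
        for sent in tokenized
    ]

corpus = ["Sales rose 5 percent"] * 5 + ["Zygote"]
print(ptb_style_preprocess(corpus)[0])   # ['sales', 'rose', '#', 'percent']
print(ptb_style_preprocess(corpus)[-1])  # ['UNK']
```

The rare-word cutoff is applied after digit replacement and lowercasing, so surface variants such as "Sales" and "sales" pool their counts before the frequency threshold is tested.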