Abstract

Short text similarity computation plays an important role in various natural language processing tasks. Siamese neural networks are widely used in short text similarity calculation. However, due to the complexity of syntax and the correlations between words, siamese networks alone cannot achieve satisfactory results. Many studies show that an attention mechanism can enhance the contribution of the key features used to measure sentence similarity. In this paper, a similarity calculation method is proposed that combines semantics with a headword attention mechanism. First, a BiGRU model is used to extract contextual information. After obtaining the headword set, semantically enhanced representations of the two sentences are obtained through an attention mechanism and character splicing. Finally, we use a one-dimensional convolutional neural network to fuse the word embedding information with the contextual information. The experimental results on the ATEC and MSRP datasets show that the recall and F1 values of the proposed model are significantly improved by the introduction of the headword attention mechanism.

1. Introduction

In machine learning, text similarity is a type of similarity learning and a hot research area in the field of natural language processing (NLP). Its influence in fields such as question answering systems, information retrieval, machine translation, and text classification is becoming increasingly significant [1]. For example, the calculation of the matching degree between query terms and documents in retrieval systems, and between questions and candidate answers in question answering systems, is based on text similarity. Therefore, research on semantic similarity calculation is highly significant for the development of NLP-based systems.

However, the calculation of text similarity is a challenging task. A few short words can convey complex and subtle content, and seemingly different sentences can express very similar semantics, so text should be analyzed not only at different levels of granularity but also at a deeper level within specific linguistic contexts. Previous studies were limited to traditional statistical models for text similarity calculation, such as the Term Frequency-Inverse Document Frequency (TF-IDF) model based on literal matching, the BM25 model, and latent semantic analysis based on semantic matching [2–4]. However, these models rely on keyword information for matching, which only allows the extraction of shallow information and ignores deep semantic information [5]. Methods based on neural network models use word2vec and similar techniques to convert words into word vectors, train a model to obtain a feature representation of the sentence, and then use fully connected layers or distance measures to calculate the similarity. Hu et al. [6] used convolutional neural networks to model two sentences and calculated their similarity through the extracted semantic vectors. Sundermeyer et al. [7] applied long short-term memory (LSTM) networks to NLP; LSTMs address the difficulty traditional recurrent neural networks have in capturing long-distance dependencies in input sequences. Zhu et al. [8] proposed a bidirectional LSTM network based on a siamese network structure to calculate text similarity; their network traverses the entire text using two LSTM models and comprehensively considers the contextual information accompanying each word.

In the field of deep learning, current methods for comparing the similarity of two sentences fall mainly into three types: siamese network frameworks, interactive network frameworks, and pretrained models [9]. The common approach with siamese networks is to map the two sentences through the same encoder, compare the resulting representations, and evaluate their similarity through the calculation of a loss function [10–13]. Sharing parameters through a siamese network greatly reduces computation time, but this approach does not take into account the interactive relationship between the sentence encoding vectors. It is also difficult to measure the contextual importance of words, which results in poor accuracy. Studies on interactive network frameworks for text similarity include ESIM, BiMPM, and DIIN [14, 15]. In these approaches, the two sentences are first encoded using a neural network, the similarity between word sequences in the text is calculated through a complex attention mechanism to form an interaction matrix, and the interaction information is finally integrated. However, global information such as syntax and inter-sentence relationships is ignored. Using pretrained models for text similarity tasks can lead to good results, as demonstrated by BERT [16] and XLNet [17]. These models are trained on a large-scale corpus and then fine-tuned on a target dataset from a specific field. However, pretrained models have a very large number of parameters, and it is difficult to change their network structure, which limits their applicability.

The attention mechanism can be understood as a means of focusing on specific parts of the data. It was first adopted in the image processing field to focus on key information in specific image regions. Bahdanau et al. [18] first introduced the attention mechanism into natural language processing, aligning the output of the target end with the input of the source end to improve the accuracy of machine translation. Subsequently, scholars have proposed various attention mechanisms for different tasks. For example, Cheng et al. [19] proposed a one-way self-attention mechanism in reading tasks to analyze the correlation between the current and previous words. He et al. [20] and Shan et al. [21] found that in recommendation systems, an attention mechanism can capture the long-term and short-term interests of users effectively and improve the accuracy of the system. Tan et al. [22] used an attention mechanism based on a BiLSTM and a CNN to obtain separate semantic representations of the question and the candidate answers in Q&A and answer selection tasks, and used cosine similarity to fuse them. Their results showed that a word-level attention mechanism alone leads to good results.

The main contributions of this paper can be summarized as follows: (1) A semantic similarity calculation method based on the siamese network structure is proposed, combining a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BiGRU). The BiGRU network is used to extract contextual information, and the CNN is then used to fuse the word embedding information with the contextual information. (2) An attention mechanism based on the headword is proposed, and the output of the BiGRU is weighted and updated to enhance the influence of the headwords of the sentence.

2. Methods

The proposed HA-RCNN model for calculating text similarity consists of three components: (1) A sentence encoder. We use a BiGRU to extract the contextual information and combine it with the word embedding information to obtain a representation of each word in the sentence. (2) A headword-based attention mechanism. We use the nouns or verbs that reflect the main information of the sentence as headwords. After obtaining the set of headwords, the output of the BiGRU is weighted and updated. (3) Information fusion. In this part, the word sequences obtained after splicing are fused. Finally, we use cosine similarity as the evaluation function to determine the similarity of the two texts. Figure 1 gives an illustration of the proposed HA-RCNN model.
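
For concreteness, the following is a minimal sketch of how such a siamese encoder could be assembled, assuming PyTorch. The layer sizes, the simple multiplicative headword weighting, the pooling step, and all names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HARCNNEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # BiGRU extracts contextual information for every word (component 1).
        self.bigru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # 1D convolution fuses [context; word embedding] into fixed-size features (component 3).
        self.conv = nn.Conv1d(2 * hidden_dim + emb_dim, hidden_dim, kernel_size=1)

    def forward(self, token_ids, headword_weights):
        emb = self.embedding(token_ids)                    # (B, T, E)
        ctx, _ = self.bigru(emb)                           # (B, T, 2H)
        # Headword attention (component 2): scale context vectors by per-word weights.
        ctx = ctx * headword_weights.unsqueeze(-1)         # (B, T, 2H)
        x = torch.cat([ctx, emb], dim=-1).transpose(1, 2)  # (B, 2H+E, T)
        y = torch.tanh(self.conv(x))                       # (B, H, T)
        return F.max_pool1d(y, y.size(-1)).squeeze(-1)     # (B, H) sentence vector

def sentence_similarity(encoder, ids1, w1, ids2, w2):
    # Siamese use: the same encoder maps both sentences; cosine similarity compares them.
    v1, v2 = encoder(ids1, w1), encoder(ids2, w2)
    return F.cosine_similarity(v1, v2, dim=-1)
```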

2.1. Sentence Encoding

Recurrent neural networks (RNNs) are the most common and effective method for dealing with sequences [23]. Through the interconnections between the nodes of the hidden layer, previous memory is factored into the current output to capture contextual information. However, vanishing and exploding gradients may occur during training, so only a limited amount of contextual information can be captured. GRU networks use gating functions to control the hidden state and filter the useful information in the sequence, which avoids the gradient explosion problem.

The GRU is a variant of the LSTM. Compared with LSTMs, GRUs have a simpler network structure but achieve comparable results, which greatly reduces training time. GRUs merge the input gate and the forget gate into a single structure called the update gate.

GRUs use a gating mechanism to control input, memory, and other information when making predictions at the current time step. A GRU has two gates, a reset gate and an update gate. Intuitively speaking, the reset gate determines how to combine the new input with the previous memory, while the update gate defines how much of the previous memory is carried over to the current time step. The special feature of these two gating mechanisms is that they can preserve information from long-term sequences, which is not lost over time even if it is not relevant to the current prediction. If the reset gate is set to 1 and the update gate is set to 0, a standard RNN model is obtained. The update equations of the GRU are as follows:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \quad (1)$$

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \quad (2)$$

$$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right) \quad (3)$$

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \quad (4)$$

where $x_t$ is the current input; $\sigma$ is the sigmoid function; $h_{t-1}$ and $h_t$ are the hidden states at the previous and current time step, respectively; $\tilde{h}_t$ is the candidate state at the current time step; and $h_t$ also serves as the output at the current time step. Equations (1) and (2) pertain to the update and reset gate, respectively.
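
As a concrete illustration, the following is a small NumPy sketch of one GRU step following equations (1)–(4); the weight shapes and the concatenated-input parameterization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU update. x_t: (d_in,), h_prev: (d_h,), each W*: (d_h, d_h + d_in)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ concat)                                  # update gate, eq. (1)
    r_t = sigmoid(Wr @ concat)                                  # reset gate, eq. (2)
    h_cand = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))  # candidate state, eq. (3)
    return z_t * h_prev + (1.0 - z_t) * h_cand                  # new hidden state, eq. (4)

# Example step with random parameters (d_in = 8, d_h = 16):
h = gru_step(np.random.randn(8), np.zeros(16),
             np.random.randn(16, 24), np.random.randn(16, 24), np.random.randn(16, 24))
```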

In a GRU, information is transmitted in only one direction. In practice, each word may have dependency relationships with the words in its context, so in this paper a BiGRU network is adopted. A BiGRU is composed of a forward and a backward GRU. It traverses the text in both directions and obtains contextual information bidirectionally, thus overcoming the single-direction limitation of plain GRUs. The process is shown in Figure 2.

The sentence sequence $S = \{w_1, w_2, \ldots, w_n\}$ is obtained through the embedding layer, where $n$ is the sentence length and $w_i$ is the i-th word in the sentence. $c_l(w_i)$ and $c_r(w_i)$ represent the contextual information on the left and right side of word $w_i$, respectively, and are obtained by training the forward and backward GRU, respectively, as shown below:

$$c_l(w_i) = f\left(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})\right)$$

$$c_r(w_i) = f\left(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})\right)$$

In the above equations, $e(w_{i-1})$ represents the word embedding of word $w_{i-1}$; $c_l(w_{i-1})$ represents the vector representation of the contextual information on the left side of $w_{i-1}$; $W^{(l)}$ is the transformation matrix of the contextual information vector; $W^{(sl)}$ is the matrix that combines the current word vector with the left contextual vector of the next word; and $f$ is a nonlinear activation function. $c_r(w_i)$ is calculated in a similar way.

After extracting the contextual information with the BiGRU, the contextual information and the word embedding information are spliced together. Finally, we obtain the semantic representation of the i-th word in the word sequence as $x_i = \left[c_l(w_i); e(w_i); c_r(w_i)\right]$.
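
A sketch of this encoding step, assuming PyTorch, is shown below. Note that the built-in bidirectional GRU output at position i already incorporates $w_i$ itself, which is a close but not exact match to the left/right-context formulation above; all sizes are illustrative.

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 128, 128, 10000   # illustrative sizes
embedding = nn.Embedding(vocab_size, emb_dim)
bigru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 12))   # one sentence of 12 tokens
e = embedding(token_ids)                            # e(w_i): (1, 12, 128)
ctx, _ = bigru(e)                                   # forward/backward outputs: (1, 12, 256)
c_l, c_r = ctx[..., :hidden_dim], ctx[..., hidden_dim:]
x = torch.cat([c_l, e, c_r], dim=-1)                # x_i = [c_l(w_i); e(w_i); c_r(w_i)]
```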

2.2. Headword-Based Attention Mechanism

In previous studies, attention mechanisms were used to enhance the expression of local information. However, these mechanisms usually take into account the number of occurrences of certain words in a sentence from a traditional statistical perspective, resulting in an increase in the weight of some unimportant words.

Our approach is based on the assumption that the nouns and verbs in a sentence reflect its main information, and we consider them as headwords. For example, the information in the sentence “Does Ant Check Later require a credit check?” is mainly expressed through the words “require,” “Ant Check Later,” and “credit check.” In the sentence “When will the deposit rate go up?”, the information is expressed mainly through “go up” and “deposit rate.”

To obtain the headwords, we use the Language Technology Platform (LTP) to analyze the sentence syntactically. As an example, for the sentence “How do I apply for quota in Huabei?,” we obtain the result shown in Figure 3.

The meanings of corresponding tags are shown in Table 1.

After analysis, “apply” is identified as the main verb of the sentence and is extracted as the core verb. If the subject or object of the core verb is a noun or a noun phrase, it is assigned as the primary noun. If the modifying and coordinate elements of the primary noun also contain nouns, they are added to the noun set as well. The core verb and the noun set together form the headwords of the sentence. In addition, if the main verb cannot be extracted through syntactic analysis, the headwords are extracted directly based on part of speech. Therefore, there may be multiple primary nouns. For example, in Figure 3, the subject of “apply” is “I” and the object is “quota.” Because “I” is a personal pronoun rather than a noun or noun phrase, the object “quota” is a noun, and the noun “Huabei” is a modifier of “quota,” the headwords of the sentence in Figure 3 are {apply, quota, Huabei}.
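
The following sketch illustrates these extraction rules over an LTP-style dependency parse. The relation tags (HED, SBV, VOB, ATT, COO), the part-of-speech prefixes ('n', 'v'), and the toy parse for the example sentence are assumptions for illustration, not the authors' exact rule set or LTP's exact output.

```python
# Headword extraction from a parse given as (word, pos, head_index, relation)
# tuples, with 1-based head indices and 0 denoting the root.

def extract_headwords(parse):
    words = [w for w, _, _, _ in parse]
    # 1. Core verb: the verb attached to the root with the HED relation.
    core = next((i for i, (_, pos, head, rel) in enumerate(parse)
                 if rel == "HED" and pos.startswith("v")), None)
    if core is None:
        # Fallback: no main verb found, so take all nouns/verbs by part of speech.
        return [w for w, pos, _, _ in parse if pos.startswith(("n", "v"))]

    headwords = [words[core]]
    # 2. Primary nouns: noun subjects or objects of the core verb.
    nouns = [i for i, (_, pos, head, rel) in enumerate(parse)
             if head == core + 1 and rel in ("SBV", "VOB") and pos.startswith("n")]
    # 3. Nouns that modify or are coordinated with a primary noun are added as well.
    for i, (_, pos, head, rel) in enumerate(parse):
        if pos.startswith("n") and rel in ("ATT", "COO") and (head - 1) in nouns:
            nouns.append(i)
    headwords += [words[i] for i in nouns]
    return headwords

# Simplified, hypothetical parse for "How do I apply for quota in Huabei?":
parse = [("I", "r", 2, "SBV"), ("apply", "v", 0, "HED"),
         ("Huabei", "n", 4, "ATT"), ("quota", "n", 2, "VOB")]
print(extract_headwords(parse))   # ['apply', 'quota', 'Huabei']
```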

After the headwords have been obtained, they are denoted as $K = \{k_1, k_2, \ldots, k_m\}$, where $m$ is the number of words in the set. We use $K$ to weight and update the outputs of the forward ($c_l$) and backward ($c_r$) GRUs. Specifically, for each output vector $c(w_i)$, its similarity with the vector representation of each headword $k_j$ is calculated separately, and the maximum value $\alpha_i$ is taken:

$$\alpha_i = \max_{1 \le j \le m} \mathrm{sim}\left(c(w_i), k_j\right)$$

By updating $c(w_i)$ with the weight $\alpha_i$, we obtain the semantically enhanced representation based on headword attention.
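
A sketch of this weighting step is given below, assuming PyTorch and cosine similarity as the similarity function. Representing each headword by its BiGRU output vector is an illustrative choice made here so that both vectors live in the same space; it is not fixed by the paper.

```python
import torch
import torch.nn.functional as F

def headword_attention(ctx, headword_vecs):
    """ctx: (T, d) BiGRU outputs; headword_vecs: (m, d) headword representations."""
    # sim[i, j] = similarity between word i's context vector and headword j.
    sim = F.cosine_similarity(ctx.unsqueeze(1), headword_vecs.unsqueeze(0), dim=-1)  # (T, m)
    alpha = sim.max(dim=1).values          # alpha_i = max_j sim(c(w_i), k_j)
    return ctx * alpha.unsqueeze(-1)       # weighted, semantically enhanced outputs

# Usage with random tensors standing in for real encodings:
ctx = torch.randn(12, 256)
headword_vecs = ctx[[3, 7, 9]]             # e.g. BiGRU outputs at the headword positions
enhanced = headword_attention(ctx, headword_vecs)
```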

2.3. Information Fusion

CNNs extract local information from the text through a fixed-size convolution kernel and use a pooling layer to reduce the amount of computation while retaining key information. Because the convolution kernel has a fixed window, some important information may be lost. Although this problem can be alleviated by using multiple windows of different sizes, doing so increases the amount of computation.

We use a one-dimensional CNN (1DCNN) to fuse the information of the spliced word sequences. The calculation process is as follows:

$$y_i = f\left(W x_i + b\right)$$

where $y_i$ represents the feature representation corresponding to $x_i$ after 1D convolution processing, and $W$ and $b$ are the convolution weights and bias. The calculation process is shown in Figure 4.
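
A minimal PyTorch sketch of this fusion step follows; kernel size 1 (fusing each spliced vector $x_i$ independently) and max pooling to a sentence vector are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

spliced = torch.randn(1, 12, 384)                        # (batch, T, dim of [c_l; e; c_r])
conv = nn.Conv1d(in_channels=384, out_channels=128, kernel_size=1)
y = torch.tanh(conv(spliced.transpose(1, 2)))            # y_i = f(W x_i + b), shape (1, 128, 12)
sentence_vec = F.max_pool1d(y, y.size(-1)).squeeze(-1)   # pooled (1, 128) sentence vector
```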

Finally, after obtaining the vector representations of the two sentences, we use their cosine similarity to determine whether the two texts are semantically similar. The corresponding equation is:

$$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

where $v_1$ and $v_2$ are the vector representations of the two sentences.
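
In code, the final decision could look like the following sketch (PyTorch); the 0.5 decision threshold is a hypothetical choice, not a value reported in the paper.

```python
import torch
import torch.nn.functional as F

v1, v2 = torch.randn(4, 128), torch.randn(4, 128)   # stand-ins for the sentence vectors
score = F.cosine_similarity(v1, v2, dim=-1)          # cosine similarity for each pair
prediction = (score > 0.5).long()                    # 1 = similar, 0 = dissimilar (assumed threshold)
```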

3. Experiment

We conducted experiments to demonstrate the effectiveness of the proposed HA-RCNN model. In this section, the experimental datasets and evaluation criteria are first introduced, followed by a detailed analysis of the experimental results.

3.1. Datasets

Two datasets were used to verify the performance of the model, as follows:

(a) The Ant Financial NLP Competition (ATEC) dataset was obtained from Ant Financial’s 2018 competition. Each pair of sentences in the dataset comes from questions received by an intelligent customer service system and is labeled with “1” if the two sentences are semantically similar and “0” otherwise.

(b) The Microsoft Research Paraphrase Corpus (MSRP) is a collection of sentence pairs obtained from news articles on the web. As in the ATEC dataset, each pair of sentences is labeled with 0 or 1 for dissimilarity or similarity, respectively.

In the ATEC dataset, the training set contains 100,000 sentence pairs and the test set contains 10,000 sentence pairs. During preprocessing, we found that the ratio of positive and negative samples in ATEC was significantly unbalanced, at about 4.5 : 1. To avoid the impact of sample imbalance on the experiment, we selected 32,250 sentence pairs for training and 6,450 sentence pairs for testing, with positive and negative samples each accounting for half of each subset. The MSRP dataset contains 5,803 sentence pairs, including 4,077 pairs in the training set and 1,726 pairs in the test set. Due to the small number of samples in MSRP, we did not further divide the dataset. The standard format of the two datasets is shown in Table 2.

In the experiments, we used accuracy, precision, recall, and F1 as the evaluation criteria, calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the number of positive samples predicted as positive; TN is the number of negative samples predicted as negative; FP is the number of negative samples predicted as positive; and FN is the number of positive samples predicted as negative.
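
These metrics can be computed directly from the confusion-matrix counts, as in the following small helper (the example counts are arbitrary):

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

print(classification_metrics(tp=50, tn=40, fp=10, fn=5))
```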

3.2. Experimental Results and Analysis

In order to prove the effectiveness of HA-RCNN, we compared it with state-of-the-art models used for the same application.

MMNF [24]: This model uses the Jaccard coefficient based on part of speech, TF-IDF, and a Word2Vec-CNN model to measure sentence similarity through a weighted combination.

BiGRU + Dilated [25]: This model uses constituency parsing and dilated convolution to reduce the missing elements in long sentences and increase the important information in short sentences. At the same time, the receptive field is extended to capture semantic relevance in two-dimensional space.

Tree-LSTM [26]: This model uses a control input to model the relationship between the two inputs. To calculate sentence similarity, the semantic representation is embedded into a dense vector through syntactic parsing and other operations.

CNN-LSTM [27]: This model is based on the siamese neural network structure. A CNN and an LSTM are used to obtain the local and global information of the text, respectively.

Multi-Feature [28]: This model evaluates the similarity of two sentences in terms of words, word order, and word vectors, and introduces word vectors into the traditional statistics-based discriminative method, taking into account the structural information of the sentences.

The results of the comparative experiment are shown in Figures 5 and 6.

As can be seen from the data in the figures, the performance of the CNN-LSTM model is poor. Although the text is analyzed from both local and global perspectives, the model focuses on only a few factors and ignores the influence of sentence structure and syntactic information. Compared with the CNN-LSTM model, the Tree-LSTM and BiGRU + Dilated models control the word vectors through operations such as syntactic analysis, taking into account the influence of sentence structure on text similarity evaluation. The MMNF model extracts text features by combining different machine learning algorithms with a CNN network, maximizing performance through continuous adjustment of the weights; its accuracy is the highest among the compared models. The recall and F1-value of the HA-RCNN model are better than those of the other models, although its accuracy is lower than that of the MMNF model. The comparison of F1-values shows that the HA-RCNN model achieves excellent results on the ATEC dataset.

On the MSRP dataset, the performance of the BiGRU + Dilated and CNN-LSTM models was poor. The Multi-Feature model performed well due to its combination of multiple features, aided by the use of word vectors. The recall of the HA-RCNN model was again the highest, and its F1-value was also excellent.

It can also be seen that the HA-RCNN model lags behind the other two models in accuracy. Inspection of the confusion matrix shows that the number of TP samples is small, which leads to the lower accuracy of the model. We believe there are two reasons for this. First, the training and test sets provided by the MSRP dataset are small, and the proportion of positive and negative samples is unbalanced, which makes accurate classification of the positive samples difficult. Second, the proposed model is relatively complex, and it is difficult to train it to the desired effect on small datasets.

We replaced the BiGRU in the model with a BiLSTM and observed the change in model performance on the ATEC dataset, as shown in Figure 7.

It is evident that this change hardly affects model performance, while the BiGRU-based model trains faster.

To further verify the effectiveness of the proposed method, we introduced attention mechanisms proposed by other scholars into our model for comparison. The results are shown in Table 3.

These three attention mechanisms allow the model to learn to determine the importance of each word by increasing the weight of important words and reducing the weight of unimportant words. To avoid interfering with weight assignment, no lexical rules were introduced.

To test whether the proposed attention mechanism can capture important information in sentences, we randomly selected a pair of sentences from the dataset and observed how they affected the output of the mechanism. The results are shown in Figure 8.

It can be seen that after applying the attention mechanism, the semantic expression of the important elements in the sentence is enhanced, as is the correlation between them and other important elements, whereas the non-important elements of the sentence are affected little.

4. Conclusion

In this paper, we proposed a model based on a siamese network, integrating semantic information and a headword attention mechanism to learn sentence representations. Our model obtains a semantically enhanced representation through the headword attention mechanism, which increases the influence of key information in the sentence. In order to verify the performance of the model, we conducted experiments on the ATEC and MSRP datasets. Compared with other models, our model achieved relatively excellent performance in the recall and F1 metrics.

In future work, we will study the impact of multi-level attention mechanisms on model performance, and incorporate external knowledge into our model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Fundamental Research Funds for Central Universities (Grant no. 2572015CB32) and the National Natural Science Foundation of China (Grant no. 61901103).