Abstract

Sentence similarity calculation is one of the important foundations of natural language processing. Existing sentence similarity measures are based either on shallow semantics, which inadequately capture latent semantic information, or on deep learning algorithms, which require supervision. In this paper, we improve the traditional tolerance rough set model so that it has lower time complexity and supports incremental updates. Based on the resulting probabilistic tolerance rough set model, we then propose a sentence similarity computation model that treats text data from the perspective of uncertainty. The model mines latent semantic information and is unsupervised. Experiments on the SICK2014 task and the STSbenchmark dataset show that our model calculates sentence similarity effectively and efficiently.

1. Introduction

With the rapid development of information technology, the volume of text data is growing continuously. Unlike numerical data, text data are more complex and difficult to process. Sentence similarity aims at calculating the degree of resemblance, or the distance, between two sentences. It plays an important role in natural language processing (NLP) applications such as text summarization [1, 2], machine translation [3], question answering systems [4], and information retrieval [5]. These applications rely on sentence similarity to a considerable extent, and their development makes research on sentence similarity urgent.

Text data are characterized by uncertainty, inaccuracy, and incompleteness. Existing sentence similarity computation methods are almost all based either on relations between the words in the sentences or on deep learning algorithms. Methods based on word relations, such as word cooccurrence, mainly consider sentence semantics at a shallow level and cannot capture the latent semantic information behind the sentences. Methods based on deep learning algorithms, such as convolutional neural networks (CNNs), can capture deep semantic information, but most of them have high time complexity and require supervision. In addition, neither class of methods handles the uncertainty and imprecision of text sentences well. In this paper, we start from the uncertainty and imprecision of text data. We improve the tolerance rough set model of Ho et al. [6] and present a sentence similarity computation model based on the resulting probabilistic tolerance rough set model. Our model can not only handle the uncertainty and imprecision of text data but also overcome the shortcomings mentioned above.

This paper is organized as follows. Related work on sentence similarity measures is reviewed in Section 2. Section 3 presents our proposed probabilistic tolerance rough sets-based model for sentence similarity computation in detail. Section 4 presents and discusses the experimental results on sentence similarity tasks. Section 5 concludes the paper.

2. Related Works

Our main work is to improve the traditional tolerance rough set model and then establish a sentence similarity computation model based on the resulting probabilistic tolerance rough set model. In this section, we review related work on sentence similarity calculation methods and on tolerance rough set models in NLP.

2.1. Sentence Similarity Calculation

Traditional work on sentence similarity is generally divided into two classes: methods based on shallow semantics and methods based on deep learning algorithms. The idea behind shallow semantics methods is to calculate the similarity between words; methods based on word cooccurrence and corpus-based methods are two representatives. Methods based on word cooccurrence are described in [7–9]. Han et al. used the Bag-of-Words (BoW) technique [8], and Jones et al. [7] applied the term frequency-inverse document frequency (TF-IDF) technique to represent sentences, after which the cosine distance or Euclidean distance was used to calculate the similarity between sentences. A keyword-based approach was proposed in [9], which calculates ranking scores for the keywords extracted from the sentences. Corpus-based methods, which rely on lexical resources such as WordNet and HowNet, are described in [10, 11]. In [12], Prasad et al. combined common words and semantic features for measuring sentence similarity: they extracted syntactic features by searching for common words between sentences and semantic features by exploiting the information content of the sentences. Methods based on shallow semantics can only capture the literal meaning of sentences and fail to capture the high-level semantic information behind them.

Nowadays, neural networks and deep learning are widely used in NLP and have achieved great success. By training on sentences with deep learning algorithms, deep semantic information can be captured for sentence similarity computation. In [13], a CNN-based parallel semantic matching model was established: two parallel CNNs were built to model the two sentences, respectively, and were then cascaded into one multilayer neural network for matching the similarity of the sentences. An elaborate convolutional network (ConvNet) variant was presented in [14], which infers sentence similarity by integrating differences of convolutions at different scales. For variable-length and complex sentences, Mueller et al. proposed a Siamese network based on the long short-term memory (LSTM) model [15]. The methods mentioned above mainly concentrate on the similar information between two sentences; in contrast, methods concentrating on the dissimilar information between two sentences have also been proposed. Wang et al. developed a sentence similarity learning model by decomposing and composing lexical semantics, which considers both the similar and the dissimilar information between sentences [16]. In [17], a context-aligned recurrent neural network (CA-RNN) model was put forward, in which the contextual information of the aligned words is integrated into the neural network. Liu et al. incorporated shallow and deep information to evaluate sentence similarity [18]: the shallow part is represented by lexical similarity based on keywords and sentence lengths, and the deep part is modeled by a parallel CNN that extracts both the whole sentences and their context as features. However, most sentence similarity learning algorithms based on neural networks and deep learning are supervised and need to be trained on a labeled dataset first. Devlin et al. [19] proposed the Bidirectional Encoder Representations from Transformers (BERT) model, whose unsupervised pre-training has achieved excellent results for language representation.

It is undeniable that text data possess uncertainty, imprecision, and incompleteness. However, the methods mentioned above do not measure the similarity between sentences from the perspective of uncertainty and imprecision. Fuzzy set theory and rough set theory were created to handle such uncertainty and imprecision. A fuzzy set and rough sets-based approach was developed for measuring cross-lingual semantic similarity [20]. In [1], Chatterjee et al. proposed a fuzzy rough sets-based model in which sentence similarity is computed from the upper and lower approximations of the two sentences.

We improve the traditional tolerance rough set model and propose a sentence similarity computation model based on the resulting probabilistic tolerance rough set model. By treating text data from the perspective of uncertainty, the model both avoids the inability of shallow-semantics methods to capture high-level semantic information and overcomes the reliance on supervision of deep learning methods, giving it the advantages of capturing more latent semantic information while remaining unsupervised.

2.2. Tolerance Rough Sets in NLP

Rough set theory was proposed by the Polish scholar Pawlak in 1982 for handling uncertainty, imprecision, and fuzziness [21]. It has been effectively applied in machine learning, data mining, and NLP [22–24]. Rough set theory partitions the universe by an equivalence relation. Whether an object belongs to a set is described by a pair of sets called the lower approximation and the upper approximation; the part of the upper approximation outside the lower approximation, to which objects possibly belong, is called the boundary region. Researchers have generalized rough set theory into several extended models according to different requirements, including the probabilistic rough set model [25], the decision-theoretic rough set model [26], and the tolerance rough set model [27]. An equivalence relation has the three properties of reflexivity, symmetry, and transitivity, and the requirement of transitivity makes it inapplicable in some cases. Since some applications cannot satisfy the transitivity condition, Skowron et al. introduced the tolerance relation to replace the equivalence relation, and the corresponding model is the tolerance rough set model [27].

With the tolerance rough set model applied in NLP, a search result clustering method was put forward [28], in which the tolerance relation was defined by the number of word cooccurrences in documents. In [29], a tolerance rough sets-based semantic clustering algorithm was introduced by Meng et al. for web search results, extending the original text semantics and alleviating the limitation caused by data sparsity. A nonhierarchical document clustering algorithm was established by Ho et al. [6] for information retrieval based on a tolerance rough set model, which can capture more potential semantic information. Patra and Nandi developed a single-link clustering algorithm based on the tolerance rough set model to obtain better clustering results [30]. In this paper, we adopt the tolerance rough set model by expressing each sentence as a pair of upper and lower approximations and computing the upper approximation similarity and the lower approximation similarity separately.

3. Proposed Method

In this section, we first describe the traditional tolerance rough set model briefly. Then, we introduce the probabilistic tolerance rough sets-based sentence similarity calculation model in detail.

3.1. Tolerance Rough Set Theory

A tolerance space is defined as a quadruple R = (U, I, ν, P) [6], where U is the universe of all objects, I : U → 2^U is an uncertainty function whose values I(x), for x ∈ U, form a set of tolerance classes, ν : 2^U × 2^U → [0, 1] is a vague inclusion, and P : I(U) → {0, 1} is a structural function. The uncertainty function assigns to each object x its tolerance class I(x): if an object y shares similar information with x, then y is an element of I(x). Any function satisfying reflexivity and symmetry can be used as an uncertainty function, that is, x ∈ I(x) for arbitrary x ∈ U, and y ∈ I(x) iff x ∈ I(y). The vague inclusion ν is monotonic with respect to its second argument, i.e., for any X, Y, Z ⊆ U, if Y ⊆ Z then ν(X, Y) ≤ ν(X, Z); it measures the degree to which a set contains the tolerance class I(x) of an object x. The structural function P divides the tolerance classes into two classes, structural subsets (P(I(x)) = 1) and nonstructural subsets (P(I(x)) = 0) [6]. The upper approximation and lower approximation of any X ⊆ U are defined as

L_R(X) = {x ∈ U : P(I(x)) = 1 and ν(I(x), X) = 1},
U_R(X) = {x ∈ U : P(I(x)) = 1 and ν(I(x), X) > 0}.

If the upper and lower approximations are instead parameterized by probability values α and β and denoted as

U_α(X) = {x ∈ U : ν(I(x), X) > α},
L_β(X) = {x ∈ U : ν(I(x), X) ≥ β},

where 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and α < β, then the model is called the probabilistic tolerance rough set model [25].

3.2. Probabilistic Tolerance Rough Sets-Based Sentence Similarity Model

Firstly, we introduce the definition of the quadruple of tolerance rough sets in our model. Suppose that U = {w_1, w_2, ..., w_M} is the set of all the words in the corpus, where M is the vocabulary size; we take U as the universe. Determining the tolerance relation and the tolerance classes is the essential step in formulating a tolerance rough set model. In the tolerance rough set model proposed by Ho et al. [6], the cooccurrence of terms over all the documents in the corpus was used to construct the tolerance relation. However, this suffers from two disadvantages: (1) whenever the corpus changes, even by adding or removing a single document, all the procedures need to be recalculated; (2) the time complexity is relatively high. Hence, we instead use the similarity between words as the tolerance relation. Generally, the semantic similarity between two words is defined as the cosine similarity between their word vectors [31]. When a document is added to or removed from the corpus, the cooccurrence counts of all the words change and must be recalculated, but the similarities between word vectors do not. This makes the model built on the new tolerance relation incremental. Following the algorithm flow of the tolerance rough set model, the time complexity of constructing the tolerance classes is also reduced.

For a positive threshold θ, 0 < θ ≤ 1, the uncertainty function I_θ(w_i) of a word w_i is defined as

I_θ(w_i) = {w_j ∈ U : sim(w_i, w_j) ≥ θ},

where sim(w_i, w_j) denotes the cosine similarity degree between the words w_i and w_j:

sim(w_i, w_j) = (v_i · v_j) / (‖v_i‖ ‖v_j‖),

where v_i and v_j denote the word vectors of w_i and w_j, respectively. It is evident that the uncertainty function I_θ satisfies the conditions of reflexivity (since sim(w_i, w_i) = 1 ≥ θ) and symmetry.
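A minimal Python sketch of this step is given below, assuming word vectors are available as a dict of numpy arrays; the toy vectors and the helper names are illustrative and not the authors' implementation.

```python
import numpy as np

def cosine_sim(v1, v2):
    """Cosine similarity between two word vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def tolerance_class(word, vectors, theta):
    """Tolerance class I_theta(word): all words whose cosine similarity with
    `word` reaches the threshold theta (the word itself is always included,
    so the relation is reflexive; cosine similarity is symmetric)."""
    return {w for w, v in vectors.items()
            if w == word or cosine_sim(vectors[word], v) >= theta}

# Toy word vectors standing in for pretrained embeddings such as word2vec.
vectors = {
    "kid":      np.array([0.9, 0.1, 0.0]),
    "children": np.array([0.8, 0.2, 0.1]),
    "leaves":   np.array([0.1, 0.9, 0.2]),
    "jumping":  np.array([0.2, 0.1, 0.9]),
}
print(tolerance_class("kid", vectors, theta=0.9))   # {'kid', 'children'}
```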

Here, we give a counterexample to illustrate that I_θ does not, in general, satisfy the property of transitivity. Using the word2vec embeddings trained by Google [32], one can find three words w_1, w_2, and w_3 and a cosine similarity degree threshold θ such that sim(w_1, w_2) ≥ θ and sim(w_2, w_3) ≥ θ while sim(w_1, w_3) < θ; that is, w_2 ∈ I_θ(w_1) and w_3 ∈ I_θ(w_2), but w_3 ∉ I_θ(w_1). So, we conclude that the uncertainty function does not satisfy the condition of transitivity.
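Such counterexamples can also be found mechanically. The sketch below, with toy two-dimensional vectors standing in for the Google word2vec embeddings, enumerates word triples that violate transitivity under a given threshold; all names and values are illustrative.

```python
import numpy as np
from itertools import permutations

def cosine_sim(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def transitivity_violations(vectors, theta):
    """Ordered word triples (a, b, c) with sim(a, b) >= theta and
    sim(b, c) >= theta but sim(a, c) < theta, i.e. counterexamples
    showing that the tolerance relation is not transitive."""
    related = lambda x, y: cosine_sim(vectors[x], vectors[y]) >= theta
    return [(a, b, c) for a, b, c in permutations(vectors, 3)
            if related(a, b) and related(b, c) and not related(a, c)]

# Toy vectors: a ~ b and b ~ c at theta = 0.7, but a is not related to c.
vectors = {"a": np.array([1.0, 0.0]),
           "b": np.array([0.7, 0.7]),
           "c": np.array([0.0, 1.0])}
print(transitivity_violations(vectors, theta=0.7))  # [('a', 'b', 'c'), ('c', 'b', 'a')]
```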

The vague inclusion function ν is defined the same as in [6]:

ν(X, Y) = |X ∩ Y| / |X|.

Let D = {S_1, S_2, ..., S_n} be a collection of sentences, where each sentence S_j is represented by a group of words from the universe U, S_j = {w_{j1}, w_{j2}, ..., w_{jm}} ⊆ U. Then, the fuzzy membership function of a word w_i with respect to S_j is expressed as

μ(w_i, S_j) = ν(I_θ(w_i), S_j) = |I_θ(w_i) ∩ S_j| / |I_θ(w_i)|.

Suppose that all the tolerance classes of words are structural subsets throughout the whole process, i.e., P(I_θ(w_i)) = 1 for any w_i ∈ U. Then, we define the upper approximation and lower approximation in R = (U, I_θ, ν, P) of any X ⊆ U as

U_α(X) = {w_i ∈ U : ν(I_θ(w_i), X) > α},
L_β(X) = {w_i ∈ U : ν(I_θ(w_i), X) ≥ β},

where 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and α < β. The upper and lower approximations of X in R are also written as U_α(R, X) and L_β(R, X).
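Under the reconstruction above, with the vague inclusion ν(X, Y) = |X ∩ Y| / |X|, a hedged sketch of the probabilistic approximations of a sentence might look as follows; the helper names, the toy tolerance classes, and the parameter values are illustrative rather than taken from the paper.

```python
def vague_inclusion(X, Y):
    """nu(X, Y) = |X ∩ Y| / |X|: degree to which X is included in Y."""
    return len(X & Y) / len(X) if X else 0.0

def approximations(sentence, tol_classes, alpha, beta):
    """Probabilistic upper/lower approximations of a sentence (a set of words).
    tol_classes maps each word of the universe to its tolerance class I_theta(w);
    alpha and beta control how permissive the upper/lower approximations are."""
    S = set(sentence)
    upper = {w for w, I in tol_classes.items() if vague_inclusion(I, S) > alpha}
    lower = {w for w, I in tol_classes.items() if vague_inclusion(I, S) >= beta}
    return upper, lower

# Toy tolerance classes for a four-word universe.
tol_classes = {
    "kids":     {"kids", "children"},
    "children": {"kids", "children"},
    "jumping":  {"jumping"},
    "leaves":   {"leaves"},
}
upper, lower = approximations(["kids", "jumping", "leaves"], tol_classes,
                              alpha=0.0, beta=0.6)
print(upper)  # expanded semantics: {'kids', 'children', 'jumping', 'leaves'}
print(lower)  # core semantics:     {'jumping', 'leaves'}
```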

If X is regarded as a certain concept given by a vague description of features, then U_α(X) can be explained as the collection of concepts that share some semantics with X, and L_β(X) can be explained as the collection of the core concepts of X. The probability values α and β can be used to adjust the accuracy of the upper approximation and lower approximation.

Each sentence is denoted by two fuzzy sets, one on its upper approximation and one on its lower approximation. Assume that one sentence S_j is made up of a collection of words, S_j = {w_{j1}, ..., w_{jm}}; then the upper approximation and lower approximation of S_j are represented by

U_α(S_j) = {w_i ∈ U : ν(I_θ(w_i), S_j) > α},
L_β(S_j) = {w_i ∈ U : ν(I_θ(w_i), S_j) ≥ β}.

Considering the membership degrees only, the upper and lower approximations of S_j can also be written as fuzzy sets over the universe, where μ_U(w_i) and μ_L(w_i) denote the membership degrees of w_i to U_α(S_j) and L_β(S_j), respectively. The upper approximation represents the expanded semantics of sentence S_j, capturing the latent semantics that S_j contains. The similarity between two sentences can therefore be measured by both the upper approximation similarity and the lower approximation similarity of the two sentences; from these two perspectives, the expanded semantic similarity and the core semantic similarity can both be captured sufficiently. Since each sentence is represented by two fuzzy sets, we employ two measurements to calculate the similarity between the corresponding fuzzy sets, defined as follows.

Measurement 1.

Measurement 2.

On the basis of the upper and lower approximations of the two sentences, besides representing each sentence by a pair of fuzzy sets, we propose another method to measure the similarity. Given the elements of the upper and lower approximations of sentences S_1 and S_2, a new similarity degree measurement is defined as follows.

Measurement 3.

The lower similarity determines the degree to which two sentences are assuredly similar; correspondingly, the upper similarity determines the degree to which two sentences are possibly similar. To measure the final similarity degree of the two sentences, we use the linear combination of the upper and lower approximation similarities, which is given as

Sim(S_1, S_2) = λ · Sim_U(S_1, S_2) + (1 − λ) · Sim_L(S_1, S_2),

where λ, 0 ≤ λ ≤ 1, is the linear coefficient: λ indicates the proportion of the upper approximation similarity degree and 1 − λ indicates the proportion of the lower approximation similarity degree. Since the lower approximation is composed of the core semantics, the lower approximation similarity degree is assigned a higher weight than the upper approximation similarity degree; generally, λ ≤ 0.5 (Algorithm 1).
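The paper's Measurements 1-3 are given by equations (12)-(17). As an illustration of the overall computation only, the sketch below uses a generic fuzzy-set similarity (sum of minima over sum of maxima) as a stand-in for those measurements, together with the linear combination of the upper and lower similarities; it reuses the approximations helper from the previous sketch, and all names are illustrative.

```python
def memberships(words, sentence, tol_classes):
    """Fuzzy membership mu(w, S) = |I_theta(w) ∩ S| / |I_theta(w)| per word."""
    S = set(sentence)
    return {w: len(tol_classes[w] & S) / len(tol_classes[w]) for w in words}

def fuzzy_jaccard(mu1, mu2):
    """Stand-in fuzzy-set similarity (sum of min / sum of max); the paper's
    Measurements 1-3 may differ -- substitute equations (12)-(17) here."""
    words = set(mu1) | set(mu2)
    num = sum(min(mu1.get(w, 0.0), mu2.get(w, 0.0)) for w in words)
    den = sum(max(mu1.get(w, 0.0), mu2.get(w, 0.0)) for w in words)
    return num / den if den else 0.0

def sentence_similarity(s1, s2, tol_classes, alpha, beta, lam):
    """Linear combination: lam * upper similarity + (1 - lam) * lower similarity."""
    (u1, l1), (u2, l2) = (approximations(s, tol_classes, alpha, beta)
                          for s in (s1, s2))
    sim_upper = fuzzy_jaccard(memberships(u1, s1, tol_classes),
                              memberships(u2, s2, tol_classes))
    sim_lower = fuzzy_jaccard(memberships(l1, s1, tol_classes),
                              memberships(l2, s2, tol_classes))
    return lam * sim_upper + (1 - lam) * sim_lower
```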

Algorithm 1: Probabilistic tolerance rough sets-based sentence similarity model
Input: A collection of sentences D = {S_1, S_2, ..., S_n}.
Parameters: The cosine similarity degree threshold θ; the probability values α and β; the linear combination parameter λ.
Output: The similarity degree Sim(S_1, S_2) between S_1 and S_2.
(1) Preprocess the sentence corpus D and generate the universe U containing all the distinct words of the corpus.
(2) Compute the uncertainty function I_θ(w_i) of each word in the universe according to equation (3).
(3) Suppose that the similarity degree between sentences S_1 and S_2 is to be calculated. Apply equation (6) to calculate the fuzzy membership degree of each word with respect to S_1 and S_2.
(4) Obtain the upper approximation U_α(S_1) and lower approximation L_β(S_1) of S_1 according to equations (7) and (8). Similarly, acquire U_α(S_2) and L_β(S_2).
(5) Represent the upper and lower approximations of S_1 and S_2 as fuzzy sets according to equations (9) and (10).
(6) Calculate the upper approximation similarity and the lower approximation similarity between S_1 and S_2 according to equations (12)-(17), i.e., by each of the three measurements.
(7) Obtain the final sentence similarity degree between S_1 and S_2 using the linear combination in equation (18).
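To make the flow of Algorithm 1 concrete, the hypothetical snippet below strings the helpers sketched earlier (tolerance_class and sentence_similarity) together on a toy two-sentence corpus; the word vectors, sentences, and parameter values are illustrative and not taken from the paper's experiments.

```python
import numpy as np

# Step 1: toy preprocessed corpus; real experiments would tokenize SICK/STS pairs.
s1 = ["kids", "jumping", "leaves"]
s2 = ["children", "jumping", "leaves"]
universe = set(s1) | set(s2)

# Illustrative word vectors; pretrained word2vec embeddings would be used instead.
vectors = {
    "kids":     np.array([0.90, 0.10, 0.00]),
    "children": np.array([0.85, 0.15, 0.05]),
    "jumping":  np.array([0.10, 0.90, 0.10]),
    "leaves":   np.array([0.00, 0.10, 0.90]),
}

# Step 2: tolerance class of every word in the universe.
tol_classes = {w: tolerance_class(w, vectors, theta=0.9) for w in universe}

# Steps 3-7: memberships, approximations, measurement, and linear combination.
print(sentence_similarity(s1, s2, tol_classes, alpha=0.0, beta=0.6, lam=0.3))
# -> 1.0 for this toy pair, since the two sentences share the same core semantics
```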

Example 1. Here, we give an example of our proposed method for calculating sentence similarity. Assume that the corpus contains four sentences. After preprocessing every sentence, 9 distinct words remain in the corpus, and we let the universe U be the set of these words. We then illustrate the proposed probabilistic tolerance rough sets-based sentence similarity model by computing the similarity degree between two of these sentences, denoted S_1 and S_2. Here, we set the similarity degree threshold θ and the probability values α and β; the upper and lower approximations of the two sentences are shown in Table 1.
The upper approximation similarity degrees and lower approximation similarity degrees obtained by the three proposed measurements are listed in Table 2.
With the linear combination coefficient λ, the final similarity degrees between S_1 and S_2 under the three measurements are then obtained. It is apparent that our proposed probabilistic tolerance rough sets-based sentence similarity algorithm reflects the similarity relation between sentences well. Firstly, from the sentences S_1 and S_2, it is evident that both of them express the core semantics of "jumping" and "leaves," exactly the lower approximation obtained by our algorithm. Secondly, the lower approximation similarity degree is computed to be 1, which means that S_1 and S_2 share the same core meaning. Thirdly, from the upper approximation of S_1, it can be seen that the word "children" did not originally belong to S_1, but the meaning of "children" is mined by our method. The new meaning "children" comes from the tolerance class of the word "kids," so, in a sense, "children" is an explanation of "kids." Therefore, our proposed method can capture some latent semantics behind texts through the upper approximation, which helps distinguish whether two sentences are similar from a more general perspective. Analogously, it can refine the core semantics of texts through the lower approximation, which allows sentence similarity to be analyzed from a more accurate perspective.

Example 2. We apply the traditional tolerance rough set model [6] to Example 1 for comparison. The word cooccurrence threshold is set to 2. The resulting upper and lower approximations can be seen in Table 3.
Table 4 displays the corresponding upper and lower approximation similarity degrees.
Then, the sentence similarity degrees under the three measurements are computed in the same way. From the results, we can see that the traditional model performs worse than our methods.
Then, we consider the case in which one new sentence is added to the corpus of Example 1. With the probabilistic tolerance rough sets, the whole computational process and its results do not change. With the traditional tolerance rough sets, however, the procedure has to be repeated from the calculation of the uncertainty function; the new upper and lower approximations of S_1 and S_2 are illustrated in Table 5. Thus, the applicability of the model in [6] is greatly reduced.

4. Experimental Results and Discussion

In this section, we use the SICK2014 task and the STSbenchmark dataset to evaluate the performance of our methods.

4.1. Dataset and Preprocessing

SICK2014 [33] is a dataset for the similarity evaluation of sentence pairs, which contains a training set, a trial set, and a testing set, for a total of 15000 sentence pairs. Since our proposed model is unsupervised and does not require additional training on the dataset, we select the 5000 sentence pairs of the training set for the experiments. Each sentence pair has been assigned a similarity score from 0 to 5 by experts. Table 6 shows two examples from the SICK2014 dataset.

STS is the abbreviation for Semantic Textual Similarity. The STSbenchmark dataset is drawn from the SemEval STS datasets from 2012 [34] to 2017 [35]. Each sentence pair has been assigned a similarity score from 0 to 5 by experts. STS-train, STS-dev, STS-test, and MSRvid are chosen for the experiments.

For better comparison of the experimental results, we normalize the similarity scores. We take the word embeddings trained by Google [32] as the word vectors in the experiments.

4.2. Evaluation Metrics

We exploit the Pearson correlation coefficient (Pcc) [36] and mean square error (MSE) [37] to evaluate the performance of sentence similarity measurements.

Pcc is a linear correlation coefficient that reflects the linear correlation of two variables. For two variables X and Y, the mathematical expression of Pcc is defined as

Pcc(X, Y) = cov(X, Y) / (σ_X σ_Y) = E[(X − E(X))(Y − E(Y))] / (σ_X σ_Y),

where cov(X, Y) is the covariance of X and Y, σ_X and σ_Y denote the standard deviations of X and Y individually, and E(·) refers to the mathematical expectation. The greater the absolute value of Pcc, the stronger the correlation.

MSE is a measure reflecting the degree of difference between estimated values and real values. The definition of MSE is

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²,

where N is the sample size, y_i is the real value, and ŷ_i is the estimated value. A smaller value of MSE demonstrates a smaller deviation between the estimated values and the real values.
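As a quick illustration, both metrics can be computed directly with numpy and scipy; the scores below are made up for the example only.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(predicted, gold):
    """Pearson correlation and mean squared error between predicted
    similarity scores and (normalized) gold annotations."""
    predicted, gold = np.asarray(predicted), np.asarray(gold)
    pcc, _ = pearsonr(predicted, gold)
    mse = float(np.mean((predicted - gold) ** 2))
    return pcc, mse

# e.g. gold SICK scores in [0, 5] normalized to [0, 1] before comparison
print(evaluate([0.9, 0.2, 0.6], [4.5 / 5, 1.0 / 5, 3.5 / 5]))
```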

4.3. Experimental Results and Analysis

We proposed three sentence similarity measurements based on the probabilistic tolerance rough set model. The performances on the SICK2014 dataset are displayed in Table 7. In the table, BERT-768 and BERT-1024 are two different BERT models for sentence representation, with sentence similarity calculated by cosine similarity; Fuzzy rough is the model proposed in [12]. As can be seen in Table 7, the three measurements on the whole perform much better than the other three models. The results of Measurement 3 achieve the best performance, with a Pcc of 0.725 and an MSE of 0.033. The MSE value, in particular, shows that the error between the sentence similarity degrees calculated by our methods and the real values is very small.

Tables 8 and 9 show the Pcc and MSE results on the STSbenchmark dataset. From the tables, we can see that all three measurements perform much better than BERT on the four subsets of STSbenchmark. The reason is that our models capture more of the latent semantics behind the sentences. Therefore, the experimental results confirm the efficiency and applicability of our methods.

4.4. Cosine Similarity Degree Threshold

In our improved probabilistic tolerance rough set model, the cosine similarity degree threshold θ controls the precision of the uncertainty function. The higher the value of θ, the more precise the uncertainty function. However, too high a value of θ results in inadequate semantic mining, while too small a value of θ leads to more redundant and noisy information. The effect of θ on Example 1 can be seen in Table 10.

Figures 1 and 2 show the effect of different cosine similarity threshold values θ on Pcc and MSE, respectively. In this experiment, α is set to 0, β is set to 0.6, and λ is set to 0.3, while θ ranges from 0.5 to 1. As shown in Figure 1, the value of Pcc first increases with θ, reaches its peak, and then decreases. Similarly, the value of MSE first decreases, reaches its minimum, and then increases. We can conclude that the effect of θ matches the analysis above.

4.5. Probability Value

The values of α and β are used to adjust the precision of the upper and lower approximations. α determines the range of the upper approximation: the smaller the value of α, the more elements the upper approximation set has. In the traditional tolerance rough sets, α is 0, with which the upper approximation contains the most information. Adjusting the value of α can reduce the generation of redundant information without losing too much latent semantic information. The influence of α on Example 1 can be observed in Table 11. Similarly, β determines the range of the lower approximation: a larger value of β leads to fewer elements in the lower approximation set. When β = 1, the fewest elements are included in the lower approximation, which may cause the loss of some core semantic information. Adjusting β properly allows the core semantic information to be mined more adequately. The effect of β on Example 1 can be observed in Table 12.

5. Conclusion

In this paper, owing to the uncertainty inherent in text data, we incorporate probabilistic tolerance rough sets to establish a novel sentence similarity computation model. Because the traditional tolerance rough set model is not incremental and has high time complexity, we improve it, making the model incremental and reducing its time complexity. By introducing the probability values α and β, the accuracy of the upper and lower approximations can be adjusted. The upper and lower approximations are used to represent every sentence, and on this basis, three sentence similarity calculation measurements are proposed. The upper approximation similarity and the lower approximation similarity of each sentence pair are calculated separately, and their linear combination gives the total sentence similarity. On the one hand, the model can dig out more latent semantic information than traditional methods based on shallow semantics. On the other hand, it is unsupervised, which relieves the defect of supervised deep learning-based methods. We carry out experiments on the SICK2014 task and the STSbenchmark dataset to evaluate the performance of the proposed model. The results verify its efficiency and applicability.

The proposed model is established without considering the order of words within sentences; incorporating word order is left for future work.

Data Availability

The SICK2014 task data used to support the findings of this study are available from clic.cimec.unitn.it/composes/sick.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant nos. 11671001 and 61876201).