[Retracted] Accurate Estimation of English Word Similarity Based on Semantic Network

Gong, Qifeng

doi:https://doi.org/10.1155/2021/8112735

Security and Communication Networks



This article has been retracted.

Special Issue: Massive Machine-Type Communications for Internet of Things

Research Article | Open Access

Volume 2021 | Article ID 8112735 | https://doi.org/10.1155/2021/8112735


Academic Editor: Jian Su

Received 03 Aug 2021

Revised 29 Aug 2021

Accepted 02 Sept 2021

Published 24 Sept 2021

Abstract

Applying artificial intelligence to English requires processing large amounts of English text, but errors in estimating English word similarity reduce both translation accuracy and data processing efficiency. This paper therefore proposes a method for accurately estimating English word similarity based on a semantic network, combining several computing methods into a compound computing structure. The experimental results show that the error between the semantic network-based similarity calculation and manual evaluation is small, and that the accuracy of English word similarity calculation is improved to a certain extent. In addition, compared with other English word similarity calculation methods, the semantic network-based method is more in line with people’s cognition and understanding of knowledge, is more reliable, and has practical value in the field of English.

1. Introduction

With the continuous development of the international economy, cooperation between countries is increasing and deepening in all respects. People can obtain a large amount of information from all over the world through the Internet and are no longer limited to communicating and learning in a single language. As the most widely used language in the world, English appears in many fields, and there are more and more occasions to express and communicate in English [1]. English learning and English translation have therefore attracted extensive attention. With the development of artificial intelligence, new ways of learning and translating English have emerged. To deliver high quality and high efficiency, AI applications must process English information quickly and effectively [2]. On the one hand, word similarity helps AI quickly understand groups of similar text; on the other hand, a large error in recognizing word similarity degrades the recognition of English words and text by AI [3]. Word similarity estimation is also widely used in machine translation [4]. In addition, nonnative learners need to understand and learn similar English words, which requires the help of an English dictionary [5]. With the development of natural language processing technology, many semantic resources for English words have come into wide use. Based on these semantic dictionaries, many studies on the semantics of English words have proposed corresponding methods for calculating English word similarity, which attracted extensive attention for a time.
However, many of these similarity calculations have large errors owing to factors such as updates to English semantic dictionaries, refinements of the relationships between words, and the choice of similarity calculation model [6]. How to improve the estimation accuracy of English word similarity has therefore become a hotspot in English semantic research. Related work exists for other languages as well: Netisopakul et al. [7] created three Thai word similarity datasets by translating and rerating the popular WordSim-353, SimLex-999, and SemEval-2017-Task-2 datasets, and Ihm et al. [8] proposed a word embedding method for Korean, called Skip-gram-KR, together with a Korean affix tokenizer. Skip-gram-KR creates similar word training data through backward mapping and the two-word skipping method.

This paper proposes an accurate estimation of English word similarity based on semantic network and is mainly divided into three parts. The first part introduces the concepts of semantic network and word similarity and describes their development status. The second part describes the semantic network model of English words and introduces the corresponding calculation methods of English word similarity. The third part reports an experiment on the accurate estimation of English word similarity based on semantic network and analyzes the experimental data.

Our contribution is threefold:
(1) This paper proposes an accurate estimation of English word similarity based on semantic network, which combines a variety of computing methods to form a compound computing structure based on semantic network.
(2) The experimental results show that the error between the semantic network-based English word similarity calculation method and manual evaluation is small, and the accuracy of English word similarity calculation is improved to a certain extent.
(3) Compared with other English word similarity calculation methods, the English word similarity calculation method based on semantic network is more in line with people’s cognition and understanding of knowledge, has higher reliability, and has certain practical value in the field of English.

2. Related Work

Semantic network is one of the best-known knowledge representation models. Its core idea is to help understand domain knowledge and draw corresponding inferences through an abstract relational structure [9]. Semantic networks usually express data as graphs, in which nodes and typed relationships allow knowledge to be expressed and stored in a form that human beings can understand [10]. The earliest abstract graph structures appeared in higher mathematics in the nineteenth century, when mathematicians constructed abstract relational networks of algebra through tree structures. At the same time, it was shown that the key to solving the core problem was to analyze the structure itself rather than the meaning of the relations [11]. In the following decades, logic diagram systems developed continuously. In 1956, the first semantic network in the computer field was created and applied to machine translation of natural language [12]. The emergence of artificial intelligence in the 1960s brought unprecedented development to semantic networks and made them a research hotspot in the field. In later development, the semantic network model was continuously optimized, gradually solving the problems of logical conjunction and description, and the focus shifted from knowledge representation systems to strict logical semantic reasoning [13]. With the development of Internet technology, the concept of the Semantic Web, with Web data at its core, was put forward in 1998; in essence it is still a semantic network [14]. In 2012, the concept of the knowledge graph was formally put forward. It resembles the Semantic Web in terms of visualization. The difference is that the knowledge graph focuses more on knowledge retrieval, while the Semantic Web focuses more on computer-oriented search.

Research on and application of word similarity have a long history. There are two main families of calculation methods. The first is context based: it collects statistics from a large-scale corpus or from word definitions and evaluates word similarity from them [15]. The second calculates similarity from the relationship structure and hierarchy in a dictionary. Based on WordNet, some scholars estimate similarity from the shortest distance between synonym sets and the number of path transformations [16]. Other scholars use an ontology together with a corpus and calculate similarity from the range of common nodes and the information carried by the synonym sets [17]. Still others extract synonyms and then calculate the similarity between words with the vector space method [18]. However, these methods depend on the English dictionary used and estimate similarity only from the semantic side of English words, which makes them prone to large errors. To address this problem, some scholars have proposed similarity calculations based on semaphore and feature structure and achieved good results. Nevertheless, as computing methods develop, there is still considerable room to improve the estimation of English word similarity.

3. Methodology

According to the theory of cognitive psychology, the English words learned by the human brain are stored not as isolated items but as a network. When new information arrives, that is, when a new English word is learned, it stimulates the corresponding part of the knowledge stored in the brain, becomes associated with it, and can then be identified and recognized. In such networks, each English word has its own position, which can be regarded as a node. These nodes are connected to form knowledge networks, one of which is the semantic network. The semantic network and the other networks are independent yet interrelated. When the cognitive system activates a node, that is, an English word, it quickly retrieves the relevant knowledge around that word. Accurately estimating the similarity of two English words requires comparing the knowledge and content related to each of them in order to calculate an overall similarity, which resembles the way the human brain learns and discriminates English words. Therefore, this paper estimates the similarity of English words based on a semantic network.

3.1. Semantic Network Model of English Words

The semantic network model of English words can be divided into hierarchical network model and activation diffusion model. Both of them store the concept of English words in network nodes, emphasizing the interconnection, activation, and suppression between nodes and between nodes and networks.

The hierarchical network model has a hierarchical network structure and was proposed on the basis of a computer model of speech understanding. Its basic unit is the concept, and concepts are arranged into a hierarchy according to their logical superordinate and subordinate relationships. Each concept also has its own characteristic features. The most general concept or concepts sit in the top layer of the network, the concepts of the next level sit in the next layer, and so on, forming a hierarchical network. Figure 1 shows a schematic diagram of the hierarchical network model.

The hierarchical network model focuses on the category and attribute relationships of English words. Different words occupy different nodes in the network. For example, the bird node and the fish node are in the same layer, below the animal node. The model includes not only the vertical relationship between superordinate and subordinate word meanings but also the lateral relationship between similar word meanings in the same layer, so that it forms a hierarchical system with a clear structure. Such a network stores as much information as possible in a limited space and allows word-related information to be retrieved and applied quickly.
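The bird, fish, and animal example can be made concrete with a small sketch. The hierarchy below is a toy illustration, and the helper names (`PARENT`, `ancestors`, `depth`, `path_length`) are hypothetical; the paper does not specify a data structure.

```python
# Toy is-a hierarchy (child -> parent); the entries are illustrative only.
PARENT = {
    "bird": "animal",
    "fish": "animal",
    "canary": "bird",
    "shark": "fish",
}


def ancestors(node):
    """Return the chain from node up to the root, with node first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth of a node, with the root at depth 0."""
    return len(ancestors(node)) - 1


def path_length(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("nodes are not in the same hierarchy")
```

Here `bird` and `fish` share a layer (both have depth 1) and are two edges apart through `animal`, which is the kind of layer and path information a hierarchy-based similarity measure can draw on.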

Compared with the hierarchical network model, the activation diffusion network has a more complex conceptual relationship network, built on semantic connection or similarity between concepts. The concept of each English word occupies a node in the network, nodes are connected through the semantic relationships of the words, and the closeness of the relationship between two nodes is expressed by the length of the line connecting them: the shorter the line, the closer the relationship and the higher the similarity. When the concept of an English word needs to be retrieved or processed, its node is activated, and the activation spreads to other nodes along the connecting lines. The closer a node is to the source, the stronger its activation; the farther away it is, the weaker its activation. Figure 2 shows a schematic diagram of the activation diffusion network.

The advantage of the activation diffusion network model is that any node can be selected as the starting point of activation without considering the context of the English word. This flexibility allows the model to accommodate more uncertainty and ambiguity and to retrieve the concepts and features of English words through multiple paths.
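The spreading behavior described for the activation diffusion network can be sketched as follows. This is a minimal illustration, not the paper's model: the toy links, the cutoff `max_distance`, and the decay rule `1 / (1 + distance)` are all assumptions chosen only so that activation weakens with link distance.

```python
import heapq

# Undirected toy network; values are link lengths (shorter = more closely related).
LINKS = {
    ("doctor", "nurse"): 1.0,
    ("doctor", "hospital"): 1.5,
    ("nurse", "hospital"): 1.0,
    ("hospital", "building"): 3.0,
}


def neighbours(node):
    """Yield (neighbour, link length) pairs for a node."""
    for (a, b), length in LINKS.items():
        if a == node:
            yield b, length
        elif b == node:
            yield a, length


def spread_activation(source, max_distance=5.0):
    """Activate source and let activation fade with shortest link distance.

    The decay 1 / (1 + distance) is an assumption made for this sketch.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, length in neighbours(node):
            new_dist = dist + length
            if new_dist <= max_distance and new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(queue, (new_dist, nxt))
    return {node: 1.0 / (1.0 + d) for node, d in best.items()}
```

Any node can serve as `source`, which mirrors the flexibility noted above; nodes beyond `max_distance` simply receive no activation.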

3.2. Calculation Method of English Word Similarity Based on Semantic Network

There are two methods to calculate the similarity of English words. One is context based: it collects statistics from a large-scale corpus or from word definitions in order to evaluate similarity. The other calculates similarity from word relatedness and hierarchical structure in an English dictionary. With the dictionary-based method, estimating the similarity of two English words requires first determining the synonym sets to which the two words belong, then pairing the synonym sets of the two words and calculating the similarity of each pair, and finally estimating the word similarity from the results for those pairs. Take WordNet as an example.
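The three steps (find the synonym sets, score each pair, combine the pair scores) can be sketched with toy data. Everything here is hypothetical: the sense inventory, the pairwise scores, and in particular the use of the maximum as the combining rule, which is a simplification and not the weighting the paper describes.

```python
from itertools import product

# Hypothetical sense inventory: word -> synonym set identifiers.
SENSES = {
    "paper": ["paper.material", "paper.article"],
    "article": ["article.text", "article.item"],
}

# Hypothetical pairwise synonym set similarities in [0, 1]; unlisted pairs count as 0.
SYNSET_SIM = {
    ("paper.article", "article.text"): 0.9,
    ("paper.material", "article.item"): 0.3,
}


def synset_similarity(s1, s2):
    """Look up a pair score in either order."""
    return SYNSET_SIM.get((s1, s2), SYNSET_SIM.get((s2, s1), 0.0))


def word_similarity(w1, w2):
    """Score every synonym set pair and keep the best one (max is an assumption)."""
    pairs = product(SENSES.get(w1, []), SENSES.get(w2, []))
    return max((synset_similarity(a, b) for a, b in pairs), default=0.0)
```

With real data, the two dictionaries would be replaced by lookups in a lexical resource such as WordNet and by an actual synonym set similarity function.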

The synonym sets in WordNet constitute the superordinate and subordinate relationship network of the hierarchical network model. In this network, the farther apart two synonym sets are, the lower their semantic similarity; the higher the local density of synonym sets, the finer the division of local words and the lower the similarity; and the deeper the level, the more specific the concept described and the greater the similarity. To better estimate the similarity between English words, three measurement factors are therefore introduced: distance, density, and depth. The distance factor is given by formula (1), which takes as inputs the distance between the two synonym sets and a threshold parameter. According to the formula, the distance factor becomes smaller as the distance becomes larger, and it is 0 when the distance is greater than the threshold.
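The text states only two properties of the distance factor: it shrinks as the distance grows, and it is 0 beyond the threshold. The sketch below uses a linear decay as one function with those properties; the linear form and the name `distance_factor` are assumptions, not the paper's formula (1).

```python
def distance_factor(distance, threshold):
    """Illustrative distance factor: falls linearly from 1 to 0 at the threshold.

    The linear shape is an assumption; only the monotone decrease and the
    zero value beyond the threshold come from the text.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return max(0.0, 1.0 - distance / threshold)
```

For example, with a threshold of 10, a distance of 5 gives 0.5 and any distance above 10 gives 0.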

The relationship between density and word similarity is inverse: the greater the density, the lower the similarity. Density is calculated from the number of nodes around the local English words. Starting from the two selected nodes and moving up the network, three layers are taken, and the number of nodes contained in each layer is counted. If the two upward paths meet, the process ends and the node count of any layer above the meeting point is 0. The final number of local nodes is calculated by formula (2) from the number of nodes in the starting layer and the numbers of nodes in the successive upper layers.

The density factor is calculated by formula (3).

According to formula (3), the larger the number of local nodes, the smaller the density factor.

The depth factor is given by formula (4), which compares the depth of the node with the average depth of all nodes in the whole semantic network. When the depth exceeds the average, the depth factor is positive; when it is below the average, the depth factor is negative. After the three influence factors are taken into account, the similarity of two synonym sets is calculated by formula (5).

Formula (5) includes a boundary case in which a fixed value is taken. The density factor and the depth factor enter the formula through their average values, each with its own weight.

If the number of synonym sets is large, one or more pairs of synonym sets will have high similarity, so the weight of the first pair must be kept from becoming too high; otherwise the overall balance will contain large errors. Formula (6) adjusts its parameter adaptively according to the number of synonym set combinations: the parameter decreases as the number of synonym sets increases.

When estimating the similarity of English words in WordNet, it is necessary, in addition to the synonym set, to extract features from synonyms, generic (category) words, and the meaning interpretation. Formula (7) defines the feature set of an English word whose similarity is to be calculated as the combination of all its synonyms, all its relevant categories, and all notional words in its interpretation.

The similarity of English words is calculated from the features obtained in formula (7) and the distances in the three feature spaces. The meaning similarity is given by formula (8), whose terms are the order (index) of the meaning, the reciprocal of documents after training, and three feature weights tied to the constants 1.5, 1, and 0.5 for the synonym, generic word, and meaning interpretation features, respectively.

Formula (9) gives the word similarity; it depends on the numbers of meanings of word 1 and word 2.

In addition to WordNet, the semantic network has other concrete representation models, and calculating English word similarity requires integrating multiple sources of knowledge. Formula (10) does so for the two words to be compared, using two adjustable parameters.

The similarity calculation based on an English vocabulary knowledge base is given by formula (11), which uses the numbers of meanings of the two English words being compared. Formula (12) gives the semantic similarity of English words in terms of the weight values set for different levels, the number of branch nodes, and the distance between branches.

To alleviate the data sparsity problem in the similarity comparison of English words, word vectors are used to better represent the meaning of English words. The similarity of two English words is expressed by the cosine of the angle between their vectors: the smaller the angle, the closer the cosine is to 1 and the higher the similarity of the two words, as shown in formula (13):

$$\mathrm{Sim}(w_1, w_2) = \cos\theta = \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert} \tag{13}$$

where $\mathbf{v}_1$ and $\mathbf{v}_2$ represent the feature vectors of the two English words $w_1$ and $w_2$ and $\theta$ represents the angle between the two feature vectors in the space.
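Cosine similarity between two word vectors can be computed directly from its definition. The sketch below uses plain Python lists; the function name `cosine_similarity` and the example vectors are illustrative, and real word vectors would come from a trained embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, give a value of 1 up to rounding, while orthogonal vectors give 0.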

English words are not indivisible wholes. They can be divided into root, prefix, and suffix. Therefore, the similarity of two English words can also be calculated by estimating the similarity of their prefixes, suffixes, and roots, as given by formula (14), which compares the affix sets of the two words.
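One simple way to compare two affix sets is set overlap. The sketch below uses the Jaccard index (shared elements divided by all distinct elements); this choice is an assumption for illustration and is not taken from the paper's formula (14). The segmentations in the usage note are hand-written examples.

```python
def affix_similarity(affixes_a, affixes_b):
    """Jaccard overlap of two collections of morphemes (root, prefixes, suffixes)."""
    a, b = set(affixes_a), set(affixes_b)
    if not a and not b:
        return 0.0  # no morphemes on either side: treat as no evidence of similarity
    return len(a & b) / len(a | b)
```

For instance, `affix_similarity({"un-", "happy", "-ness"}, {"un-", "happy", "-ly"})` shares two of four distinct morphemes and yields 0.5.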

4. Result Analysis and Discussion

The accurate estimation algorithm proposed in this paper is a compound similarity calculation method based on semantic network that combines four factors: synonym set, multiple knowledge, word vector, and word meaning. To evaluate it, 100 groups of words are selected, and the similarity of each group is manually evaluated in advance. The similarity range is [0,1], where 0 indicates that there is no similarity between the words and 1 indicates that the two words have the same meaning. The machine-calculated results are compared with the manual pre-evaluation results using formula (15), which measures the strength of the relationship between two variables. The larger the result, the stronger the relationship between the variables.

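$$\rho = 1 - \frac{6 \sum_{i=1}^{n} \left( R_{X_i} - R_{Y_i} \right)^2}{n \left( n^2 - 1 \right)} \tag{15}$$

In the formula, variable $X$ represents the machine-calculated similarity result, variable $Y$ represents the similarity result manually evaluated in advance, $n$ represents the number of test samples, $R_{X_i}$ represents the rank of the $i$th value of $X$, and $R_{Y_i}$ represents the rank of the $i$th value of $Y$.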
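The results section refers to this measure as the Spearman value. A minimal sketch of the Spearman rank correlation between machine scores and manual scores follows; it assumes there are no tied values within either list, and the function names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (largest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A value of 1 means the machine ranking matches the manual ranking exactly, and a value of -1 means the two rankings are reversed. When ties are possible, a library routine that handles them would be a safer choice than this closed form.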

Figure 3 shows the results and ranking of the similarity calculation between the English word “sanctity” and some other words. The machine similarity results in this paper are close to the results of the manual pre-evaluation, and the error between the two is small, which shows that the machine results are in line with people’s cognition. In addition, the machine similarity results are sorted from high to low by value, which has practical value in English information retrieval.

To further test the performance of the accurate estimation method of English word similarity based on semantic network, Figure 4 shows the similarity results for the same group of words under different coefficient combinations. Ten parameter combinations are used: the first group is set as (1,0,0,0), the second group is set as (0,1,0,0), the third group is set as (0,0,1,0), the fourth group is set as (0,0,0,1), the fifth group is set as (0,0,0.5,0.5), the sixth group is set as (0.5,0.5,0,0), the seventh group is set as (0.4,0.6,0,0), the eighth group is set as (0.3,0.7,0), the ninth group is set as (0,0.6,0.4), and the tenth group is set as (0,0,0.7,0.3). The figure shows that, although a single similarity calculation method can achieve some results on its own, its result is still lower than that of the composite calculation method. Except for the fifth group, the Spearman values obtained by the composite groups are higher than those of the first four groups. The composite calculation method can therefore improve the estimation accuracy of English word similarity to a certain extent.
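A coefficient combination such as (0.5,0.5,0,0) can be read as weights over the four component similarity scores. The sketch below shows one plausible reading, a normalized weighted sum; the combining rule, the order of the components in the tuple, and the example scores are all assumptions, since the text does not spell them out.

```python
def combined_similarity(scores, weights):
    """Weighted average of component similarity scores.

    scores and weights are equal-length sequences; the component order
    (synonym set, multiple knowledge, word vector, word meaning) is assumed.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(s * w for s, w in zip(scores, weights)) / total
```

With hypothetical component scores `(0.8, 0.6, 0.4, 0.2)`, the weights `(1, 0, 0, 0)` reproduce the first component alone, while `(0.5, 0.5, 0, 0)` averages the first two.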

Figure 5 compares the accurate estimation method of English word similarity based on semantic network with the word semantic similarity calculation method based on HowNet (Zhiwang). Both methods calculate the similarity of seven groups of English words. The comparison shows an increasing gap between the results of the two methods. The results of the HowNet-based method deviate considerably from people’s understanding of English words, especially for two groups, paper and article and paper and literature, which is related to its calculation method. The semantic network-based method gives similar results for paper and article and for paper and literature, both above 0.90. This shows that the compound calculation method and the adjustment of its parameters are effective, match people’s cognitive understanding of English words, and are reliable.

In addition, this paper conducts clustering experiments on the similarity calculation results obtained by the three algorithms, and the accuracy is shown in Table 1:

These experiments show that the algorithm proposed in this paper, which combines the advantages of the two similarity algorithms, performs better than the traditional algorithm in many cases. The experimental data show that it is superior to the algorithms of [7, 8] in the accuracy of the clustering experiments, which indicates that the algorithm has been improved to some extent. However, in professional fields the performance of the algorithm does not reach the ideal level: many proper nouns are not covered and their semantics are not correctly analyzed, so the calculation is inaccurate and the results lack a scientific basis. This is also an open problem in the field of text similarity calculation. In general, the method presented in this paper is satisfactory in terms of accuracy and efficiency.

5. Conclusion

With the development of Internet technology, people can easily obtain a large amount of information in different languages from all over the world, and English is one of the most widely used of these languages. The development of artificial intelligence technology has changed the way people learn and use English and related information, especially in English dictionary lookup and English translation. To process English text data quickly and efficiently, AI needs to identify English words accurately. However, there are still large errors in the calculation and recognition of English word similarity, which reduces the efficiency of artificial intelligence. Therefore, this paper proposes an accurate estimation of English word similarity based on semantic network. On this basis, a compound word similarity calculation method is constructed by integrating four methods: synonym set, multiple knowledge, word vector, and word meaning. The experimental results show that, with appropriate parameters, the method is more accurate than any single similarity calculation method, which shows that the compound method can improve the accuracy of English word similarity to a certain extent. At the same time, the error between the results of the algorithm and the manual pre-evaluation results is small, which is in line with people’s cognitive understanding. The method can also sort words automatically by degree of similarity, which has practical value in English information search. Compared with other English word similarity calculation results, the results based on semantic network are more reliable.
This shows that the method and the adjustment of its parameters help improve the calculation of English word similarity. However, the parameters of the method are difficult to determine, and relatively few data are available for comparison and application, so further applied experimental research is needed. In the future, we need to study parameter adjustment for this method to further improve the performance of the model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no known conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Q. N. Hong, P. Pluye, M. Bujold, and M. Wassef, “Convergent and sequential synthesis designs: implications for conducting and reporting systematic reviews of qualitative and quantitative evidence,” Systematic Reviews, vol. 6, no. 1, 61 pages, 2017.
[2] J. Zhang, S. Zhai, H. Liu, and J. A. Stevenson, “Social network analysis on a topic‐based navigation guidance system in a public health portal,” Journal of the Association for Information Science and Technology, vol. 67, no. 5, pp. 1068–1088, 2016.
[3] R. Qu, Y. Fang, W. Bai, and Y. Jiang, “Computing semantic similarity based on novel models of semantic representation using Wikipedia,” Information Processing & Management, vol. 54, no. 6, pp. 1002–1021, 2018.
[4] B. Zhang, D. Xiong, J. Xie, and J. Su, “Neural machine translation with GRU-gated attention model,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 11, pp. 4688–4698, 2020.
[5] G. Punj, “The relationship between consumer characteristics and purchase intention for general online content: implications for content providers considering subscription-based business models,” Marketing Letters, vol. 26, no. 2, pp. 175–186, 2015.
[6] F. Li, “Design of high similarity English words autonomous selection system,” Modern Electronics Technique, vol. 6, no. 40, pp. 147–150, 2017.
[7] P. Netisopakul, G. Wohlgenannt, and A. Pulich, “Word similarity datasets for Thai: construction and evaluation,” IEEE Access, vol. 7, pp. 142907–142915, 2019.
[8] S.-Y. Ihm, J.-H. Lee, and Y.-H. Park, “Skip-gram-KR: Korean word embedding for semantic clustering,” IEEE Access, vol. 7, pp. 39948–39961, 2019.
[9] J. Wang, X. Zuo, and W. Zuo, “Word semantic similarity measurement based on evidence theory,” Acta Automatica Sinica, vol. 41, no. 6, pp. 1173–1186, 2015.
[10] Y. Jia, H. Xu, and H. Min, “Automatic acquisition of semantic selection restriction knowledge based on neural network,” Journal of Chinese Information Processing, vol. 31, no. 1, pp. 155–161, 2017.
[11] J. An, K. Kim, L. Mortara, and S. Lee, “Deriving technology intelligence from patents: p,” Journal of Informetrics, vol. 12, no. 1, pp. 217–236, 2018.
[12] S. Muzaffar, P. Behera, and G. N. Jha, “A pfanalyzing case marker errors in English-Urdu machine translation,” Procedia Computer Science, vol. 96, no. C, pp. 502–510, 2016.
[13] T. Duman and A. S. Mattila, “The role of affective factors on perceived cruise vacation value,” Tourism Management, vol. 26, no. 3, pp. 311–323, 2005.
[14] M. Grilo, I. S. Fadigas, J. G. V. Miranda, M. V. Cunha, R. L. S. Monteiro, and H. B. B. Pereira, “Robustness in semantic networks based on cliques,” Physica A: Statistical Mechanics and Its Applications, vol. 472, pp. 94–102, 2017.
[15] N. Xi, “Research on the design method of wireless music channel access network based on mobile Internet,” Revista De La Facultad De Ingenieria, vol. 32, no. 3, pp. 740–748, 2017.
[16] M. Boulares and M. Jemni, “Learning sign language machine translation based on elastic net regularization and latent semantic analysis,” Artificial Intelligence Review, vol. 46, no. 2, pp. 145–166, 2016.
[17] Z. Xue, D. Zhang, W. Lina, and J. Hao, “An improved sentence segmentation model for machine translation,” Journal of Chinese Information Processing, vol. 31, no. 4, pp. 50–56, 2017.
[18] M. C. Pattuelli and M. Miller, “Semantic network edges: a human-machine approach to represent typed relations in social networks,” Journal of Knowledge Management, vol. 19, no. 1, pp. 71–81, 2015.

Copyright

Copyright © 2021 Qifeng Gong. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
