Special Issue: Complexity Problems Handled by Advanced Computer Simulation Technology in Smart Cities 2021
Research Article | Open Access
Shan Xiao, Cheng Di, Pei Li, "Improvement and Analysis of Semantic Similarity Algorithm Based on Linguistic Concept Structure", Complexity, vol. 2021, Article ID 7322066, 11 pages, 2021. https://doi.org/10.1155/2021/7322066
Improvement and Analysis of Semantic Similarity Algorithm Based on Linguistic Concept Structure
With the rapid development of the information age, social groups and institutions produce large amounts of information every day. Semantic similarity algorithms emerged to store, identify, and manage such data more efficiently and reasonably. However, the accuracy of the traditional semantic similarity algorithm is relatively low, and its convergence is poor. To address this problem, this paper starts from the conceptual structure of language and takes two aspects as its starting point: the depth of the language structure and the distance between nodes. For information of a specific data resource description framework type, the weights of interconnected edges are used for impact analysis, so that semantic similarity can be analyzed across all of the information data. Building on these improvements, this paper also systematically establishes a data information modeling process based on the linguistic conceptual structure, together with the corresponding model. In the experimental part, the improved algorithm is simulated and analyzed. The simulation results show that the improved algorithm is clearly more accurate than the traditional algorithm.
1. Introduction
The rapid development of the information age means that the world produces large amounts of data at all times. Text is an important carrier of modern information, so processing and analyzing it efficiently and reasonably is very important. To further improve data processing speed and accuracy, text similarity processing technology has emerged, and the associated semantic similarity algorithms have become increasingly necessary [1, 2]. The traditional semantic similarity algorithm is mainly used in intelligent data retrieval, automatic question answering systems, and text retrieval systems. The difficulty of processing and analysis differs from language to language [3, 4]; generally speaking, semantic similarity processing is more difficult for Chinese than for English. The traditional semantic similarity algorithm is not very intelligent when dealing with current Chinese and English data: it mainly searches by keyword. Moreover, when the data volume is large or Chinese and English are mixed, the machine has essentially no method for effectively understanding the semantic relationships within the mass of data, which reduces the accuracy of search and processing. It is therefore important to improve the traditional semantic similarity algorithm so that different data structures and data systems can be effectively interconnected, the machine can fully understand the semantic associations among data, and the data can ultimately be processed effectively.
Among current approaches to data information processing, the core consensus is semantic similarity processing. Within this family of techniques, the currently popular data management technology has received considerable attention. In essence, it is still an improvement on the traditional semantic processing algorithm: it unifies the identifiers in the semantic network, standardizes the data format with the data resource description framework, and thereby interconnects various data. In the process of data management, the data management system still improves on the traditional semantic processing algorithm to some extent and associates otherwise isolated data, forming a whole data network and finally a global data space. On the basis of such an organic whole, semantic information can be searched and discovered quickly [5–8]. This is the core of current semantic similarity processing; its shortcomings are obvious, but to some extent it represents the development trend of data information processing, and other algorithms should build on it for subsequent improvement and optimization. Related research on improving the semantic similarity algorithm is introduced in Section 2. To solve the problems of the traditional semantic similarity algorithm, this paper studies its high-dimensional sparsity and accuracy. It analyzes the depth of the language structure and the node distance from the perspective of the linguistic conceptual structure and describes information of the framework type by using interconnection edge weights. From this perspective, semantic similarity analysis of the entire information data is realized [10, 11].
Building on these improvements, this paper also systematically establishes the modeling process of data information based on the linguistic conceptual structure and establishes the corresponding model. In the experimental part, the improved algorithm is simulated and analyzed. The simulation results show that the proposed algorithm judges similarity significantly more accurately than the traditional algorithm and that the problem of high-dimensional data sparsity is further alleviated.
The remainder of this paper is organized as follows. Section 2 analyzes the current research status of semantic similarity algorithms and points out the advantages and disadvantages of the various algorithms. Section 3 approaches semantic similarity from the depth of the language structure and the node distance, based on the conceptual theory of language structure, and systematically specifies the modeling process of data information under the linguistic conceptual structure. Section 4 simulates and analyzes the improved algorithm. Finally, the paper is summarized.
2. Related Work Analysis: Analysis of the Current Research Status of Semantic Similarity Algorithm
To address the search accuracy and convergence problems of semantic similarity algorithms in data information processing, many researchers and institutions have studied the topic and proposed different optimization algorithms. On search accuracy, researchers in the United States proposed a vector space model for implicit semantic search, which represents information data as a vector of feature weight components. It simplifies the complex relationship between a text and its keywords so that the text is represented by a simple vector. In practice, the model uses weights to reflect the importance of keywords, but this algorithm is computationally expensive and offers no advantage when the amount of data is large [12, 13]. European scholars proposed extending existing semantic similarity algorithms, focusing on similarity between sentences. Their approach uses two-level dynamic programming to calculate the similarity between two sentences of different lengths, but it needs to balance the similarity between phrases in real time [14, 15]. Other scholars have proposed machine learning or clustering algorithms. These more advanced algorithms deconstruct the ontology model underlying the text, but they need to analyze the whole, very large database in advance and then build an ontology model of the database, so the quality of that ontology model directly determines the quality of the whole algorithm. The algorithm has been widely used in practice, and further work has sought to solve the problems above.
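As a concrete illustration of the vector space model described above, the following minimal Python sketch represents each text as a vector of term weights and compares two texts by the cosine of the angle between their vectors. Raw term frequency as the weight, whitespace tokenization, and the function names term_weights and cosine_similarity are assumptions made for illustration; the cited work may weight and compare terms differently.

```python
import math
from collections import Counter


def term_weights(text: str) -> Counter:
    """Represent a text as a sparse vector: term -> weight.

    Assumption: the weight is the raw term frequency and tokens are
    split on whitespace after lowercasing.
    """
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse weight vectors (0.0 if either is empty)."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because every pair of documents requires a pass over their term vectors, the cost grows quickly with the size of the collection, which matches the drawback noted above for large amounts of data.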
Asian researchers summarized the shortcomings of the above research and proposed a similarity calculation algorithm based on text ontology. The algorithm mainly comprises the construction of a similarity structure, accurate extraction of the semantic content of the data, and similarity calculation. It has some application value and solves the problems of the traditional semantic similarity algorithm to a certain extent, but it still suffers from algorithm loss when the amount of data is large [16, 17]. Based on the structure of text information, other researchers proposed a distance algorithm. This algorithm first calculates the distance between texts in the ontology model: the farther apart two nodes are, the smaller their similarity. The algorithm relies on a complete semantic dictionary and its hierarchical structure, but the model covers only a small range of text data and lacks advantages in specialized text processing fields [18, 19]. Building on the distance algorithm, Chinese scholars proposed using semantic distance. On this basis, a Chinese-English hybrid semantic similarity algorithm was proposed, and knowledge from information theory was applied in practice [20–23].
3. Improvement Analysis of Semantic Similarity Algorithm Based on Linguistic Structure Concept
In this section, to address the problems of the traditional semantic similarity algorithm, we start from the linguistic concept structure and analyze semantic depth and distance to build an improved semantic similarity algorithm. The framework of the improved algorithm is shown in Figure 1. As the figure shows, the core of the improvement lies in analyzing the semantic depth between different text data and the distances between nodes. For information of the frame type described by special associated data resources, the weights of interconnected edges are used for impact analysis. The figure also shows that the semantic similarity algorithm is built on a specific algorithm model.
3.1. Language Structure Depth and Node Distance Trigger Analysis
To solve the high-dimensional sparsity problem and the accuracy problem of the traditional semantic similarity algorithm, this section uses the linguistic concept structure to optimize the algorithm in terms of text-dependent information features and node distance. The core architecture of the algorithm is shown in Figure 2, which illustrates the principle and calculation method of the two optimization points of the improved algorithm.
The depth analysis algorithm of language structure and the node distance analysis algorithm shown in Figure 2 are detailed below.
3.1.1. Deep Analysis Algorithm of Language Structure
The depth analysis algorithm of language structure is mainly a depth analysis of what texts have in common [24–26]. In essence, it uses the amount of information shared between two or more texts, together with their common information nodes, for depth analysis. By analyzing this shared information, it calculates the similarity between text words, and the core scheme is designed around the application scenarios in this paper. The similarity between text words is calculated by formula (1), where Wi and Wj represent two different texts and H(Wi) and H(Wj) represent the probability functions of the respective texts in the text vocabulary. Assuming that the texts are independent of each other, the text similarity probability function is the product of the independent functions.
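Formula (1) itself does not appear in this text. Read literally, the independence assumption stated above would give a joint probability of the following form; the left-hand notation H(W_i, W_j) is an assumption of this sketch, not notation confirmed by the source.

```latex
H(W_i, W_j) = H(W_i)\, H(W_j)
```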
Based on this core formula, the similarity between two words or texts is calculated from the information they share. The calculation is given by formula (2), in which the overall similarity serves as the reference for the semantic similarity algorithm and Ha(Wi, Wj) represents the set of information shared by the two texts. According to the formula, when the two texts are identical, the similarity evaluates to 1.
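Formula (2) is not reproduced in this text, so the exact expression cannot be confirmed. As a hedged illustration of a shared-information measure with the property described above (identical texts score 1), the Python sketch below swaps in the well-known Lin-style ratio, twice the information content of the shared concept divided by the summed information content of the two items. The function names and the use of -log(p) as information content are assumptions of this sketch, not the authors' definition.

```python
import math


def information_content(probability: float) -> float:
    """Information content -log(p) of an item with occurrence probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)


def shared_information_similarity(p_i: float, p_j: float, p_shared: float) -> float:
    """Lin-style similarity from three probabilities.

    p_i, p_j  -- probabilities of the two items (playing the role of H(Wi), H(Wj))
    p_shared  -- probability of what the two items have in common
                 (playing the role of Ha(Wi, Wj)); hypothetical input
    """
    ic_sum = information_content(p_i) + information_content(p_j)
    if ic_sum == 0.0:
        # Both items carry no information; treat them as identical.
        return 1.0
    return 2.0 * information_content(p_shared) / ic_sum
```

When the two items coincide, the shared probability equals each item's own probability and the ratio is exactly 1; when the only thing they share is certain to occur (probability 1), the similarity is 0.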
Based on the above calculation, the flow chart of the language structure depth analysis algorithm is shown in Figure 3.
3.1.2. Trigger Analysis of Node Distance
To improve the accuracy of the semantic similarity algorithm, the distance between semantic nodes of different texts is introduced as an auxiliary calculation. Different texts are arranged in one or more network structure diagrams according to set organization rules, and the shortest connection distance between different texts is found in this structure diagram [27–29]. The more edges the path between two texts traverses, the longer the semantic distance between the two nodes and the lower the similarity between the two texts; the fewer edges on the path, the higher the similarity. The flow chart of the trigger analysis based on node distance is shown in Figure 4.
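The shortest-path idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy concept network TOY_GRAPH, the function names, and the mapping from edge count to similarity, 1 / (1 + edges), are all assumptions chosen only so that fewer edges give a higher score.

```python
from collections import deque

# Hypothetical undirected concept network: node -> neighbouring nodes.
TOY_GRAPH = {
    "entity": ["animal", "plant"],
    "animal": ["entity", "dog", "cat"],
    "plant": ["entity", "tree"],
    "dog": ["animal"],
    "cat": ["animal"],
    "tree": ["plant"],
}


def shortest_path_edges(graph, start, goal):
    """Breadth-first search: number of edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """Assumed mapping from distance to similarity: 1 / (1 + edges), 0.0 if unconnected."""
    edges = shortest_path_edges(graph, a, b)
    if edges is None:
        return 0.0
    return 1.0 / (1.0 + edges)
```

In the toy network, "dog" and "cat" are two edges apart and score 1/3, while "dog" and "tree" are four edges apart and score 0.2, so the ordering agrees with the rule stated above.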
For information of the special data resource description framework type, the weights of interconnection edges are used for impact analysis [8, 30]. In practice, the main factors that affect the weight of an interconnection edge are the depth of the layer in which the edge lies, the density of edges in its neighborhood, and the out-degree of the corresponding node. The core calculation is given by formula (4), where Wi and Wj again represent different data texts.
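Formula (4) does not appear in this text, so the way the three factors are combined is unknown. Purely to make the roles of the factors concrete, the Python sketch below multiplies three decreasing terms, one each for depth, local edge density, and out-degree. The functional form, the parameters alpha and beta, and the name edge_weight are hypothetical and should not be read as the authors' formula.

```python
import math


def edge_weight(depth: int, local_density: float, out_degree: int,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """Illustrative edge weight built from the three factors named in the text.

    Assumptions (not from the source):
      * deeper edges link more specific concepts, so they weigh less (1 / depth);
      * a denser neighbourhood means finer distinctions, so the edge weighs less;
      * a larger out-degree of the parent node likewise lowers the weight.
    """
    if depth < 1:
        raise ValueError("depth must be >= 1 (1 = edges leaving the root)")
    if local_density < 0 or out_degree < 0:
        raise ValueError("local_density and out_degree must be non-negative")
    depth_term = 1.0 / depth
    density_term = 1.0 / (1.0 + alpha * local_density)
    degree_term = 1.0 / (1.0 + beta * math.log1p(out_degree))
    return depth_term * density_term * degree_term
```

Any monotone combination of the three factors would serve the same illustrative purpose; the product form is chosen here only because each factor can then be inspected in isolation.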
Based on formula (4), the semantic distance between data text nodes for special data types is calculated as shown in the following equation:
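The equation itself does not appear in this text. One common way to turn edge weights into a semantic distance, assumed here for illustration only, is to sum the weights of the edges on the unique tree path between two nodes, passing through their lowest common ancestor. The data layout (a parent map plus a weight for the edge from each node to its parent) and all names below are hypothetical.

```python
# Hypothetical concept tree: node -> parent (the root maps to None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}

# Hypothetical weight of the edge from each node up to its parent.
WEIGHT = {"animal": 1.0, "plant": 1.0, "dog": 0.5, "cat": 0.5, "tree": 0.5}


def path_to_root(parent, node):
    """Nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def weighted_distance(parent, weight, a, b):
    """Sum of edge weights on the tree path between a and b.

    Assumes both nodes belong to the same single-rooted tree.
    """
    ancestors_b = set(path_to_root(parent, b))
    # Lowest common ancestor: first node on a's upward path that is also above b.
    lca = next(n for n in path_to_root(parent, a) if n in ancestors_b)
    total = 0.0
    for start in (a, b):
        node = start
        while node != lca:
            total += weight[node]
            node = parent[node]
    return total
```

With these toy weights, the distance between "dog" and "cat" is 1.0 and the distance between "dog" and "tree" is 3.0, so siblings under a deep node end up closer than nodes that meet only at the root.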
To avoid the heavy reliance of traditional node frequency statistics on a large-scale database, the design uses a text data domain model. The unified calculation process is as follows:
Step 1: establish a tree model of the data text; the model has one and only one root node
Step 2: assume that nodes in the tree are related only by upper-lower (hypernym-hyponym) relations and that there is only one path between any two nodes
Step 3: except for the root node, each node has exactly one parent node
Step 4: count all the child nodes
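The four steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the tree is supplied as (parent, child) pairs, "count all the child nodes" is interpreted as counting every descendant below a node, and the names build_tree and count_descendants are hypothetical.

```python
from collections import defaultdict


def count_descendants(children, node):
    """Step 4: number of nodes strictly below `node` in the tree."""
    return sum(1 + count_descendants(children, c) for c in children.get(node, ()))


def build_tree(edges):
    """Steps 1-3: build child lists from (parent, child) pairs and check the constraints.

    Raises ValueError if a node has two parents, if there is not exactly
    one root, or if some node cannot be reached from the root.
    """
    children = defaultdict(list)
    parent = {}
    nodes = set()
    for p, c in edges:
        if c in parent:
            raise ValueError(f"{c!r} has more than one parent")  # Step 3
        parent[c] = p
        children[p].append(c)
        nodes.update((p, c))
    roots = [n for n in nodes if n not in parent]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")  # Step 1
    if 1 + count_descendants(children, roots[0]) != len(nodes):
        raise ValueError("some nodes are not reachable from the root")  # Step 2
    return roots[0], dict(children)
```

With a single root and a single parent per node, the path between any two nodes is unique, which is what allows descendant counts to stand in for corpus frequency statistics in this kind of model.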
3.2. Modeling Process of Data Information Based on Linguistic Conceptual Structure
Building on the improved semantic similarity algorithm, the paper also designs a data information association model between texts. The design considers two levels: the logical level and the semantic level of the relationships between different texts. The logical level mainly considers the AND, OR, and NOT relations between information texts, and the semantic level mainly considers how different data texts are related semantically. The process of establishing the data information association model is shown in Figure 5; the main associations are constructed at the logical level and the semantic level. At the logical level, this paper checks the attribute values of different texts and establishes associations between texts with the same attribute value; that is, texts with the same attribute are given the same identifier. At the semantic level, this paper constructs or adds semantic relationships between different texts.
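The two levels of association described above can be illustrated with a small Python sketch. It is an assumption-laden toy rather than the model in Figure 5: records are plain dictionaries, the shared attribute value itself serves as the common identifier, semantic relations are stored as (subject, predicate, object) triples, and the function names are hypothetical.

```python
from collections import defaultdict


def link_by_attribute(records, attribute):
    """Logical level: group the ids of records that share a value for `attribute`.

    The shared value acts as the common identifier for the group.
    """
    groups = defaultdict(list)
    for record_id, fields in records.items():
        if attribute in fields:
            groups[fields[attribute]].append(record_id)
    # Keep only values that actually connect two or more records.
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


def add_semantic_relation(triples, subject, predicate, obj):
    """Semantic level: record an explicit relation between two texts."""
    triples.add((subject, predicate, obj))
    return triples
```

For example, two records with the same author value fall into one group at the logical level, while a relation such as ("r1", "sameTopicAs", "r3") has to be added explicitly at the semantic level because no attribute value reveals it.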
4. Experiment and Analysis
To verify the algorithm, this section uses a large knowledge base as the basis for experimental simulation. The experiment mainly verifies the calculated similarity values. Three groups of words, covering three types of words, are used to test the algorithm. The words and the similarity results are shown in Table 1, and the line graphs obtained from further experiments are shown in Figures 6(a)–6(c). The figures show that the algorithm in this paper is more consistent with basic human cognition when judging the similarity between different texts.
Table 1 shows that the algorithm in this paper outperforms the traditional algorithm for nouns, adjectives, and verbs, although the difference in accuracy for verb similarity is small; improving verb similarity is a direction for further research.
A similarity simulation experiment is also carried out for information of the special data resource description framework type. The experimental object is an information database of a certain data resource description framework type, and the information node attributes are author, nationality, classification number, and publishing year. To show how the algorithm performs on such data, two other common algorithms, the RO algorithm and the H algorithm, are selected for comparison. Eight types of books are selected as the special type of data, and the calculated similarities are shown in Table 2.
The similarity line graphs are shown in Figures 7(a)–7(c). The graphs show more intuitively that the algorithm in this paper retains its advantage in similarity calculation when dealing with special data types and is more consistent with human judgment.
To sum up, the experimental analysis shows that the algorithm has clear advantages over the traditional semantic similarity algorithm.
5. Conclusion
This paper analyzes current problems in information data processing, reviews the semantic similarity algorithms proposed by other scholars, and points out the problems of these algorithms. To solve the problems of the traditional semantic similarity algorithm, it studies and verifies the high-dimensional sparsity and accuracy problems. Starting from the linguistic conceptual structure, it analyzes the depth of the language structure and the distance between nodes. For information of the special data resource description framework type, the weights of interconnection edges are used for impact analysis, so that semantic similarity can be analyzed across all of the information data. Building on these improvements, this paper also systematically establishes the modeling process of data information based on the linguistic conceptual structure and establishes the corresponding model. In the experimental part, the improved algorithm is simulated and analyzed. The simulation results show that the proposed algorithm is clearly more accurate than the traditional algorithm and that the problem of high-dimensional data sparsity is further alleviated. Future research will further explore the essential meaning of linguistic structure and analyze the semantic similarity algorithm on that basis, so as to improve the accuracy of the algorithm and reduce the algorithm loss.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was supported by the National Social Science Foundation of China: Cross-Language Polysemy Sentence Analysis and Semantic Map Construction Research (18CYY003).
References
- Z. Sun, H. Huo, J. Huan, and J. S. Vitter, “Feature reduction based on semantic similarity for graph classification,” Neurocomputing, vol. 397, no. 20, pp. 114–126, 2020.
- K. Sanguri and A. Bhuyan, “A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain,” Scientometrics, vol. 8, no. 5, pp. 14–26, 2020.
- H. Safaeipour, M. H. F. Zarandi, and S. Bastani, “Using fuzzy ontology to improve similarity assessment: method and evaluation,” International Journal of Intelligent Systems, vol. 32, no. 11, pp. 1167–1186, 2017.
- G. Morota, T. M. Beissinger, and F. Peñagaricano, “MeSH annotation of the chicken genome: MeSH-informed enrichment analysis and MeSH-guided semantic similarity among functional terms and gene products,” Journal of Cell Biology, vol. 107, no. 5, pp. 1901–1909, 2016.
- “Novel metrics for computing semantic similarity with sense embeddings,” Knowledge-Based Systems, vol. 206, no. 4, Article ID 106346, 2020.
- M. J. Hussain, S. H. Wasti, G. Huang et al., “An approach for measuring semantic similarity between Wikipedia concepts using multiple inheritances,” Information Processing & Management, vol. 57, no. 3, pp. 102188.1–102188.19, 2020.
- O. Chergui, A. Begdouri, and D. Groux-Leclet, “Integrating a Bayesian semantic similarity approach into CBR for knowledge reuse in community question answering,” Knowledge-Based Systems, vol. 185, no. 9, pp. 104919.1–104919.13, 2019.
- S. Kim, N. Fiorini, W. J. Wilbur, and Z. Lu, “Bridging the gap: incorporating a semantic similarity measure for effectively mapping PubMed queries to documents,” Journal of Biomedical Informatics, vol. 75, no. 6, pp. 122–127, 2017.
- M. Kulmanov and R. Hoehndorf, “Evaluating the effect of annotation size on measures of semantic similarity,” Journal of Biomedical Semantics, vol. 8, no. 1, pp. 7–17, 2017.
- N. H. Tien, N. M. Le, Y. Tomohiro et al., “Sentence modeling via multiple word embeddings and multi-level comparison for semantic textual similarity,” Information Processing & Management, vol. 55, no. 6, pp. 102090.1–102090.11, 2019.
- K. M. Jozwik, N. Kriegeskorte, and M. Mur, “Visual features as stepping stones toward semantics: explaining object similarity in IT and perception with non-negative least squares,” Neuropsychologia, vol. 83, no. 6, pp. 201–226, 2016.
- Song, Yang, Cai et al., “Pairwise latent semantic association for similarity computation in medical imaging,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 5, pp. 1058–1069, 2016.
- Y. Wang, X. Duan, X. Liu, C. Wang, and Z. Li, “Semantic description method for face features of larger Chinese ethnic groups based on improved WM method,” Neurocomputing, vol. 175, no. 1, pp. 515–528, 2016.
- F. Sakketou and N. Ampazis, “A constrained optimization algorithm for learning GloVe embeddings with semantic lexicons,” Knowledge-Based Systems, vol. 195, no. 7, Article ID 105628, 2020.
- M. Laurent and M. Seminaroti, “Similarity-first search: a new algorithm with application to Robinsonian matrix recognition,” Siam Journal on Discrete Mathematics, vol. 31, no. 3, pp. 1765–1800, 2016.
- R. K. Roul, “An effective approach for semantic-based clustering and topic-based ranking of web documents,” International Journal of Data Science and Analytics, vol. 5, no. 3, pp. 1–16, 2018.
- Y. Ren, Q. Li, W. Liu, L. Li, and W. Guan, “Semantics characterization for eye shapes based on directional triangle-area curve clustering,” Multimedia Tools and Applications, vol. 78, no. 18, pp. 25373–25406, 2019.
- C. Ventura, D. Varas, V. Vilaplana, X. Giro-i-Nieto, and F. Marques, “Multiresolution co-clustering for uncalibrated multiview segmentation,” Signal Processing: Image Communication, vol. 76, no. 5, pp. 151–166, 2019.
- D. Zhang, K. Lee, and I. Lee, “Hierarchical trajectory clustering for spatio-temporal periodic pattern mining,” Expert Systems with Applications, vol. 92, no. 4, pp. 1–11, 2018.
- N. Mohsin and S. Payandeh, “Clustering and identification of key body extremities through topological analysis of multi-sensors 3D data,” The Visual Computer, vol. 30, no. 4, pp. 1–24, 2021.
- Y. Jing and J. Wang, “Tag clustering algorithm LMMSK: improved K-means algorithm based on latent semantic analysis,” Journal of Systems Engineering and Electronics, vol. 28, no. 2, pp. 374–384, 2017.
- L. Yang, J. Gu, and H. Chen, “Clustering algorithm based on semantic distance for XML documents,” Database Technology & Applications First International Workshop on, vol. 36, no. 5, pp. 549–552, 2009.
- Q. Zhang, H. J. Wang, and L. W. Wang, “Short text clustering algorithm combined with context semantic information,” Computer Science, vol. 11, no. 2, pp. 11–19, 2016.
- J. J. Chen and M. F. Liu, “Does the Internet expand the educational gap among different social classes: the protective role of future orientation,” Frontiers in Psychology, vol. 12, p. 1255, 2021.
- J. Yang, Y. Zhao, J. Liu et al., “No reference quality assessment for screen content images using stacked autoencoders in pictorial and textual regions,” IEEE Transactions on Cybernetics, pp. 1–13, 2020.
- W. Wang, Z. Gong, J. Ren et al., “Venue topic model–enhanced joint graph modelling for citation recommendation in scholarly big data,” ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), vol. 20, no. 1, pp. 1–15, 2020.
- B. Yang, X. Cheng, D. Dai, T. Olofsson, H. Li, and A. Meier, “Real-time and contactless measurements of thermal discomfort based on human poses for energy efficient control of buildings,” Building and Environment, vol. 162, Article ID 106284, 2019.
- M. Zhang, W. Jing, J. Lin et al., “NAS-HRIS: automatic design and architecture search of neural network for semantic segmentation in remote sensing images,” Sensors, vol. 20, no. 18, p. 5292, 2020.
- L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, “Semantic SLAM based on object detection and improved octomap,” IEEE Access, vol. 6, pp. 75545–75559, 2018.
- G. Sun, S. Zhao, and J. Meng, “Edge fault tolerance of interconnection networks with respect to maximally edge-connectivity,” Theoretical Computer Science, vol. 758, pp. 9–16, 2019.
Copyright © 2021 Shan Xiao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.