Advanced Pattern Recognition Systems for Multimedia Data
Design of Chinese Corpus Based on Semantic Mining Algorithm
In order to improve the practical effect of the Chinese corpus, this paper combines the semantic mining algorithm to design the Chinese corpus, proposes an ontology adaptive algorithm based on content learning, and conducts in-depth research on the model of the algorithm. Firstly, in view of the heterogeneity of web information structure and the chaotic nature of information organization, this paper proposes a web content extraction method to effectively remove noise information. Moreover, this paper analyzes an ontology-based text content parsing method, which uses the semantic parsing capability of the ontology to improve the semantic parsing capability of the text. Secondly, this paper proposes the BFA algorithm to optimize the BP neural network. The experimental research shows that the Chinese corpus system based on semantic mining proposed in this paper has a good practical effect and meets the actual needs of Chinese teaching and translation.
With the advent of the information age, people are constantly faced with large amounts of electronic information, most of which takes the form of text. This includes structured databases, semistructured web pages, and numerous unstructured electronic texts. How to deal with large-scale text information has become one of the most pressing challenges of human digital existence. Faced with such challenges and opportunities, different disciplines approach the problem from their own perspectives, thereby enriching and improving their theoretical systems. At the same time, researchers can take advantage of the conditions provided by this new stage of development to open up new disciplinary directions. Corpus linguistics is a linguistic discipline that emerged in such an era.
The design of a corpus is a crucial step in its construction and directly affects the quality and usability of the corpus. Building on the special-purpose corpora already created by domestic scholars, and on the objective reality and practical needs of ordinary teachers and students, we build a teaching corpus whose main goals are to be practical, novel, convenient, and open.
The construction of a corpus is a huge project that is time-consuming and labor-intensive and requires substantial human and material resources. Existing domestic and foreign corpus resources and corpus software are relatively rich, but because of insufficient openness and sharing, many of these resources remain in the hands of a few researchers or institutions. This greatly restricts the popularization and development of corpus linguistics and also causes serious waste and duplication of resources. The design of the corpus in this paper therefore fully considers openness and sharing. It is planned to develop retrieval software based on the WIKI platform that runs on the internet. On the one hand, the WIKI platform enables resource co-construction: it mobilizes peers to participate in collecting, sorting, screening, and manually annotating the corpus, supports dynamic updating and real-time supplementation, and expands the capacity of the corpus as much as possible while ensuring its quality, which makes the corpus more representative. On the other hand, resource sharing is realized by running retrieval on the internet.
A self-built bilingual parallel corpus for teaching has a clear collection goal, which helps improve the efficiency of database construction. In addition, it can run on a local computer in the form of web pages, so retrieval and updating are relatively unconstrained and teachers and students can accumulate and retrieve material to the greatest extent. A self-designed corpus can also be tailored to the actual teaching situation: its requirements are flexible, and it can be updated at any time without restrictions on operation authority. Because of copyright restrictions or high maintenance and management costs, many large-scale corpora are not open to the public or free for researchers to use, whereas self-built corpora are subject to relatively few restrictions and are practical for language learners.
Corpora can be divided into speech corpora and text corpora. A spoken language corpus is a speech corpus. The speech it contains can be read speech, prepared fluent speech, or natural speech. Speech corpora have a wide range of applications in language teaching, language research, multilingual communication, information services, speech recognition, speaker recognition, and the preservation and development of endangered languages and oral culture. The international community attaches great importance to the construction and development of speech corpora. For example, the Linguistic Data Consortium (LDC) at the University of Pennsylvania has released many corpora of different scales.
The key to protecting and promoting human linguistic and cultural diversity lies in the continuous learning and inheritance of all languages and cultures. From this perspective, a spoken dialect corpus should first be able to provide dialect communities with the material they need for learning, disseminating, and passing on their native dialect. This is a basic requirement. Second, the spoken dialect corpus should support linguistic research on the dialect and promote theoretical and practical innovation in Chinese dialectology and general linguistics. The vitality of a dialect lies in natural discourse, and so do its profundity and subtlety. If dialectology cannot investigate and study discourse comprehensively and in depth, it cannot achieve real theoretical and practical innovation or form its own disciplinary character, and it will remain a vassal of phonology. In fact, for decades, dialect studies have not broken through the confines of surveying and recording characters, words, and sentences and analyzing phonology, and few researchers have spent effort surveying and recording natural discourse. As a result, discourse corpora are very scarce, discourse research on dialects is almost blank, and dialect research tends to see the trees but not the forest.
This is the weakness of Chinese dialect studies. Therefore, a spoken Chinese dialect corpus should provide rich discourse material for articulatory phonetics, acoustic phonetics, auditory phonetics, word phonetics, and phonological analysis of dialects, as well as for research on dialect grammar (syntax), pragmatics, discourse analysis, prosody analysis, semantic relations, and speech act analysis.
In order to serve a wide range of linguistic research purposes, spoken corpora should reflect dialect facts and dialect practices as comprehensively as possible. The corpus should also support comprehensive comparative research on the national common language and dialects and provide a reference for the scientific formulation of national language planning and language policy. The natural spoken dialect corpus should further meet the basic needs of speech engineering applications. An example is a basic test corpus for language communication, which can serve as the basic analysis corpus for speech synthesis or as the test corpus for speech recognition or speaker recognition. In addition, the corpus should provide enough material to extract distinctive language features, such as characteristic pronunciations, characteristic words, characteristic grammar, and discourse markers. At the same time, the corpus provides a resource guarantee for maintaining the security of national language information. Finally, the natural spoken dialect corpus should meet the basic needs of cultural, anthropological, and artistic research on regional dialect materials and knowledge, so that it becomes a repository of traditional folk culture. Such a corpus can also help the public understand the national language situation, popularize language knowledge, and raise public awareness of the use of language resources. Moreover, it can make the public aware of the relationship between linguistic diversity and cultural diversity and promote the dissemination of excellent traditional and regional culture.
The core requirement of a natural spoken language corpus is that its material must be natural speech, that is, spontaneous utterances that are not read from or otherwise based on a text. Most regional dialects exist in the form of natural discourse. Judging from the spoken language corpora available so far (including those for Chinese dialects and ethnic languages), their naturalness is insufficient, their coverage of genres and language varieties is relatively narrow, and the content of the discourse tends to be predesigned or limited to script-guided themes. Although this type of discourse also reflects linguistic facts and practices to a certain extent, it is only quasi-natural discourse. Natural discourse is not necessarily simple, repetitive, incoherent, or random. Whether discourse is natural is closely related to the language situation and language behavior, and natural discourse encompasses a variety of genres and styles.
In order to improve the effect of Chinese teaching, this paper combines the semantic mining algorithm with the design of the Chinese corpus, so as to advance the construction and development of the Chinese database and to promote Chinese teaching and the dissemination of Chinese.
2. Data Semantic Mining of the Chinese Corpus
2.1. Research on Ontology Adaptive Algorithm Based on Content Learning
Because ontologies have powerful semantic description ability, applying an ontology in a topic crawler can address the poor semantic parsing ability of current topic crawlers and their inability to handle polysemy.
The internet is currently the largest open data resource warehouse. Because of the openness of the internet environment, the diversity of participants, and the heterogeneity of resource information, the content structure of web pages is messy and follows no uniform format. It is therefore necessary to preprocess the content of web pages before any subsequent analysis.
In this paper, a comparative analysis method of DOM (Document Object Model) structure is proposed to extract web content information. The overall flow of the algorithm is shown in Figure 1.
This article uses the HtmlParser tool to parse web documents. HtmlParser can automatically complete missing HTML tags and parse web documents into DOM trees.
Figure 2 provides an example of the pruning process, which compares nodes at similar positions in the DOM trees. In Step 2, the dashed part marks the structure that the two nodes share; because the nodes also contain a differing structure (the dark part), they are not the same. In Step 3, the two nodes have the same structure, so a pruning operation is performed. This comparison is repeated until all subtrees with the same structure have been found and pruned.
The information extraction algorithm based on pruning technology effectively improves the accuracy of topic information extraction and reduces the influence of noise information on the calculation of web topic relevance.
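The comparison-and-pruning step can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes each DOM node is reduced to a tag, its text, and its children, and that a subtree which is identical at the same position in another page of the same site is template noise. The `Node` class and the `signature` and `prune` helpers are hypothetical names and are not part of HtmlParser.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified DOM node (assumption: tag, own text, ordered children)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def signature(node: Node) -> tuple:
    """Recursive fingerprint of a subtree: tag, text, and child fingerprints."""
    return (node.tag, node.text, tuple(signature(c) for c in node.children))


def prune(page: Node, reference: Node) -> Optional[Node]:
    """Return a copy of `page` without the subtrees that also occur, at the
    same position, in `reference` (assumed to be a page from the same site)."""
    if signature(page) == signature(reference):
        return None  # identical subtree: treat it as noise and prune it
    kept = []
    for i, child in enumerate(page.children):
        if i < len(reference.children):
            # Compare nodes at similar positions in the two trees.
            pruned = prune(child, reference.children[i])
            if pruned is not None:
                kept.append(pruned)
        else:
            kept.append(child)  # no counterpart to compare against
    return Node(page.tag, page.text, kept)
```

For two pages that share a navigation block but differ in their article paragraph, `prune` keeps only the article paragraph of the first page.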
Although the pruning process of each page improves the accuracy of extracting the topic information of the web page, it will reduce the performance of the topic crawler at the same time. On the basis of the above algorithm, in order to further improve the performance of the crawler system, the pruning path is saved. The pruning template is defined as follows.
Definition 1. Pruning Template. We define a template as a triple (id, path, xpath). Here, id stores the serialized form of path and is used to identify the template; MD5 is usually used for the serialization. path is a subsequence of the url, obtained by removing the file name and parameters, and represents the path information of the current template. xpath represents the path within the web page document.
Current web page structures are complex, and the comparison operation based on the DOM structure takes a lot of time. In this case, using the pruning template in Definition 1 can effectively reduce the time cost. The path of the current page is serialized using MD5, and the result is compared with the ids of the existing templates. If a template with the same id exists, that template is used for pruning; otherwise, the DOM structures are compared.
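A minimal sketch of the template lookup described above, using only the Python standard library. The helper names (`template_path`, `template_id`, `lookup_or_learn`) and the choice to include the host name in the path are assumptions made for illustration; the `learn_xpaths` callback stands in for the full DOM comparison.

```python
import hashlib
from urllib.parse import urlsplit


def template_path(url: str) -> str:
    """Host plus directory part of the URL, without file name or parameters."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.netloc + directory


def template_id(url: str) -> str:
    """MD5 hex digest of the template path, used as the template key."""
    return hashlib.md5(template_path(url).encode("utf-8")).hexdigest()


def lookup_or_learn(url, templates, learn_xpaths):
    """Return (xpaths, reused). `templates` maps template id -> list of XPath
    strings to prune; `learn_xpaths` is a hypothetical callback that runs the
    expensive DOM comparison when no template exists yet."""
    key = template_id(url)
    if key in templates:
        return templates[key], True   # reuse the saved pruning paths
    xpaths = learn_xpaths(url)        # fall back to DOM comparison
    templates[key] = xpaths
    return xpaths, False
```

Two pages in the same directory, such as `/news/2020/a.html?id=3` and `/news/2020/b.html`, map to the same template id, so the DOM comparison runs only for the first of them.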
In the ontology hierarchy tree, a semantic relationship can hold between two concepts. The three most common relationships are the synonym relationship (two such nodes are usually located at the same level of the tree), the inheritance relationship (two such nodes usually stand in a parent-child relationship), and the whole-part relationship (two such nodes usually stand in an ancestor-descendant relationship). The semantic relationship between two nodes is represented by a relation symbol, with separate values for the synonym relationship, the inheritance relationship, and the whole-part relationship between the two concepts.
In the ontology hierarchy tree, concepts are organized from top to bottom in order of decreasing amount of information. Top-level concepts contain more information, whereas lower-level concepts contain less, and the weight of the edges between concepts decreases accordingly, as expressed by formula (1). In formula (1), the level term represents the level of the edge between concept i and concept j in the ontology hierarchy tree. It can be seen from the formula that the larger this level is, the more specific the meaning of the node and the more detailed its information, so the amount of information it contains is smaller and the edge weight is also smaller.
In the ontology hierarchy tree, the degree of node refinement has an important influence on the concept weight. The higher the degree of refinement of a node, the greater the amount of information the node contains to a certain extent and the greater its conceptual weight. The degree of refinement of a node can be approximated by the number of its subtrees, and the node density is calculated using formula (2). In formula (2), one term represents the number of subtrees of node i and the other represents the maximum number of subtrees of any node in the tree. It can be seen from the formula that the more subtrees node i contains, the more information it carries and the greater the weight of node i.
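As an illustration of the node density idea, the sketch below assumes that density is the ratio between the number of child subtrees of a node and the largest such number in the tree. This ratio form is an assumption chosen to match the description of formula (2), not a transcription of it, and the toy ontology (including the "villa" and "apartment" concepts) is hypothetical.

```python
def node_density(children, node):
    """Ratio of a node's child-subtree count to the largest count in the tree.

    `children` maps each concept to the list of its direct child concepts;
    leaf concepts may be omitted from the mapping.
    """
    largest = max((len(kids) for kids in children.values()), default=0)
    if largest == 0:
        return 0.0
    return len(children.get(node, ())) / largest


# Hypothetical fragment of a real estate ontology.
ontology = {
    "real estate": ["ordinary residence", "villa", "apartment"],
    "ordinary residence": ["house type"],
}
```

Under this assumption, a node with many subtrees receives a density close to 1, while a leaf concept receives 0.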
The above measures are combined to measure the conceptual distance similarity, as expressed by formulas (3) and (4).
The depth of a node reflects, to a certain extent, the amount of information the node contains. When calculating the similarity between two nodes, the larger the depth difference between the two nodes in the ontology hierarchy tree, the smaller their conceptual similarity, which is expressed by formula (5). In formula (5), the depth term represents the depth of node i in the ontology hierarchy tree, where the depth of the root node is 1. From formula (5), it can be seen that a large depth difference between two nodes means a low similarity between them, which is in line with the actual situation: the nodes in the ontology hierarchy tree are organized from top to bottom according to the amount of conceptual information, so a large depth gap implies a large gap in information content. For example, in the hierarchy tree of a real estate information ontology, real estate is the root node, ordinary residence is a high-level node, and house type is a node on the next level. The node similarity relationship among the three can be expressed as Sim(real estate, house type) < Sim(ordinary residence, house type).
The semantic coincidence of concepts refers to the semantic similarity of two concepts, but the semantics of a concept cannot be judged directly. An ontology has powerful semantic description ability and organizes the concepts of a domain from top to bottom according to the relationships between them. Therefore, the semantic relationship can be approximated by the paths of the concepts in the ontology hierarchy tree: the path from the root node to each of the two concept nodes is found, each path is represented by its sequence of node numbers, and the number of nodes shared by the two paths is compared with the length of the longer path. When the paths of the two nodes coincide to a high degree, the similarity between the two concepts is large. This is expressed by formula (6). In formula (6), each of the two sets contains the nodes on the path from the root node to one of the two concept nodes. Intersecting the two sets yields the nodes they have in common. The more identical nodes the two sets contain, the higher the semantic coincidence of the two nodes and the greater their conceptual similarity.
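The path-overlap measure described above can be sketched as follows. The sketch assumes the ratio is the number of shared path nodes divided by the length of the longer root path, which follows the wording of the text; the function names and the toy ontology (including the "villa" concept) are hypothetical.

```python
def path_to_root(node, parent):
    """Nodes on the path from the root down to `node` (root first)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


def semantic_coincidence(a, b, parent):
    """Shared nodes of the two root paths divided by the longer path length."""
    path_a = path_to_root(a, parent)
    path_b = path_to_root(b, parent)
    shared = set(path_a) & set(path_b)
    return len(shared) / max(len(path_a), len(path_b))


# Hypothetical child -> parent map for a real estate ontology.
parent = {
    "ordinary residence": "real estate",
    "house type": "ordinary residence",
    "villa": "real estate",
}
```

Here "house type" and "ordinary residence" share two of three path nodes, whereas "house type" and "villa" share only the root, so the first pair receives the higher coincidence.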
On the basis of the above factor analysis, the similarity between two concepts is calculated by combining the characteristics of each factor. The comprehensive calculation model is shown in formula (7). In formula (7), three weighting factors, which sum to 1, are applied to the conceptual distance similarity, the node depth similarity, and the semantic coincidence, respectively. Because the model considers the influence of all these factors, it reflects the relationship between concepts more comprehensively.
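A minimal sketch of such a weighted combination, assuming a plain weighted sum of the three factor scores with weights that sum to 1. The default weight values are placeholders chosen for illustration and do not come from the paper.

```python
def combined_similarity(distance_sim, depth_sim, coincidence,
                        weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three factor similarities.

    `weights` are hypothetical values for the three weighting factors;
    they are required to sum to 1.
    """
    alpha, beta, gamma = weights
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    return alpha * distance_sim + beta * depth_sim + gamma * coincidence
```

Because the weights sum to 1, the combined score stays in the same range as the individual factor scores.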
2.2. Ontology Adaptive Algorithm Based on Content Learning
The content learning-based ontology adaptive algorithm can perform statistical learning on the topic-related information captured by the topic crawler. First, it preprocesses the crawled web page information, extracts relevant feature information, uses the learning model to organize knowledge, and optimizes the conceptual structure in the ontology topic model. The model structure of the adaptive learning algorithm is shown in Figure 3.
In the ontology knowledge base, the ontology concepts are organized in a hierarchical form. In order to unify the expression form, the ontology hierarchy tree is formally defined according to the definitions of the concepts in the previous section.
Definition 2. Ontology Hierarchy Tree. The ontology hierarchy tree is represented as a triple (C, R, H), in which C represents the concept set of the ontology hierarchy tree, R represents the concept relation set, and H represents the function set. Among these functions, one is a class function and another expresses that one concept is a subset of another.
For the information in the metadata database, the word set is matched against the ontology: each word in the metadata is searched for in the ontology hierarchy tree. If the word exists in the ontology hierarchy tree, it is marked as ontology concept vocabulary; if there is no match, it is marked as learning vocabulary. After the matching algorithm has run, the original word set is thus divided into two word sets, namely, the ontology concept vocabulary set and the learning vocabulary set. The processing of the algorithm is shown in Figure 4. The dots in the figure represent the words in the word set, the tree structure on the right represents the ontology hierarchy tree, the nodes in the tree represent the concepts in the ontology hierarchy tree, and the edges represent the relationships between the concepts. After matching, two word sets are obtained.
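The matching step can be sketched in a few lines. This is an illustrative sketch that assumes exact string matching between words and concept names; the function name `split_vocabulary` is hypothetical.

```python
def split_vocabulary(words, ontology_concepts):
    """Split `words` into (ontology concept vocabulary, learning vocabulary).

    A word counts as concept vocabulary if it matches a concept name in the
    ontology hierarchy tree exactly; every other word is learning vocabulary.
    """
    concepts = set(ontology_concepts)
    concept_words, learning_words = [], []
    for word in words:
        if word in concepts:
            concept_words.append(word)
        else:
            learning_words.append(word)
    return concept_words, learning_words
```

A real system would probably normalize the words first (for example, by case folding or synonym lookup) before matching.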
In this paper, the Skip-gram model is introduced to process the text information: the words in the text are represented as vectors according to their context. Given a sequence of training words, Skip-gram maximizes the average log probability of the context words, as expressed by formula (8). Here, T represents the training context, that is, the corpus, and a second term represents the total number of words. The conditional probability in formula (8) is calculated using a softmax regression, as shown in formula (9). Here, the two vectors are the input and output vector representations of a word, and the remaining term is the number of words in the vocabulary. The calculation time of formula (9) is proportional to the vocabulary size, which must be taken into account in large-scale corpus calculations.
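For orientation, the sketch below follows the standard Skip-gram formulation from the word2vec literature, in which $p(w_O \mid w_I)$ is a softmax over the dot products of output vectors with the input vector of the center word, and the objective is the average log probability of the context words within a window. It is assumed, not confirmed, that formulas (8) and (9) take this form. The toy vectors and function names are hypothetical, and no training is performed.

```python
import math


def softmax_probability(center, context, v_in, v_out):
    """p(context | center) under a full softmax over the vocabulary."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = {w: dot(vec, v_in[center]) for w, vec in v_out.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    denom = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[context] - top) / denom


def average_log_probability(tokens, window, v_in, v_out):
    """Average over positions of the summed log p(context word | center word)."""
    total = 0.0
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            total += math.log(softmax_probability(center, tokens[t + j], v_in, v_out))
    return total / len(tokens)
```

The denominator loops over every word in the vocabulary, which makes the linear dependence of the cost on the vocabulary size explicit.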
After the vectorized representation of the words has been obtained, each word is mapped into an N-dimensional vector space. The relationship between words can be represented by the distance or the angle between their vectors, and unsupervised machine learning algorithms can also be used to classify the vectorized words. Since distributed word vectors do not suffer from the sparsity problem, the vector space model is used in this paper to calculate the similarity between two words, as shown in formula (10). The similarity calculation is performed between the words in the ontology concept vocabulary set and the words in the learning vocabulary set, the maximum similarity value is calculated for each word in the concept vocabulary set, and a result set is obtained.
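The matching between the two vocabulary sets can be sketched as follows. The sketch assumes cosine similarity as the vector space measure, which is the usual choice but is not confirmed by the surviving text; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_matches(concept_vecs, learning_vecs):
    """For each concept word, the most similar learning word and its score."""
    result = {}
    for concept, c_vec in concept_vecs.items():
        scored = ((word, cosine(c_vec, l_vec)) for word, l_vec in learning_vecs.items())
        result[concept] = max(scored, key=lambda pair: pair[1])
    return result
```

The returned dictionary plays the role of the result set: one best-matching learning word per ontology concept word.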
2.3. Research on the Topic Crawler Relevance Algorithm Based on BFA
The content of the web page is extracted to obtain its text. After word segmentation and noise filtering, the text word set is obtained, and word frequency statistics are computed over the web page text. The processing is shown in Figure 5.
Figure 5 shows the processing work of each step. After text preprocessing, a web page text word set is obtained and the text word set is defined.
Definition 3. Web Page Text Word Set. The web page text word set is represented as a two-tuple (W, S). Here, W represents the set of words in the web page text vocabulary, whose i-th word is denoted w_i, and S represents the statistical information set of the vocabulary. S is itself a two-tuple (F, Q), where F represents the word frequency information set of the words and Q represents the weight set of the words. The word frequency of w_i is f_i, and its weight is q_i.
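Definition 3 can be illustrated with a small sketch that builds the word set from already segmented tokens. Using the relative frequency as the word weight is an assumption made for the example, since the text does not say how the weights are computed; the function name `build_word_set` is hypothetical.

```python
from collections import Counter


def build_word_set(tokens, stop_words):
    """Return (W, (F, Q)) for a list of segmented tokens.

    W: sorted list of distinct words after stop-word filtering
    F: word -> frequency
    Q: word -> weight (assumption: relative frequency)
    """
    kept = [t for t in tokens if t not in stop_words]
    counts = Counter(kept)
    total = sum(counts.values())
    words = sorted(counts)
    freq = {w: counts[w] for w in words}
    weight = {w: counts[w] / total for w in words}
    return words, (freq, weight)
```

For Chinese text, the tokens would come from a word segmenter applied before this step.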
The artificial fish swarm algorithm has the advantages of simple structure, fast convergence, and efficient optimization. However, it also has disadvantages. Artificial fish exhibit random behaviors, and too much random behavior slows the convergence of the algorithm and degrades its performance. At the same time, the optimization results depend strongly on the initialization parameters Visual and Step. When Visual and Step are large, the algorithm has excellent global optimization ability but is poor at local optimization, and the artificial fish linger near the optimal value, resulting in oscillation. When Visual and Step are small, the algorithm has excellent local optimization ability but weak global optimization ability, so it converges slowly and easily falls into a local optimum.
Figure 6(a) shows the oscillation of the artificial fish during the optimization process. The star points represent the optimal value, and the circle points represent the artificial fish. When an artificial fish finds the optimal value within its field of view, it moves in the direction of the optimal value. If the moving distance parameter Step is too large, the artificial fish wanders back and forth near the optimal value, which reduces the accuracy of the algorithm and increases the optimization time.
Aiming at the above phenomenon, this paper proposes a dynamic factor that modifies the artificial fish parameters Visual and Step during the run. In the early stage of the algorithm, larger values of Visual and Step are set at initialization, so that the artificial fish has excellent global optimization ability. As the number of iterations increases, the values of Visual and Step are adjusted using a weight factor. Moreover, this paper sets thresholds for Visual and Step and standardizes the parameters accordingly, so that the artificial fish behaves as shown in Figure 6(b).
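One way to realize such a dynamic factor is sketched below. The exponential decay and the lower thresholds are assumptions chosen for illustration; the text does not preserve the actual update rule, and all parameter names and default values are hypothetical.

```python
def decayed_parameters(visual0, step0, iteration,
                       decay=0.9, visual_min=0.1, step_min=0.01):
    """Shrink Visual and Step as iterations progress, bounded from below.

    visual0, step0: large initial values that favour global search
    decay:          hypothetical weight factor applied once per iteration
    visual_min, step_min: hypothetical thresholds that keep local search alive
    """
    factor = decay ** iteration
    visual = max(visual0 * factor, visual_min)
    step = max(step0 * factor, step_min)
    return visual, step
```

Early iterations keep the large initial values for global search, while later iterations shrink Step so that a fish near the optimum no longer overshoots it.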
The numbers of neurons in the input layer and the intermediate layer are normalized, as shown in formulas (11) and (12). In these formulas, |I| and |M| represent the number of neurons in the input layer and the intermediate layer, respectively, and the remaining terms likewise refer to the neuron counts of the two layers.
The accuracy of the BP neural network is measured by the error E, which represents how well the BP neural network fits the information in the data set. When E is relatively large, the network fits the data poorly and the corresponding accuracy is low. The objective function is shown in formula (13), which contains adjustment parameters. The calculation of the error value E is shown in formula (14), in which the two terms are the actual output value of the i-th neuron in the output layer and the corresponding expected output value.
The parameter Visual in the artificial fish swarm algorithm represents the field of view of an artificial fish. To judge whether one artificial fish is visible to another, the distance between them needs to be calculated, as shown in formula (15).
Figure 7 shows the structure design process of the artificial fish. The BP neural network prototype is analyzed to obtain the optimization objective. In this paper, we mainly optimize the numbers of input layer neurons and intermediate layer neurons in the BP neural network while maintaining the accuracy. After the optimization objective is determined, it is parameterized to obtain the artificial fish F, and the artificial fish code is constructed. Before the artificial fish performs structure optimization, the artificial fish swarm must be initialized and the artificial fish mapped into a high-dimensional space. The optimization space in this paper is the three-dimensional space (x, y, E).
The above work completes the initialization of the artificial fish swarm.
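The two quantities used above can be sketched as follows, assuming the common half sum of squared errors for E and the Euclidean distance between fish in the (x, y, E) space. Both forms are assumptions, since the original formulas are not preserved in the text; the function names are hypothetical.

```python
import math


def output_error(actual, expected):
    """Half the sum of squared differences over the output neurons
    (assumed form of the error E)."""
    return 0.5 * sum((a - e) ** 2 for a, e in zip(actual, expected))


def fish_distance(fish_a, fish_b):
    """Euclidean distance between two fish given as (x, y, E) points."""
    return math.dist(fish_a, fish_b)


def is_visible(fish_a, fish_b, visual):
    """A fish sees another fish if it lies within its field of view."""
    return fish_distance(fish_a, fish_b) <= visual
```

With this encoding, each artificial fish is a point whose first two coordinates are the normalized layer sizes and whose third coordinate is the error of the corresponding network.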
The artificial fish swarm algorithm simulates the tail-chasing, gathering, and foraging behaviors of fish and searches for the optimal target value in the optimization space. The optimization process is shown in Figure 8.
Figure 8 shows the behaviors of the artificial fish in the optimization process, namely foraging, tail-chasing, and gathering, among which foraging is the basic behavior of the artificial fish.
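The foraging behavior can be sketched as below. This is a simplified illustration for a minimization problem (for example, minimizing the error E): the fish samples positions within its field of view, moves at most Step toward the first better one, and otherwise makes a random move. The sampling scheme, the `tries` parameter, and the function names are assumptions, and the tail-chasing and gathering behaviors are omitted.

```python
import math
import random


def move_toward(position, target, step):
    """Move from `position` toward `target` by at most `step`."""
    dist = math.dist(position, target)
    if dist == 0:
        return list(position)
    scale = min(step, dist) / dist
    return [p + scale * (t - p) for p, t in zip(position, target)]


def forage(position, objective, visual, step, tries=10, rng=random):
    """One foraging move of an artificial fish (lower objective is better)."""
    current = objective(position)
    candidate = list(position)
    for _ in range(tries):
        # Sample a point inside a box of half-width `visual` around the fish.
        candidate = [p + rng.uniform(-visual, visual) for p in position]
        if objective(candidate) < current:
            return move_toward(position, candidate, step)
    # No better point found: move toward the last random sample instead.
    return move_toward(position, candidate, step)
```

Because every move is capped at Step, shrinking Step over the iterations directly limits the oscillation around the optimum described earlier.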
3. Design and Experimental Research of the Chinese Corpus
The platform structure designed for the system includes noise reduction, feature block extraction, directory tree extraction, keyword extraction and word segmentation, word frequency statistics and dictionary compilation, and timely filtering of stop words, as shown in Figure 9.
The structure of the database is shown in Figure 10. It has an administrator system, a Chinese-English bilingual corpus management system, and three corpus systems (vocabulary, long sentence, and chapter). Administrators have user management rights and corpus management rights. The user only has the corpus input and correction rights.
Based on the above analysis, from the perspectives of semantic mining and corpus evaluation, the effect of the Chinese corpus based on the semantic mining algorithm proposed in this paper is verified. Moreover, this paper obtained the experimental results shown in Figures 11 and 12 through the simulation test statistics.
It can be seen from the above research that the Chinese corpus system based on semantic mining proposed in this paper has a good practical effect and meets the actual needs of Chinese teaching and translation.
The corpus is one of the most important resources in contemporary linguistics and natural language processing. On the fertile ground of the corpus, we can carry out empirical linguistic research and explore humanities issues such as vocabulary, grammar, pragmatics, and translation. We can also go the other way and treat the corpus as a loose knowledge base from which the computer learns various kinds of knowledge. In addition, we can regard the corpus as a representative sample of an infinite, continuously generated collection of texts and use it to study the construction and evaluation of automatic language information processing systems. This paper combines the semantic mining algorithm with the design of the Chinese corpus to advance the construction and development of the Chinese database. The research shows that the Chinese corpus system based on semantic mining proposed in this paper has a good practical effect and meets the actual needs of Chinese teaching and translation.
The labeled dataset used to support the findings of this study is available from the author upon request.
Conflicts of Interest
The author declares no conflicts of interest.
This study was sponsored by Henan Polytechnic.